learning Phrase Representations (http://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Phrase_Representations, revised 2015-12-19 by Derek: /* Scoring Phrase Pairs with RNN Encoder–Decoder */)
<hr />
<div>= Introduction =<br />
<br />
In this paper, Cho et al. propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model.<br />
<br />
= RNN Encoder–Decoder =<br />
<br />
The proposed architecture learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence. From a probabilistic perspective, this model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence, e.g. <math>p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)</math>, where one should note that the input and output sequence lengths <math>T</math> and <math>T'</math> may differ.<br />
<br />
<center><br />
[[File:encdec1.png |frame | center |Fig 1. An illustration of the proposed RNN Encoder–Decoder. ]]<br />
</center><br />
<br />
The encoder is an RNN that reads each symbol of an input sequence <math>\mathbf{x}</math> sequentially. As it reads each symbol, the hidden state of the RNN changes. At each time step <math>t</math>, the hidden state <math>h_{t}</math> of the RNN is updated by:<br />
<br />
::<math> h_t=f(h_{t-1},x_t) </math> <br/><br />
<br />
where ''f'' is a non-linear activation function. ''f'' may be as simple as an element-wise logistic sigmoid function or as complex as a long short-term memory (LSTM) unit. After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary <math>\mathbf{c}</math> of the whole input sequence.<br />
<br />
The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol <math>y_t</math> given the hidden state <math>h_t</math>. However, as shown in Figure 1, both <math>y_t</math> and <math>h_t</math> are also conditioned on <math>y_{t-1}</math> and on the summary <math>\mathbf{c}</math> of the input sequence. Hence, the hidden state of the decoder at time <math>t</math> is computed by<br />
<br />
::<math> h_t=f(h_{t-1},y_{t-1},\mathbf{c}) </math> <br/><br />
<br />
and similarly, the conditional distribution of the next symbol is<br />
<br />
::<math> P(y_t|y_{t-1},y_{t-2},\cdots,y_1,\mathbf{c})=g(h_t,y_{t-1},\mathbf{c})</math> <br/><br />
<br />
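The following is a minimal numpy sketch of one encoder step, one decoder step, and the output distribution under these equations; the specific weight matrices and the use of tanh for ''f'' and a softmax for ''g'' are illustrative assumptions rather than the exact parameterization used in the paper.<br />
<pre>
import numpy as np

def encoder_step(h_prev, x_t, W, U):
    # h_t = f(h_{t-1}, x_t); here f is a plain tanh layer (the paper uses a gated unit)
    return np.tanh(W @ x_t + U @ h_prev)

def decoder_step(h_prev, y_prev, c, W, U, C):
    # h_t = f(h_{t-1}, y_{t-1}, c): the decoder state also sees the summary vector c
    return np.tanh(W @ y_prev + U @ h_prev + C @ c)

def output_distribution(h_t, y_prev, c, V, Vy, Vc):
    # g(h_t, y_{t-1}, c): softmax over the target vocabulary
    logits = V @ h_t + Vy @ y_prev + Vc @ c
    e = np.exp(logits - logits.max())
    return e / e.sum()
</pre>
<br />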
The two components of the proposed RNN Encoder–Decoder are jointly trained to maximize the conditional log-likelihood<br />
<br />
::<math> \max_{\mathbf{\theta}}\frac{1}{N}\sum_{n=1}^{N}\log p_\mathbf{\theta}(\mathbf{y}_n|\mathbf{x}_n) </math> <br/><br />
<br />
where <math> \mathbf{\theta}</math> is the set of model parameters and each <math>(\mathbf{y}_n,\mathbf{x}_n)</math> is an (output sequence, input sequence) pair from the training set. Since the output of the decoder, starting from the input, is differentiable, the model parameters can be estimated with a gradient-based algorithm. Once the RNN Encoder–Decoder is trained, the model can be used in two ways: to generate a target sequence given an input sequence, or to score a given pair of input and output sequences.<br />
<br />
==Hidden Unit that Adaptively Remembers and Forgets==<br />
<br />
This paper also proposes a new type of hidden unit that is inspired by the LSTM but is much simpler to compute and implement. Fig. 2 shows a graphical depiction of the proposed hidden unit.<br />
<br />
<center><br />
[[File:encdec2.png |frame | center |Fig 2. An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h˜. The reset gate r decides whether the previous hidden state is ignored. ]]<br />
</center><br />
<br />
Mathematically, the unit is defined as follows (<math>\sigma</math> is the logistic sigmoid function, <math>[.]_j</math> denotes the j-th element of a vector, and <math>\odot</math> denotes element-wise multiplication):<br />
<br />
::<math> r_j=\sigma([\mathbf{W}_r\mathbf{x}]_j+[\mathbf{U}_r\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> z_j=\sigma([\mathbf{W}_z\mathbf{x}]_j+[\mathbf{U}_z\mathbf{h}_{t-1}]_j )</math> <br/><br />
::<math> h_j^{(t)}=z_jh_j^{(t-1)}+(1-z_j)\tilde{h}_j^{(t)}</math> <br/><br />
<br />
where<br />
::<math>\tilde{h}_j^{(t)}=\phi([\mathbf{W}\mathbf{x}]_j+[\mathbf{U}(\mathbf{r}\odot\mathbf{h}_{t-1})]_j )</math> <br/><br />
<br />
In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, thus allowing a more compact representation. On the other hand, the update gate controls how much information from the previous hidden state carries over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit.<ref><br />
Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in optimizing recurrent networks[C]//Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013: 8624-8628.<br />
</ref><br />
<br />
Because each hidden unit has separate gates, it is possible for each hidden unit to learn to capture dependencies over different lengths of time (determined by the frequency at which its reset and update gates are active).<br />
<br />
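The following numpy sketch implements one step of this gated unit for the whole hidden vector; the parameter names mirror the equations above and <math>\phi</math> is taken to be tanh, which is an illustrative choice.<br />
<pre>
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(h_prev, x_t, Wr, Ur, Wz, Uz, W, U):
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # candidate state, phi = tanh
    return z * h_prev + (1.0 - z) * h_tilde         # mix previous and candidate states
</pre>
<br />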
=Scoring Phrase Pairs with RNN Encoder–Decoder =<br />
<br />
In a commonly used statistical machine translation (SMT) system, the goal of the system (the decoder, specifically) is to find a translation <math>\mathbf{f}</math> given a source sentence <math>\mathbf{e}</math>, which maximizes the posterior<br />
<br />
<math>p(\mathbf{f} | \mathbf{e})\propto p(\mathbf{e} | \mathbf{f})p(\mathbf{f})</math><br />
<br />
where the first term on the right-hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model <math>\log p(\mathbf{f} | \mathbf{e})</math> as a log-linear model with additional features and corresponding weights:<br />
<br />
<math>\log p(\mathbf{f} | \mathbf{e})=\sum_{n=1}^Nw_nf_n(\mathbf{f},\mathbf{e})+\log Z(\mathbf{e})</math><br />
<br />
where <math>f_n</math> and <math>w_n</math> are the n-th feature and weight, respectively. <math>Z(\mathbf{e})</math> is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.<br />
<br />
Cho et al. propose to train the RNN Encoder–Decoder on a table of phrase pairs and use its scores as additional features in the log-linear model shown above when tuning the maximum a posteriori SMT decoder. For training the RNN Encoder–Decoder, phrase frequency is ignored for several reasons: to reduce computation time, to ensure the model does not simply rank phrases by frequency, and because frequency information is already encoded in the features of the SMT system (so it is better not to use the capacity of the RNN Encoder–Decoder redundantly).<br />
<br />
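As a concrete illustration, the (unnormalized) log-linear score of a candidate translation is a weighted sum of feature functions, and adding the RNN Encoder–Decoder score amounts to appending one more feature. The sketch below uses hypothetical feature names and values; the weights would be tuned on a development set (e.g. for BLEU).<br />
<pre>
def loglinear_score(features, weights):
    # log p(f | e) is proportional to sum_n w_n * f_n(f, e); Z(e) is constant across candidates
    return sum(weights[name] * value for name, value in features.items())

# hypothetical feature values for one candidate translation
candidate = {"translation_model": -4.2, "language_model": -7.9,
             "rnn_encdec_score": -3.1, "word_penalty": 5.0}
weights = {"translation_model": 1.0, "language_model": 0.6,
           "rnn_encdec_score": 0.8, "word_penalty": -0.2}
print(loglinear_score(candidate, weights))
</pre>
<br />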
=Alternative Models=<br />
The researchers also discuss a number of other potential translation models and their applicability.<br />
<br />
The first model, by Schwenk, applies a variant of the continuous space language model to the task of machine translation. The model is essentially a feedforward neural network with a common projection for input words encoded as bag-of-words vectors. Schwenk fixed the input and output phrase lengths, and for those fixed lengths the network estimates the probability of the output sequence of words, which can be used to score potential translations. A major disadvantage is therefore that the model cannot handle variable-length inputs or outputs.<br />
<br />
The model is illustrated in the following figure<ref><br />
[Schwenk2012] Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Martin Kay and Christian Boitet, editors, Proceedings of the 24th International Conference on Computational Linguistics (COLING), pages 1071–1080.<br />
</ref>:<br />
<br />
[[File:CONTINUOUS.PNG]]<br />
<br />
Another model, similar to Schwenk's, is by Devlin et al. and also uses a feedforward neural network. Rather than estimating the probability of the entire output sequence of words as in Schwenk's model, Devlin et al. estimate only the probability of the next word, using both a portion of the input sentence and a portion of the output sentence. The model reported impressive improvements, but, like Schwenk's, it fixes the input length prior to training.<br />
<br />
Chandar et al. trained a feedforward neural network to learn a mapping from a bag-of-words representation of an input phrase to an output phrase.<ref><br />
Lauly, Stanislas, et al. "An autoencoder approach to learning bilingual word representations." Advances in Neural Information Processing Systems. 2014.<br />
</ref> This is closely related to both the proposed RNN Encoder–Decoder and the model<br />
proposed by Schwenk, except that their input representation of a phrase is a bag-of-words. A similar approach of using bag-of-words representations was proposed by Gao<ref><br />
Gao, Jianfeng, et al. "Learning semantic representations for the phrase translation model." arXiv preprint arXiv:1312.0482 (2013).<br />
</ref> as well. One important difference between the proposed RNN Encoder–Decoder and the above approaches is that the order of the words in source and target phrases is taken into account. The RNN Encoder–Decoder naturally distinguishes between sequences that have the same words but in a different order, whereas the aforementioned approaches effectively ignore order information.<br />
<br />
=Experiments =<br />
<br />
The model is evaluated on the English/French translation task of the WMT’14 workshop. In building the system, Cho et al. used a baseline phrase-based SMT system and a continuous space language model (CSLM)<ref><br />
Schwenk H, Costa-Jussa M R, Fonollosa J A R. Continuous space language models for the IWSLT 2006 task[C]//IWSLT. 2006: 166-173.<br />
</ref><br />
<br />
The following model combinations were tested:<br />
# Baseline configuration<br />
# Baseline + RNN<br />
# Baseline + CSLM + RNN<br />
# Baseline + CSLM + RNN + Word penalty<br />
<br />
The results are shown in Figure 3. The RNN Encoder–Decoder consisted of 1000 hidden units. Rank-100 matrices were used to connect the input to the hidden units. The "word penalty" attempts to penalize words unknown to the neural networks, which is accomplished by using the number of unknown words as a feature in the log-linear model above. <br />
<br />
<center><br />
[[File:encdec3.png |frame | center |Fig 3. BLEU scores computed on the development and test sets using different combinations of approaches. WP denotes a word penalty, which penalizes the number of words unknown to the neural networks. ]]<br />
</center><br />
<br />
The best performance was achieved when both the CSLM and the phrase scores from the RNN Encoder–Decoder were used. This suggests that the contributions of the CSLM and the RNN Encoder–Decoder are not too correlated and that one can expect better results by improving each method independently.<br />
<br />
<br />
== Word and Phrase Representations ==<br />
<br />
As the presented model maps sentences into a continuous space vector and prior continuous space language models have been known to learn semantically meaningful embeddings, one could expect this to happen for the presented model, too. This is indeed the case. When projecting to a 2D space (with Barnes-Hut-SNE), semantically similar words are clearly clustered.<br />
<br />
[[File:Fig4.png]]<br />
<br />
Phrases are also clustered capturing both semantic and syntactic structures.<br />
<br />
[[File:Fig5.png]]<br />
<br />
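As a sketch of how such a 2-D projection can be produced, scikit-learn's TSNE implements the Barnes-Hut approximation used here; the random matrix below is only a stand-in for the learned word or phrase representations.<br />
<pre>
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(500, 100)   # stand-in for learned word/phrase vectors
coords = TSNE(n_components=2, method="barnes_hut").fit_transform(embeddings)
print(coords.shape)                      # (500, 2): one 2-D point per word/phrase
</pre>
<br />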
= References=<br />
<references /></div>

joint training of a convolutional network and a graphical model for human pose estimation (http://wiki.math.uwaterloo.ca/statwiki/index.php?title=joint_training_of_a_convolutional_network_and_a_graphical_model_for_human_pose_estimation, revised 2015-12-19 by Derek: /* Higher-Level Spatial-Model */)
<hr />
<div>== Introduction ==<br />
<br />
Human body pose estimation, or specifically the localization of human joints in monocular RGB images, remains a very challenging task in computer vision. Recent approaches to this problem fall into two broad categories: traditional deformable part models and deep-learning based discriminative models. Traditional models rely on the aggregation of hand-crafted low-level features and then use a standard classifier or a higher-level generative model to detect the pose, which requires the hand-crafted features to be both sufficiently discriminative and invariant to deformations. Deep learning approaches learn an empirical set of low- and high-level features which are more tolerant to variations. However, it is difficult for them to incorporate prior knowledge about the structure of the human body.<br />
<br />
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network Part-Detector and a part-based Spatial-Model. In other words, a deep convolutional neural network is combined with a graphical model in order to capture the spatial dependencies between the variables of interest, which is done using a joint-training process. This combination and joint training significantly outperform existing state-of-the-art models on the task of human body pose recognition.<br />
<br />
== Model ==<br />
=== Convolutional Network Part-Detector ===<br />
<br />
The Part-Detector combines an efficient ConvNet architecture with multi-resolution and overlapping receptive fields, as shown in the figure below.<br />
<br />
[[File:architecture1.PNG | center]]<br />
<br />
Traditionally, in image processing tasks such as these, a Laplacian Pyramid<ref><br />
[https://en.wikipedia.org/wiki/Pyramid_(image_processing)#Gaussian_pyramid "Pyramid (image processing)"]<br />
</ref> of three resolution banks is used to provide each bank with non-overlapping spectral content. Then the Local Contrast Normalization (LCN<ref><br />
Collobert R, Kavukcuoglu K, Farabet C.[http://infoscience.epfl.ch/record/192376/files/Collobert_NIPSWORKSHOP_2011.pdf Torch7: A matlab-like environment for machine learning] BigLearn, NIPS Workshop. 2011 (EPFL-CONF-192376).<br />
</ref>) is applied to those input images. However, in this model only a full-image stage and a half-resolution stage are used, allowing for a simpler architecture and faster training.<br />
<br />
Although a sliding-window architecture is usually used for this type of task, it has the downside of creating redundant convolutions. Instead, in this network, for each resolution bank, a ConvNet architecture with overlapping receptive fields is used to produce a heat-map as output, which gives a per-pixel likelihood for key joint locations on the human skeleton.<br />
<br />
The following figure shows the efficient sliding-window model with overlapping receptive fields:<br />
<br />
[[File:Qq1.png | center]]<br />
<br />
The convolution results (feature maps) of the low-resolution bank are upscaled and interleaved with those of the high-resolution bank. Then, these dense feature maps are processed through convolution stages at each pixel, which is equivalent to a fully-connected network model but more efficient.<br />
<br />
Supervised training of the network is performed using batched Stochastic Gradient Descent (SGD) with Nesterov Momentum. <br />
<br />
Nesterov momentum can be written as<ref><br />
Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance<br />
of initialization and momentum in deep learning. In Proceedings of the 30th International<br />
Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28<br />
of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.<br />
</ref>:<br />
<br />
[[File:Nmomentum.PNG]]<br />
<br />
Rather than adding each set of gradients from the stochastic batch process separately, a velocity vector is accumulated at some rate <math>\,\mu</math>, so that if gradient descent continuously travels in the same general direction, this velocity vector increases over successive steps and moves faster in that direction than conventional gradient descent would. This should increase the convergence rate and decrease the number of epochs needed to converge to a local minimum. Nesterov momentum makes one modification: it corrects the direction of the velocity vector with <math>\,\epsilon\triangledown f(\theta_t+\mu v_t)</math>, evaluated not at the current position but at the predicted future position. The difference can be seen in the figure<ref><br />
Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance<br />
of initialization and momentum in deep learning. In Proceedings of the 30th International<br />
Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28<br />
of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.<br />
</ref>:<br />
<br />
[[File:Moment.PNG]]<br />
<br />
This correction makes the descent direction more sensitive to changes in direction and increases stability. It can be seen as looking at the future gradient to evaluate the suitability of the current gradient direction. This is evident in the figure, where the first method changes direction based purely on the current position and the second corrects the direction based on the gradient at the look-ahead position.<br />
<br />
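The two update rules can be written as the following sketch, for a generic gradient function; the values of <math>\,\mu</math> and <math>\,\epsilon</math> are illustrative defaults, not the settings used in the paper.<br />
<pre>
def classical_momentum_step(theta, v, grad_fn, mu=0.9, eps=0.01):
    # gradient evaluated at the current position theta
    v_new = mu * v - eps * grad_fn(theta)
    return theta + v_new, v_new

def nesterov_step(theta, v, grad_fn, mu=0.9, eps=0.01):
    # gradient evaluated at the look-ahead position theta + mu * v
    v_new = mu * v - eps * grad_fn(theta + mu * v)
    return theta + v_new, v_new
</pre>
<br />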
They use a Mean Squared Error (MSE) criterion to minimize the distance between the predicted output and a target heat-map. At training time they also perform random perturbations of the input images (randomly flipping and scaling the images) to increase generalization performance.<br />
<br />
=== Higher-Level Spatial-Model ===<br />
<br />
They use a higher-level Spatial-Model to get rid of false positive outliers and anatomically incorrect poses predicted by the Part-Detector, constraining joint inter-connectivity and enforcing global pose consistency.<br />
<br />
They formulate the Spatial-Model as a Markov Random Field (MRF)-like model over the distribution of spatial locations for each body part. MRFs are undirected probabilistic graphical models, whose conditional-independence structure does not enforce directionality in the dependence relations between variables. After the unary potentials for each body part location are provided by the Part-Detector, the pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part given another. For instance, the final marginal likelihood for a body part A can be calculated as:<br />
<br />
<math>\bar{p}_{A}=\frac{1}{Z}\prod_{v\in V}^{ }\left ( p_{A|v}*p_{v}+b_{v\rightarrow A} \right )</math><br />
<br />
where <math>v</math> is the joint location, <math>p_{A|v}</math> is the conditional prior, i.e. the likelihood of body part A occurring at pixel location (i, j) when joint <math>v</math> is located at the center pixel, <math>b_{v\rightarrow A}</math> is a bias term describing the background probability for the message from joint <math>v</math> to A, and Z is the partition function. The learned pair-wise distributions become essentially uniform when the corresponding pairwise edge should be removed from the graph structure. The above equation is analogous to a single round of sum-product belief propagation. Convergence to a global optimum is not guaranteed given that this spatial model is not tree structured. However, the inferred solution is sufficiently accurate for all poses in the datasets used in this research.<br />
<br />
For their practical implementation, they treat the distributions above as energies to avoid the evaluation of Z in the equation before. Their final model is<br />
<br />
<math>\bar{e}_{A}=\mathrm{exp}\left ( \sum_{v\in V}^{ }\left [ \mathrm{log}\left ( \mathrm{SoftPlus}\left ( e_{A|v} \right )*\mathrm{ReLU}\left ( e_{v} \right )+\mathrm{SoftPlus}\left ( b_{v\rightarrow A} \right ) \right ) \right ] \right )</math><br />
<br />
where<br />
<br />
<math>\mathrm{SoftPlus}\left ( x \right )=\frac{1}{\beta }\mathrm{log}\left ( 1+\mathrm{exp}\left ( \beta x \right ) \right ),\quad 0.5\leq \beta \leq 2</math><br />
<br />
<math>\mathrm{ReLU}\left ( x \right )=\mathrm{max}\left ( x,\epsilon \right ),\quad 0< \epsilon \leq 0.01</math><br />
<br />
This model replaces the outer multiplication of the final marginal likelihood with a log-space addition to improve numerical stability and to prevent coupling of the convolution output gradients (the addition in log space means that the partial derivative of the loss function with respect to a convolution output does not depend on the output of any other stage).<br />With this modified formulation, the model can be trained using back-propagation and SGD. The network-based implementation of the equation is shown below.<br />
<br />
[[File:architecture2.PNG | center]]<br />
<br />
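A rough numpy sketch of the log-space combination for a single target part A follows: each term convolves another part's heat-map with a learned conditional prior kernel, passes the pieces through SoftPlus and ReLU, and the per-part terms are summed in log space before exponentiating. The dictionary-based interface, the small epsilon added inside the log, and all shapes are illustrative assumptions.<br />
<pre>
import numpy as np
from scipy.signal import convolve2d

def softplus(x, beta=1.0):
    return np.log1p(np.exp(beta * x)) / beta

def relu(x, eps=0.01):
    return np.maximum(x, eps)

def spatial_marginal(e_parts, priors, biases, eps=1e-6):
    """e_parts: unary heat-maps e_v; priors: conditional kernels e_{A|v};
    biases: scalar background terms b_{v->A}. All dicts are keyed by the joint v."""
    log_terms = []
    for v in e_parts:
        conv = convolve2d(relu(e_parts[v]), softplus(priors[v]), mode="same")
        log_terms.append(np.log(conv + softplus(biases[v]) + eps))
    return np.exp(sum(log_terms))
</pre>
<br />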
The convolution sizes are adjusted so that the largest joint displacement is covered within the convolution window. For the 90x60 pixel heat-map output, this results in large 128x128 convolution kernels to account for a joint displacement radius of 64 pixels (padding is added on the heat-map input to prevent pixel loss). Because the convolution kernels used in this step are quite large, they apply GPU-based FFT convolutions, as introduced by Mathieu et al.<ref><br />
Mathieu M, Henaff M, LeCun Y.[http://arxiv.org/pdf/1312.5851.pdf Fast training of convolutional networks through ffts] arXiv preprint arXiv:1312.5851, 2013.<br />
</ref> The convolution weights are initialized using the empirical histogram of joint displacements created from the training examples. Moreover, during training they randomly flip and scale the heat-map inputs to improve generalization performance; the motivation is that using multiple scales may help capture contextual information.<br />
<br />
=== Unified Model ===<br />
<br />
They first train the Part-Detector separately and store the heat-map outputs, then use these heat-maps to train a Spatial-Model. Finally, they combine the trained Part-Detector and Spatial-Models and back-propagate through the entire network, which further improves performance.<br />
Because the Spatial-Model is able to effectively reduce the output dimension of possible heat-map activations, the Part-Detector can use its available learning capacity to better localize the precise target activation.<br />
<br />
== Results ==<br />
<br />
They evaluated their architecture on the FLIC and extended-LSP datasets. The FLIC dataset is comprised of 5003 images from Hollywood movies with actors in predominantly front-facing standing up poses, while the extended-LSP dataset contains a wider variety of poses of athletes playing sport. They also proposed a new dataset called FLIC-plus<ref><br />
[http://cims.nyu.edu/~tompson/flic_plus.htm "FLIC-plus Dataset"]<br />
</ref> which is fairer than the FLIC-full dataset.<br />
<br />
Their model’s performance on the FLIC test-set for the elbow and wrist joints is shown below. It’s trained by using both the FLIC and FLIC-plus training sets.<br />
<br />
[[File:result1.PNG | center]]<br />
<br />
Performance on the LSP dataset is shown here.<br />
<br />
[[File:result2.PNG | center]]<br />
<br />
Since the LSP dataset covers a larger range of possible poses, their Spatial-Model is less effective, and the accuracy on this dataset is lower than on FLIC. They believe that increasing the size of the training set will improve performance for these difficult cases.<br />
<br />
The following figure shows the predicted joint locations for a variety of inputs in the FLIC and LSP test-sets. The<br />
network produces convincing results on the FLIC dataset (with low joint position error), however,<br />
because the simple Spatial-Model is less effective for a number of the highly articulated poses in<br />
the LSP dataset, the detector results in incorrect joint predictions for some images. Increasing the size of the training set will improve performance for these difficult cases.<br />
<br />
[[File:M2.png | center]]<br />
<br />
== Conclusion ==<br />
<br />
In this paper, a one-step message passing is implemented as a convolution operation in order to incorporate spatial relationships between local detection responses for human body pose estimation. The paper shows that the unification of a novel ConvNet Part-Detector and an MRF-inspired Spatial-Model into a single learning framework significantly outperforms existing architectures on the task of human body pose recognition. Training and inference of the architecture use commodity-level hardware and run at close to real-time frame rates, making this technique tractable for a wide variety of application areas.<br />
<br />
== Bibliography ==<br />
<references /></div>

deep Learning of the tissue-regulated splicing code (http://wiki.math.uwaterloo.ca/statwiki/index.php?title=deep_Learning_of_the_tissue-regulated_splicing_code, revised 2015-12-19 by Derek: /* Training the model */)
<hr />
<div>= Introduction =<br />
<br />
Alternative splicing (AS) is a regulated process during gene expression that enables the same gene to give rise to splicing isoforms containing different combinations of exons, which leads to different protein products. Furthermore, AS is often tissue-dependent. This paper mainly focuses on using a Deep Neural Network (DNN) to predict the outcome of splicing, and compares its performance to previously trained models, a Bayesian Neural Network<ref>https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf</ref> (BNN) and Multinomial Logistic Regression<ref>https://en.wikipedia.org/wiki/Multinomial_logistic_regression</ref> (MLR). <br />
<br />
A key difference in the DNN is that each tissue type is treated as an input, whereas in the previous BNN each tissue type was treated as a different output of the neural network. Moreover, in previous work the splicing code infers the direction of change of the percentage of transcripts with an exon spliced in (PSI). This paper instead predicts absolute PSI for each tissue individually without averaging across tissues, and also predicts the PSI difference (<math>\Delta</math>PSI) between pairs of tissues. Unlike a regular deep neural network, this model trains these two prediction tasks simultaneously.<br />
<br />
= Model =<br />
<br />
The dataset consists of 11019 mouse alternative exons profiled from RNA-Seq<ref>https://en.wikipedia.org/wiki/RNA-Seq</ref> data. Five tissue types are available: brain, heart, kidney, liver and testis. <br />
<br />
The DNN is fully connected, with multiple layers of non-linearity consisting of hidden units. The mathematical expression of model is below:<br />
<br />
::::::: <math>{a_v}^l = f(\sum_{m}^{M^{l-1}}{\theta_{v,m}^{l}a_m^{l-1}})</math> <br />
:::::::where <math>a_v^l</math> is the activation of hidden unit <math>v</math> in layer <math>l</math>, computed from the outputs of the previous layer, and <math>\theta_{v,m}^{l}</math> are the weights between layers <math>l-1</math> and <math>l</math>. <br />
<br />
::::::: <math>f_{RELU}(z)=max(0,z)</math><br />
::::::: The RELU unit was used for all hidden units except for the first hidden layer, which uses TANH units.<br />
<br />
::::::: <math>h_k=\frac{exp(\sum_m{\theta_{k,m}^{last}a_m^{last}})}{\sum_{k'}{exp(\sum_{m}{\theta_{k',m}^{last}a_m^{last}})}}</math><br />
::::::: This is the softmax function of the output layer, which produces the class probabilities <math>h_k</math>. <br />
<br />
The cost function we want to minimize during training is <math>E=-\sum_n\sum_{k=1}^{C}{y_{n,k}\log(h_{n,k})}</math>, where <math>n</math> denotes the training example and <math>k</math> indexes the <math>C</math> classes. <br />
<br />
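A minimal numpy forward pass following these equations is sketched below (tanh in the first hidden layer, ReLU in the remaining hidden layers, softmax at the output, cross-entropy cost); the layer sizes and weight shapes are illustrative.<br />
<pre>
import numpy as np

def forward(x, weights):
    # first hidden layer uses tanh, remaining hidden layers use ReLU
    a = np.tanh(weights[0] @ x)
    for W in weights[1:-1]:
        a = np.maximum(0.0, W @ a)
    logits = weights[-1] @ a
    e = np.exp(logits - logits.max())
    return e / e.sum()                      # softmax over the C classes

def cross_entropy(h, y):
    # E = -sum_k y_k log h_k for a single training example
    return -np.sum(y * np.log(h + 1e-12))
</pre>
<br />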
The identities of two tissues are then appended to the vector of outputs of the first hidden layer, together forming the input to the second hidden layer. Each identity is a 1-of-5 binary encoding in this case (demonstrated in Fig. 1). The first set of training targets contains three classes, labelled ''low'', ''medium'', ''high'' (LMH code). The second task describes the <math>\Delta PSI</math> between two tissues for a particular exon; the three classes corresponding to this task are ''decreased inclusion'', ''no change'' and ''increased inclusion'' (DNI code). Both the LMH and DNI codes are trained jointly, reusing the same hidden representations learned by the model. The DNN is trained with backpropagation and dropout, using different learning rates for the two tasks. <br />
<br />
[[File: Modell.png]]<br />
<br />
= Training the model =<br />
<br />
The first hidden layer was trained as an autoencoder to reduce the dimensionality of the features in an unsupervised manner. This method of pretraining has been used in deep architectures to initialize learning near a good local minimum. In the second stage of training, the weights from the input layer to the first hidden layer are fixed, and 10 additional inputs corresponding to the two tissues are appended. The vector representation of a tissue is a binary vector; for example, it takes the form [0 1 0 0 0] to denote the second tissue out of five possible types. The weights of the remaining hidden layers of the DNN are then trained together in a supervised manner with back-propagation. <br />
<br />
The DNN weights were initialized with small random values sampled from a standard Gaussian distribution. Learning was performed with stochastic gradient descent on mini-batches, with momentum and dropout. A small L1 weight penalty was included in the cost function, and the model's weights were updated after each mini-batch. The learning rate <math>\epsilon_e</math> was decreased with the epoch number <math>e</math>, and a momentum term <math>\mu_e</math> was included that starts at 0.5, increases to 0.99, and then stays fixed. The model parameters <math>\theta</math> were updated as follows:<br /><br />
<br />
::: <math> \, \theta_e = \theta_{e-1} + \Delta \theta_e </math><br />
<br />
::: <math> \Delta\theta_e = \mu_e\Delta\theta_{e-1} - (1-\mu_e)\epsilon_e\nabla E(\theta_e) </math><br />
<br />
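A sketch of this update schedule follows; the momentum ramp, learning-rate decay constants, and function signature are illustrative, since only the general form of the update is specified above.<br />
<pre>
def update(theta, delta_prev, grad, epoch, mu0=0.5, mu_max=0.99, eps0=0.05, decay=0.01):
    mu = min(mu_max, mu0 + 0.01 * epoch)    # momentum ramps from 0.5 to 0.99, then stays fixed
    eps = eps0 / (1.0 + decay * epoch)      # learning rate decreases with epochs
    delta = mu * delta_prev - (1.0 - mu) * eps * grad
    return theta + delta, delta             # theta_e = theta_{e-1} + Delta theta_e
</pre>
<br />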
In addition, they filtered the data before training by excluding examples for which the total number of RNA-Seq junction reads was below 10. This removed 45.8% of the total number of training examples. <br />
<br />
Both the LMH and DNI codes are trained together, but with different learning rates, because the two tasks may learn at different rates. This is to prevent one task from overfitting too soon and negatively affecting the performance of the other task before the complete model is fully trained. <br />
<br />
The targets consist of (i) the PSI for each of the two tissues and (ii) the <math> \Delta PSI </math> between the two tissues. As a result, given two identical tissues, the model should predict no change for <math> \Delta PSI </math>. Also, if the tissues are swapped in the input, a previous ''increased inclusion'' label should become ''decreased inclusion''. The training examples are constructed with some redundancy (i.e., in some of the training examples the two tissues are identical) so that the model learns this without it having to be explicitly specified.<br />
<br />
The mini-batches for training were biased such that earlier batches contain 4/5 examples with high tissue variability and 1/5 with low tissue variability. After the high-variability examples are all used, the batches are randomly selected from the remaining lower-variability examples. The stated purpose is to give examples with high tissue variability greater importance, while avoiding over-fitting by having them early in the training.<br />
<br />
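A sketch of one way to construct such an ordering of mini-batches; the 4/5-to-1/5 split comes from the description above, while the batch size, seed, and exhaustion behaviour are illustrative choices.<br />
<pre>
import numpy as np

def make_batches(high_var_idx, low_var_idx, batch_size=100, seed=0):
    rng = np.random.default_rng(seed)
    high = list(rng.permutation(high_var_idx))
    low = list(rng.permutation(low_var_idx))
    n_hi = (4 * batch_size) // 5             # 4/5 high-variability examples per early batch
    batches = []
    while high:
        batches.append(high[:n_hi] + low[:batch_size - n_hi])
        high, low = high[n_hi:], low[batch_size - n_hi:]
    # once the high-variability pool is used up, draw batches from the remaining examples
    batches += [low[i:i + batch_size] for i in range(0, len(low), batch_size)]
    return batches
</pre>
<br />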
= Performance comparison =<br />
<br />
The performance of the model was assessed using the area under the Receiver-Operating Characteristic curve (AUC) metric. This paper compared three methods through the same baseline, DNN, BNN and MLR. <br />
<br />
The results for the LMH code are shown in the tables below. Table 1a reports AUC for PSI predictions from the LMH code on all tissues, while Table 1b reports AUC evaluated on the subset of events that exhibit large tissue variability. From 1a, the performance of the DNN in the ''low'' and ''high'' categories is comparable with the BNN, but the DNN outperforms it at the ''medium'' level. From 1b, the DNN significantly outperformed both the BNN and MLR. In both comparisons, MLR performed poorly. <br />
<br />
[[File: LMH.png]]<br />
<br />
Next, we look at how well the different methods can predict <math>\Delta PSI</math> (DNI code). The DNN predicts the LMH code and the DNI code at the same time, while the BNN can only predict the LMH code. Thus, for a fair comparison, the authors trained an MLR on the predicted outputs for each tissue pair from the BNN and similarly trained an MLR on the LMH outputs of the DNN. Table 2 shows that both DNN and DNN+MLR outperformed BNN+MLR and MLR. <br />
<br />
[[File: DNI.png]]<br />
<br />
<br />
'''Why did DNN outperform?'''<br />
<br />
1. The use of tissue types as an input feature, which stringently requires the model's hidden representations to be in a form that can be well-modulated by information specifying the different tissue types for splicing pattern prediction. <br />
<br />
2. The model is described by thousands of hidden units and multiple layers of non-linearity. In contrast, BNN only has 30 hidden units, which may not be sufficient. <br />
<br />
3. A hyperparameter search is performed to optimize the DNN.<br />
<br />
4. The use of dropout, which contributed ~1-6% improvement in the LMH code for different tissues, and ~2-7% in the DNI code, compared with without dropout.<br />
<br />
5. Training was biased toward the tissue-specific events (by construction of minibatches).<br />
<br />
= Conclusion =<br />
<br />
This work shows that DNNs can also be used on sparse biological datasets. Furthermore, the input features can be analyzed in terms of the predictions of the model to gain some insight into the inferred tissue-regulated splicing code. This architecture can easily be extended to the case of more data from different sources.<br />
<br />
= References =<br />
<br />
<references /></div>

from Machine Learning to Machine Reasoning (http://wiki.math.uwaterloo.ca/statwiki/index.php?title=from_Machine_Learning_to_Machine_Reasoning, revised 2015-12-19 by Derek: /* Reasoning Systems */)
<hr />
<div>== Introduction ==<br />
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now understood but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Inference algorithms do exist, but they come at the price of reduced expressive capabilities in both logical and probabilistic inference.<br />
<br />
Humans display neither of these limitations.<br />
<br />
The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.<br />
<br />
This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.<br />
<br />
This approach is explored along a number of auxiliary tasks.<br />
<br />
== Auxiliary Tasks ==<br />
<br />
The usefulness of auxiliary tasks was examined within the contexts of two problems: face-based identification and natural language processing. Both examples show how an easier task with abundant data (e.g. determining whether two faces belong to the same person) can be used to boost performance on a harder task (e.g. identifying faces).<br />
<br />
'''Face-based Identification'''<br />
<br />
Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for the slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people, while two faces in successive video frames are likely to belong to the same person. The two tasks have much in common (image analysis primitives, feature extraction, part recognizers), so components trained on the auxiliary task can help solve the original task.<br />
<br />
The figure below illustrates a transfer learning strategy involving three trainable modules. The preprocessor P computes a compact face representation from the image, the comparator D decides whether two representations belong to the same person, and the classifier C labels the face. We first assemble two preprocessors P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a restrained number of labelled examples from the original task.<br />
<br />
[[File:figure1.JPG | center]]<br />
<br />
'''Natural Language Processing'''<br />
<br />
The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. This creates embeddings for words in a 50-dimensional space. These embeddings can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note the word-embedding modules "W" shared between the tasks.<br />
<br />
[[File:word_transfer.png | center]]<br />
<br />
== Reasoning Revisited ==<br />
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.<br />
<br />
We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".<br />
<br />
Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.<br />
<br />
[[File:figure5.JPG | center]]<br />
<br />
== Probabilistic Models ==<br />
Graphical models describe the factorization of joint probability distributions into lower-dimensional conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation<ref name=BuW><br />
Buntine, Wray L [http://arxiv.org/pdf/cs/9412102.pdf"Operations for learning with graphical models"] in The Journal of Artificial Intelligence Research, (1994).<br />
</ref> compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large graphical probabilistic models. Such high order languages for describing probabilistic models are expressions of the composition rules described in the previous section.<br />
<br />
== Reasoning Systems ==<br />
We are no longer fitting a simple statistical model to data and instead, we are dealing with a more complex model consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".<br />
<br />
Reasoning systems are quite diverse and vary in expressive power, predictive abilities, and computational requirements. A few examples include:<br />
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of first order logic as a function of its free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W.[https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, this is not sufficient in expressing natural language: every first order logic formula can be expressed in natural language but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.<br />
<br />
*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure, and hence we can apply Bayesian inference to form deductions. Probabilistic models are computationally less expensive, but this comes at the price of lower expressive power: probability theory can be described in first order logic but the converse is not true.<br />
<br />
*''Causal reasoning'' - The events "it is raining" and "people carry open umbrellas" are highly correlated and mutually predictive: if people carry open umbrellas, it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.<br />
<br />
*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive powers of causal reasoning. Newton's three laws of motion make very accurate predictions on the motion of bodies on our universe.<br />
<br />
*''Spatial reasoning'' - A change in visual scene with respect to one's change in viewpoint is also subjected to algebraic constraints.<br />
<br />
*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.<br />
<br />
*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities, as their reliability cannot be ascertained. <br />
<br />
It is desirable to map the universe of reasoning systems, but unfortunately we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.<br />
<br />
The replication of essential human cognitive processes such as scene analysis, language understanding, and social interaction forms an important class of applications. These processes probably include a form of logical reasoning, because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.<br />
<br />
The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.<br />
<br />
== Association and Dissociation ==<br />
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of the sentence, and each intermediate result to represent the meaning of the corresponding sentence fragment.<br />
<br />
[[File:figure6.JPG | center]]<br />
<br />
There are many ways of bracketing the same sentence, each of which can yield a different meaning. The figure below, for example, corresponds to the bracketing "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R which takes a sentence fragment and measures how meaningful the corresponding fragment is.<br />
<br />
[[File:figure7.JPG | center]]<br />
<br />
The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> Training uses stochastic gradient descent: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure, an arbitrary word is then replaced by a random word from the vocabulary, and the parameters of all the modules are adjusted using a simple gradient descent step.<br />
<br />
[[File:figure8.JPG | center]]<br />
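<br />
The search over bracketings can be sketched in the same toy setting (an illustrative greedy procedure, not the authors' implementation; the linear form of R is an assumption): repeatedly merge the adjacent pair of fragments whose merged representation the scoring module R rates highest, accumulating the scores of all intermediate results.<br />
<br />
<pre>
import numpy as np

d = 4
rng = np.random.default_rng(1)
w_r = rng.standard_normal(d) * 0.1               # illustrative scoring module R

def score(v):
    """R(v): a toy scalar measure of how meaningful a fragment representation is."""
    return float(w_r @ v)

def greedy_bracketing(vecs, associate):
    """Greedily merge adjacent fragments so as to (approximately) maximize the summed R scores."""
    frags = list(vecs)
    total = 0.0
    while len(frags) > 1:
        # Try every adjacent pair and keep the merge that R scores highest.
        candidates = [(score(associate(frags[i], frags[i + 1])), i)
                      for i in range(len(frags) - 1)]
        best_score, i = max(candidates)
        total += best_score
        frags[i:i + 2] = [associate(frags[i], frags[i + 1])]
    return frags[0], total                       # final sentence vector and its global score
</pre>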
<br />
In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.<br />
<br />
[[File:figure9.JPG | center]]<br />
<br />
The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.<br />
<br />
The association and dissociation modules can be seen as analogous to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives construct a new object from two individual objects (<code>cons</code>, "association") or extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here lives in a continuous vector space. This limits the depth of structures that can be constructed (because of limited numerical precision), while at the same time it makes other vectors in the numerical proximity of a representation also meaningful. This latter property makes search algorithms more efficient, as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref><br />
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]<br />
</ref><br />
<br />
[[File:figure10.JPG | center]]<br />
<br />
Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. However, pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation. This poor performance is due to the fixed geometry of the spatial pooling layers: the lower layers aggregate the local features according to a predefined pattern and pass them to the upper levels, and this aggregation causes poor spatial and orientation accuracy. One approach to resolving this drawback is a parsing mechanism in which intermediate representations can be attached to patches of the image. <br />
<br />
The association-dissociation modules of the sort described in this section have been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref><br />
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]<br />
</ref><ref><br />
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]<br />
</ref>. A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time). <br />
<br />
<br />
[[File:figure11.JPG | center]]<br />
<br />
Finally, we envision modules that convert image representations into sentence representations and vice versa. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.<br />
<br />
== Universal Parser ==<br />
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory, and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.<br />
<br />
[[File:figure12.JPG | center]]<br />
<br />
The algorithm design choices determine which data structure is most appropriate for implementing the STM. In English, sentences are written as sequences of words separated by spaces, so it is attractive to implement the STM as a stack and construct a shift/reduce parser, as sketched below.<br />
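<br />
A minimal sketch of this shift/reduce view follows (illustrative only: the ''associate'' and ''score'' arguments stand in for the trained modules A and R, and the zero decision threshold is an arbitrary assumption). The STM is kept as a stack; at each step the parser either shifts the next word vector onto the stack or reduces the top two stack entries with A, reducing whenever R judges the merged fragment sufficiently meaningful.<br />
<br />
<pre>
def universal_parse(word_vecs, associate, score):
    """Greedy shift/reduce parse where the STM is a stack of representation vectors."""
    stack, buffer = [], list(word_vecs)
    while buffer or len(stack) > 1:
        can_shift = bool(buffer)
        can_reduce = len(stack) >= 2
        reduce_score = score(associate(stack[-2], stack[-1])) if can_reduce else float("-inf")
        if can_shift and (not can_reduce or reduce_score < 0.0):
            stack.append(buffer.pop(0))          # action 1: insert a new vector into the STM
        else:
            merged = associate(stack[-2], stack[-1])
            stack[-2:] = [merged]                # action 2: replace two vectors by their association
    return stack[0]                              # a single representation vector remains
</pre>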
<br />
== More Modules ==<br />
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.<br />
<br />
*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.<br />
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.<br />
<br />
== Representation Space ==<br />
The previous sections assumed modules operating on a low-dimensional vector space, but modules with similar algebraic properties could be defined on other representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.<br />
*Low-dimensional vector spaces are convenient, but, in order to provide sufficient capabilities, the trainable functions must often be designed with nonlinear parameterizations. The training algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.<br />
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.<br />
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables. With this representation, parts of the learning and inference algorithms must be expressed in terms of stochastic sampling; Gibbs sampling and Markov-chain Monte Carlo are two prominent techniques for this purpose.<br />
<br />
== Conclusions ==<br />
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Rather than trying to bridge the gap between machine learning systems and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to training systems and build reasoning abilities from the ground up.<br />
<br />
== Bibliography ==<br />
<references /></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=from_Machine_Learning_to_Machine_Reasoning&diff=27361from Machine Learning to Machine Reasoning2015-12-19T00:44:52Z<p>Derek: /* Probabilistic Models */</p>
<hr />
<div>== Introduction ==<br />
Learning and reasoning are both essential abilities associated with intelligence. Consequently, machine learning and machine reasoning have received considerable attention given the short history of computer science. The statistical nature of machine learning is now well understood, but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. [http://www.math.snu.ac.kr/~hichoi/infomath/Articles/Lighthill%20Report.pdf "Artificial intelligence: a general survey."] In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. [http://bayes.cs.ucla.edu/BOOK-2K/neuberg-review.pdf "Causality: models, reasoning, and inference."] Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6074&rep=rep1&type=pdf "On the hardness of approximate reasoning"] Artificial Intelligence, 82, 273–302.</ref>. Algorithms for inference do exist, but they come at the price of reduced expressive capabilities in logical and probabilistic inference.<br />
<br />
Humans display neither of these limitations.<br />
<br />
The ability to reason is not the same as the ability to make logical inferences. The way that humans reason provides evidence to suggest the existence of a middle layer, already a form of reasoning, but not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity that is associated with combinatorial searches in the vast space of discrete logic propositions.<br />
<br />
This paper shows how deep learning and multi-task learning can be leveraged as a rudimentary form of reasoning to help solve a task of interest.<br />
<br />
This approach is explored along a number of auxiliary tasks.<br />
<br />
== Auxiliary Tasks ==<br />
<br />
The usefulness of auxiliary tasks is examined within the context of two problems: face-based identification and natural language processing. Both examples show how an easier auxiliary task (such as determining whether two faces are different) can be used to boost performance on a harder task (such as identifying faces).<br />
<br />
'''Face-based Identification'''<br />
<br />
Identifying a person from face images is challenging. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for a slightly different task: telling whether two faces in images represent the same person or not. Two faces in the same picture are likely to belong to two different people, while two faces in successive video frames are likely to belong to the same person. Because these two tasks share many image analysis primitives (feature extractors, part recognizers, and so on), components trained on the auxiliary task can help solve the original task.<br />
<br />
The figure below illustrates a transfer learning strategy involving three trainable modules. The preprocessor P computes a compact face representation from the image, the comparator D compares two such representations, and the classifier C labels the face. We first assemble two instances of the preprocessor P with one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a restrained number of labelled examples from the original task.<br />
<br />
[[File:figure1.JPG | center]]<br />
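<br />
A minimal structural sketch of this strategy (toy linear modules and dimensions; all names and shapes are illustrative assumptions) shows the key point: the comparator and the classifier are built around the ''same'' preprocessor P, so training on the abundant auxiliary pairs shapes the representation reused by the identity classifier.<br />
<br />
<pre>
import numpy as np

rng = np.random.default_rng(2)
d_in, d_rep, n_ids = 64, 16, 10                  # toy image size, face-code size, #identities

# Trainable modules (toy linear/sigmoid versions, for illustration only).
P   = rng.standard_normal((d_rep, d_in)) * 0.1   # preprocessor: image -> compact face code
w_D = rng.standard_normal(2 * d_rep) * 0.1       # comparator D: (code, code) -> same/different
W_C = rng.standard_normal((n_ids, d_rep)) * 0.1  # classifier C: code -> identity scores

def preprocess(x):
    return np.tanh(P @ x)

def compare(x1, x2):
    """Auxiliary task: probability that two images show the same person (two shared copies of P)."""
    z = np.concatenate([preprocess(x1), preprocess(x2)])
    return 1.0 / (1.0 + np.exp(-w_D @ z))

def classify(x):
    """Original task: identity scores, reusing the SAME preprocessor P learned on the auxiliary task."""
    return W_C @ preprocess(x)

x_a, x_b = rng.standard_normal(d_in), rng.standard_normal(d_in)
print("same-person probability:", compare(x_a, x_b))
print("identity scores:", classify(x_a))
</pre>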
<br />
'''Natural Language Processing'''<br />
<br />
The auxiliary task in this case (left diagram of the figure below) is identifying whether a sentence is correct or not. This produces an embedding for words in a 50-dimensional space. This embedding can then be used on the primary problem (right diagram of the figure below) of producing tags for the words. Note the word embedding modules W shared between the two tasks.<br />
<br />
[[File:word_transfer.png | center]]<br />
<br />
== Reasoning Revisited ==<br />
Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.<br />
<br />
We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".<br />
<br />
Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. [http://www.iro.umontreal.ca/~lisa/pointeurs/bottou-lecun-bengio-97.pdf "Global training of document processing systems using graph transformer networks."] In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers then should be viewed as a description of the composition rules.<br />
<br />
[[File:figure5.JPG | center]]<br />
<br />
== Probabilistic Models ==<br />
Graphical models describe the factorization of joint probability distributions into lower-dimensional conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables. Many refinements have been devised to make the parametrization more explicit. The plate notation<ref name=BuW><br />
Buntine, Wray L [http://arxiv.org/pdf/cs/9412102.pdf"Operations for learning with graphical models"] in The Journal of Artificial Intelligence Research, (1994).<br />
</ref> compactly represents large graphical models with repeated structures that usually share parameters. More recent works propose considerably richer languages to describe large graphical probabilistic models. Such high order languages for describing probabilistic models are expressions of the composition rules described in the previous section.<br />
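<br />
As a concrete illustration (a standard textbook example, not taken from the paper) of the kind of factorization these composition rules express, a model with the repeated, parameter-sharing structure that plate notation abbreviates can be written as<br />
<br />
:<math> p\left(z, x_1, \ldots, x_N\right) = p\left(z\right) \prod_{n=1}^{N} p\left(x_n \mid z, \theta\right), </math><br />
<br />
where the plate stands for the <math>N</math> repeated factors <math>p\left(x_n \mid z, \theta\right)</math>, all sharing the same parameter <math>\theta</math>.<br />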
<br />
== Reasoning Systems ==<br />
We are no longer fitting a simple statistical model to data and instead, we are dealing with a more complex model consisting of (a) an algebraic space of models, and (b) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".<br />
<br />
Reasoning systems vary widely in expressive power, predictive abilities, and computational requirements. A few examples include:<br />
*''First order logic reasoning'' - Consider a space of models composed of functions that predict the truth value of a first order logic formula as a function of its free variables. This space is highly constrained by algebraic structure, and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W.[https://www.math.uwaterloo.ca/~snburris/htdocs/scav/hilbert/hilbert.html "Grundzüge der theoretischen Logik."] Berlin: Springer.</ref>. However, it is not sufficient for expressing natural language: every first order logic formula can be expressed in natural language, but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.<br />
<br />
*''Probabilistic reasoning'' - Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure, and hence we can apply Bayesian inference to form deductions. Probabilistic models are computationally less expensive, but this comes at the price of lower expressive power: probability theory can be described by first order logic, but the converse is not true.<br />
<br />
*''Causal reasoning'' - The events "it is raining" and "people carry open umbrellas" are highly correlated, and the correlation is predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.<br />
<br />
*''Newtonian Mechanics'' - Classical mechanics is an example of the great predictive power of causal reasoning. Newton's three laws of motion make very accurate predictions about the motion of bodies in our universe.<br />
<br />
*''Spatial reasoning'' - Changes in a visual scene resulting from a change in viewpoint are also subject to algebraic constraints.<br />
<br />
*''Social reasoning'' - Changes of viewpoints also play a very important role in social interactions.<br />
<br />
*''Non-falsifiable reasoning'' - Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.<br />
<br />
It is desirable to map the universe of reasoning systems, but unfortunately, we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.<br />
<br />
The replication of essential human cognitive processes such as scene analysis, language understanding, and social interaction forms an important class of applications. These processes probably include a form of logical reasoning, because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.<br />
<br />
The following sections describe more specific ideas for investigating reasoning systems suitable for natural language processing and vision tasks.<br />
<br />
== Association and Dissociation ==<br />
We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of ''n'' words, the figure below shows how ''n-1'' applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of the sentence, and each intermediate result to represent the meaning of the corresponding sentence fragment.<br />
<br />
[[File:figure6.JPG | center]]<br />
<br />
There are many ways of bracketing the same sentence, each corresponding to a different reading of the sentence. The figure below, for example, corresponds to the bracketing "''((the cat) (sat (on (the mat))))''". In order to determine which bracketing splits the sentence into the most meaningful fragments, we introduce a new scoring module R, which takes a sentence fragment and measures how meaningful that fragment is.<br />
<br />
[[File:figure7.JPG | center]]<br />
<br />
The idea is to apply this R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. [https://aclweb.org/anthology/P/P07/P07-1071.pdf "Fast semantic extraction using a novel neural network architecture."] In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. [http://ronan.collobert.com/pub/matos/2011_parsing_aistats.pdf "Deep learning for efficient discriminative parsing."] In Proc. artificial intelligence and statistics (AISTAT).</ref> This is a stochastic gradient descent method: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step.<br />
<br />
[[File:figure8.JPG | center]]<br />
<br />
In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.<br />
<br />
[[File:figure9.JPG | center]]<br />
<br />
The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.<br />
<br />
The association and dissociation modules can be seen as analogous to the <code>cons</code>, <code>car</code>, and <code>cdr</code> primitives of the Lisp programming language. These primitives construct a new object from two individual objects (<code>cons</code>, "association") or extract the individual objects (<code>car</code> and <code>cdr</code>, "dissociation") from a constructed object. However, there is an important difference. The representation in Lisp is discrete, whereas the representation here lives in a continuous vector space. This limits the depth of structures that can be constructed (because of limited numerical precision), while at the same time it makes other vectors in the numerical proximity of a representation also meaningful. This latter property makes search algorithms more efficient, as it is possible to follow a gradient (instead of performing discrete jumps). Note that the presented idea of association and dissociation in a vector space is very similar to what is known as Vector Symbolic Architectures.<ref><br />
[http://arxiv.org/abs/cs/0412059 Gayler, Ross W. "Vector symbolic architectures answer Jackendoff's challenges for cognitive neuroscience." arXiv preprint cs/0412059 (2004).]<br />
</ref><br />
<br />
[[File:figure10.JPG | center]]<br />
<br />
Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. A similar pyramidal structure has been associated with the visual cortex. However, pyramidal structures work poorly as image segmentation tools. Take, for example, the figure below, which shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation. This poor performance is due to the fixed geometry of the spatial pooling layers: the lower layers aggregate the local features according to a predefined pattern and pass them to the upper levels, and this aggregation causes poor spatial and orientation accuracy. One approach to resolving this drawback is a parsing mechanism in which intermediate representations can be attached to patches of the image. <br />
<br />
The association-dissociation modules of the sort described in this section have been given a more general treatment in recent work on recursive neural networks, which similarly apply a single function to a sequence of inputs in a pairwise fashion to build up distributed representations of data (e.g. natural language sentences or segmented images).<ref><br />
[http://www.socher.org/uploads/Main/SocherHuvalManningNg_EMNLP2012.pdf Socher, R. et al. "Semantic Compositionality through Recursive Matrix-Vector Spaces" EMNLP (2012).]<br />
</ref><ref><br />
[http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf Socher, R. et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" EMNLP (2013).]<br />
</ref>. A standard recurrent network can also be thought of as a special case of this approach in which the recursive application always proceeds left to right through the input sequence (i.e. there is no branching in the tree produced by unfolding the recursion through time). <br />
<br />
<br />
[[File:figure11.JPG | center]]<br />
<br />
Finally, we envision modules that convert image representations into sentence representations and vice versa. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.<br />
<br />
== Universal Parser ==<br />
The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory, and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.<br />
<br />
[[File:figure12.JPG | center]]<br />
<br />
The algorithm design choices determine which data structure is most appropriate for implementing the STM. In English, sentences are written as sequences of words separated by spaces, so it is attractive to implement the STM as a stack and construct a shift/reduce parser.<br />
<br />
== More Modules ==<br />
The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.<br />
<br />
*Operator grammars <ref>Harris, Z. S. [https://books.google.ca/books/about/Mathematical_structures_of_language.html?id=qsbuAAAAMAAJ&redir_esc=y "Mathematical structures of language."] Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.<br />
*There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.<br />
<br />
== Representation Space ==<br />
The previous sections assumed modules operating on a low-dimensional vector space, but modules with similar algebraic properties could be defined on other representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.<br />
*Low-dimensional vector spaces are convenient, but, in order to provide sufficient capabilities, the trainable functions must often be designed with nonlinear parameterizations. The training algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.<br />
*Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.<br />
*The representation space can also be a space of probability distributions defined on a vector of discrete random variables. With this representation, parts of the learning and inference algorithms must be expressed in terms of stochastic sampling; Gibbs sampling and Markov-chain Monte Carlo are two prominent techniques for this purpose.<br />
<br />
== Conclusions ==<br />
The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Rather than trying to bridge the gap between machine learning systems and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to training systems and build reasoning abilities from the ground up.<br />
<br />
== Bibliography ==<br />
<references /></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_Manifold_Tangent_Classifier&diff=27360the Manifold Tangent Classifier2015-12-19T00:31:04Z<p>Derek: /* Discussion */</p>
<hr />
<div>== Introduction ==<br />
<br />
The goal in many machine learning problems is to extract information from data with minimal prior knowledge<ref name = "main"> Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., & Muller, X. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2011_1240.pdf The manifold tangent classifier.] In Advances in Neural Information Processing Systems (pp. 2294-2302). </ref> These algorithms are designed to work on numerous problems which they may not be specifically tailored towards, thus domain-specific knowledge is generally not incorporated into the models. However, some generic "prior" hypotheses are considered to aid in the general task of learning, and three very common ones are presented below:<br />
<br />
# The '''semi-supervised learning hypothesis''': This states that knowledge of the input distribution <math>p\left(x\right)</math> can aid in learning the output distribution <math>p\left(y|x\right)</math> .<ref>Lasserre, J., Bishop, C. M., & Minka, T. P. (2006, June). [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1640745 Principled hybrids of generative and discriminative models.] In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 87-94). IEEE.</ref> This hypothesis lends credence to not only the theory of strict semi-supervised learning, but also unsupervised pretraining as a method of feature extraction.<ref> Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 A fast learning algorithm for deep belief nets.] Neural computation, 18(7), 1527-1554.</ref><ref>Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). [http://delivery.acm.org/10.1145/1760000/1756025/p625-erhan.pdf?ip=129.97.89.222&id=1756025&acc=PUBLIC&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=561475515&CFTOKEN=96787671&__acm__=1447710319_1ea806f74c2b3b6959e97d9d0e03d533 Why does unsupervised pre-training help deep learning?.] The Journal of Machine Learning Research, 11, 625-660.</ref><br />
# The '''unsupervised manifold hypothesis''': This states that real-world data presented in high-dimensional spaces is likely to concentrate around a low-dimensional sub-manifold.<ref>Cayton, L. (2005). [http://www.vis.lbl.gov/~romano/mlgroup/papers/manifold-learning.pdf Algorithms for manifold learning.] Univ. of California at San Diego Tech. Rep, 1-17.</ref><br />
# The '''manifold hypothesis for classification''': This states that points of different classes are likely to concentrate along different sub-manifolds, separated by low-density regions of the input space.<ref name = "main"></ref><br />
<br />
The recently-proposed Contractive Auto-Encoder (CAE) algorithm has shown success in the task of unsupervised feature extraction,<ref name = "CAE">Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf Contractive auto-encoders: Explicit invariance during feature extraction.] In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 833-840).</ref> with its successful application in pre-training of Deep Neural Networks (DNN) an illustration of the merits of adopting '''Hypothesis 1'''. CAE also yields a mostly contractive mapping that is locally only sensitive to a few input directions, which implies that it models a lower-dimensional manifold (exploiting '''Hypothesis 2''') since the directions of sensitivity are in the tangent space of the manifold. <br />
<br />
This paper builds on the previous work by using the information about the tangent spaces and considering '''Hypothesis 3''': it extracts basis vectors for the local tangent space around each training point from the parameters of the CAE. Then, older supervised classification algorithms that exploit tangent directions as domain-specific prior knowledge can be applied to the tangent spaces generated by the CAE when fine-tuning the overall classification network. This approach seamlessly integrates all three of the above hypotheses and produces record-breaking results (as of 2011) on image classification.<br />
<br />
== Contractive Auto-Encoders (CAE) and Tangent Classification ==<br />
<br />
The problem is to find a non-linear feature extractor for a dataset <math>\mathcal{D} = \{x_1, \ldots, x_n\}</math>, where <math>x_i \in \mathbb{R}^d</math> are i.i.d. samples from an unknown distribution <math> p\left(x\right)</math>.<br />
<br />
=== Traditional Auto-Encoders === <br />
<br />
A traditional auto-encoder learns an '''encoder''' function <math>h: \mathbb{R}^d \rightarrow \mathbb{R}^{d_h}</math> along with a '''decoder''' function <math>g: \mathbb{R}^{d_h} \rightarrow \mathbb{R}^{d}</math>, so that the reconstruction is <math>r = g\left(h\left(x\right)\right) </math>. <math>h\,</math> maps the input <math>x\,</math> to the hidden representation space, and <math>g\,</math> reconstructs <math>x\,</math> from it. Letting <math>L\left(x,g\left(h\left(x\right)\right)\right)</math> denote the reconstruction error, the objective function minimized to learn the parameters <math>\theta\,</math> of the encoder/decoder is:<br />
<br />
:<math> \mathcal{J}_{AE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) </math><br />
<br />
The form of the '''encoder''' is <math>h\left(x\right) = s\left(Wx + b_h\right)</math>, where <math>s\left(z\right) = \frac{1}{1 + e^{-z}}</math> is the element-wise logistic sigmoid. <math>W \in \mathbb{R}^{d_h \times d} </math> and <math>b_h \in \mathbb{R}^{d_h}</math> are the parameters (weight matrix and bias vector, respectively). The form of the '''decoder''' is <math>r = g\left(h\left(x\right)\right) = s_2\left(W^Th\left(x\right)+b_r\right)</math>, where <math>\,s_2 = s</math> or the identity. The weight matrix <math>W^T\,</math> is shared with the encoder, with the only new parameter being the bias vector <math>b_r \in \mathbb{R}^d</math>.<br />
<br />
The '''loss function''' can either be the squared error <math>L\left(x,r\right) = \|x - r\|^2</math> or the Bernoulli cross-entropy, given by: <br />
<br />
:<math> L\left(x, r\right) = -\sum_{i=1}^d \left[x_i \mbox{log}\left(r_i\right) + \left(1 - x_i\right)\mbox{log}\left(1 - r_i\right)\right]</math><br />
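<br />
A minimal numpy sketch of this tied-weight auto-encoder (illustrative shapes and random data; <math>s_2</math> taken to be the sigmoid and the Bernoulli cross-entropy used as the loss) is:<br />
<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)
d, d_h = 8, 3                                    # input and hidden dimensions (toy values)
W   = rng.standard_normal((d_h, d)) * 0.1        # shared weight matrix
b_h = np.zeros(d_h)                              # encoder bias
b_r = np.zeros(d)                                # decoder bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    return sigmoid(W @ x + b_h)                  # h(x)

def decode(h):
    return sigmoid(W.T @ h + b_r)                # r = g(h(x)), with tied weights W^T

def cross_entropy(x, r):
    """Bernoulli cross-entropy reconstruction loss L(x, r)."""
    return -np.sum(x * np.log(r) + (1 - x) * np.log(1 - r))

x = rng.uniform(0.0, 1.0, size=d)                # toy input in [0, 1]^d
print(cross_entropy(x, decode(encode(x))))       # one term of the objective J_AE
</pre>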
<br />
=== First- and Higher-Order Contractive Auto-Encoders ===<br />
<br />
==== Additional Penalty on Jacobian ==== <br />
<br />
The Contractive Auto-Encoder (CAE), proposed by Rifai et al.<ref name = "CAE"></ref>, encourages robustness of <math>h\left(x\right)</math> to small variations in <math>x</math> by penalizing the Frobenius norm of the encoder's Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. The new objective function to be minimized is:<br />
<br />
:<math> \mathcal{J}_{CAE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) + \lambda\|J\left(x\right)\|_F^2 </math><br />
<br />
where <math>\lambda</math> is a non-negative regularization parameter. We can compute the <math>j^{th}</math> row of the Jacobian of the sigmoidal encoder quite easily using the <math>j^{th}</math> row of <math>W</math>:<br />
<br />
:<math> J\left(x\right)_j = \frac{\partial h_j\left(x\right)}{\partial x} = h_j\left(x\right)\left(1 - h_j\left(x\right)\right)W_j</math><br />
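<br />
Using this row formula, the Jacobian and the contractive penalty can be computed directly; a brief sketch (reusing the toy encoder shapes above, purely for illustration):<br />
<br />
<pre>
import numpy as np

def jacobian(x, W, b_h):
    """J(x) for the sigmoidal encoder: row j equals h_j(x)(1 - h_j(x)) W_j."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b_h)))
    return (h * (1.0 - h))[:, None] * W          # shape (d_h, d)

def contractive_penalty(x, W, b_h):
    """Squared Frobenius norm ||J(x)||_F^2 added to the CAE objective."""
    J = jacobian(x, W, b_h)
    return float(np.sum(J ** 2))
</pre>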
<br />
==== Additional Penalty on Hessian ====<br />
<br />
It is also possible to penalize higher-order derivatives by approximating the Hessian (explicit computation of the Hessian is costly). It is sufficient to penalize the difference between <math>J\left(x\right)</math> and <math>J\left(x + \varepsilon\right)</math> where <math>\,\varepsilon </math> is small, as this represents the rate of change of the Jacobian. This yields the "CAE+H" variant, with objective function as follows:<br />
<br />
:<math> \mathcal{J}_{CAE+H}\left(\theta\right) = \mathcal{J}_{CAE}\left(\theta\right) + \gamma\sum_{x \in \mathcal{D}}\mathbb{E}_{\varepsilon\sim\mathcal{N}\left(0,\sigma^2I\right)} \left[\|J\left(x\right) - J\left(x + \varepsilon\right)\|^2\right] </math><br />
<br />
The expectation above, in practice, is taken over stochastic samples of the noise variable <math>\varepsilon\,</math> at each stochastic gradient descent step. <math>\gamma\,</math> is another regularization parameter. This formulation will be the one used within this paper.<br />
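<br />
A sketch of this stochastic approximation (reusing the <code>jacobian</code> helper from the previous sketch; the number of noise samples and <math>\sigma\,</math> here are illustrative choices, not values from the paper):<br />
<br />
<pre>
import numpy as np

def hessian_penalty(x, W, b_h, sigma=0.1, n_samples=4, rng=None):
    """Monte-Carlo estimate of E_eps ||J(x) - J(x + eps)||^2 with eps ~ N(0, sigma^2 I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    Jx = jacobian(x, W, b_h)                     # jacobian() as defined in the previous sketch
    total = 0.0
    for _ in range(n_samples):
        eps = rng.normal(0.0, sigma, size=x.shape)
        total += np.sum((Jx - jacobian(x + eps, W, b_h)) ** 2)
    return float(total / n_samples)
</pre>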
<br />
=== Characterizing the Tangent Bundle Captured by a CAE ===<br />
<br />
Although the regularization term encourages insensitivity of <math>h(x)</math> in all input space directions, the pressure to form an accurate reconstruction counters this somewhat, and the result is that <math>h(x)</math> is only sensitive to the few input directions necessary to distinguish close-by training points.<ref name = "CAE"></ref> Geometrically, the interpretation is that these directions span the local tangent space of the underlying manifold that characterizes the input data. <br />
<br />
==== Geometric Terms ====<br />
<br />
* '''Tangent Bundle''': The tangent bundle of a smooth manifold is the manifold along with the set of tangent planes taken at all points in it.<br />
* '''Chart''': A local Euclidean coordinate system equipped to a tangent plane. Each tangent plane has its own chart.<br />
* '''Atlas''': A collection of local charts.<br />
<br />
==== Conditions for Feature Mapping to Define an Atlas on a Manifold ====<br />
<br />
To obtain a proper atlas of charts, <math>h</math> must be a local diffeomorphism (locally smooth and invertible). Since the sigmoidal mapping is smooth, <math>\,h</math> is guaranteed to be smooth. To determine injectivity of <math>h\,</math>, consider the following, <math>\forall x_i, x_j \in \mathcal{D}</math>:<br />
<br />
:<math><br />
\begin{align}<br />
h(x_i) = h(x_j) &\Leftrightarrow s\left(Wx_i + b_h\right) = s\left(Wx_j + b_h\right) \\<br />
& \Leftrightarrow Wx_i + b_h = Wx_j + b_h \mbox{, since } s \mbox{ is invertible} \\<br />
& \Leftrightarrow W\Delta_{ij} = 0 \mbox{, where } \Delta_{ij} = x_i - x_j<br />
\end{align}<br />
</math><br />
<br />
Thus, as long as <math>W\,</math> forms a basis spanned by its rows <math>W_k\,</math> such that <math>\forall i,j \,\,\exists \alpha \in \mathbb{R}^{d_h} | \Delta_{ij} = \sum_{k=1}^{d_h}\alpha_k W_k</math>, the injectivity of <math>h\left(x\right)</math> will be preserved (as this would imply <math>\Delta_{ij} = 0\,</math> above). Furthermore, if we restrict the codomain of <math>\,h</math> to <math>h\left(\mathcal{D}\right) \subset \left(0,1\right)^{d_h}</math>, containing only the values obtainable by <math>h\,</math> applied to the training set <math>\mathcal{D}</math>, then <math>\,h</math> is surjective by definition. Therefore, <math>\,h</math> will be a bijection between <math>\mathcal{D}</math> and <math>h\left(\mathcal{D}\right)</math>, meaning that <math>h\,</math> is a local diffeomorphism around each point in the training set.<br />
<br />
==== Generating an Atlas from a Learned Feature Mapping ====<br />
<br />
We now need to determine how to generate local charts around each <math>x \in \mathcal{D}</math>. Since <math>h</math> must be sensitive to changes between <math>x_i</math> and one of its neighbours <math>x_j</math>, but insensitive to other changes, we expect this to be encoded in the spectrum of the Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. Thus, we define a local chart around <math>x</math> using the singular value decomposition of <math>\,J^T(x) = U(x)S(x)V^T(x)</math>. The tangent plane <math>\mathcal{H}_x</math> at <math>\,x</math> is then given by the span of the set of principal singular vectors <math>\mathcal{B}_x</math>, as long as the associated singular value is above a given small <math>\varepsilon\,</math>:<br />
<br />
:<math>\mathcal{B}_x = \{U_{:,k}(x) | S_{k,k}(x) > \varepsilon\} \mbox{ and } \mathcal{H}_x = \{x + v | v \in \mbox{span}\left(\mathcal{B}_x\right)\} </math><br />
<br />
where <math>U_{:,k}(x)\,</math> is the <math>k^{th}</math> column of <math>U\left(x\right)</math>. <br />
<br />
Then, we can define an atlas <math>\mathcal{A}</math> captured by <math>h\,</math>, based on the local linear approximation around each example:<br />
<br />
:<math> \mathcal{A} = \{\left(\mathcal{M}_x, \phi_x\right) | x\in\mathcal{D}, \phi_x\left(\tilde{x}\right) = \mathcal{B}_x\left(x - \tilde{x}\right)\}</math><br />
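<br />
Extracting the local chart basis from a trained encoder then amounts to an SVD of the Jacobian; a hedged sketch (the singular-value cutoff and the number of tangents kept are illustrative, and <code>jacobian</code> is the helper defined earlier):<br />
<br />
<pre>
import numpy as np

def tangent_basis(x, W, b_h, d_M=2, eps=1e-4):
    """Columns of the returned matrix span the local tangent plane H_x at x."""
    J = jacobian(x, W, b_h)                              # (d_h, d), as in the earlier sketch
    U, S, Vt = np.linalg.svd(J.T, full_matrices=False)   # J(x)^T = U S V^T
    keep = [k for k in range(len(S)) if S[k] > eps][:d_M]
    return U[:, keep]                                    # B_x: principal singular vectors
</pre>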
<br />
=== Exploiting Learned Directions for Classification ===<br />
<br />
We would like to use the local charts defined above as additional information for the task of classification. In doing so, we will adopt the '''manifold hypothesis for classification'''.<br />
<br />
==== CAE-Based Tangent Distance ====<br />
<br />
We start by defining the '''tangent distance''' between two points as the distance between their two respective tangent hyperplanes <math>\mathcal{H}_x, \mathcal{H}_y</math> defined above, where the distance between hyperplanes is defined as:<br />
<br />
:<math> d\left(\mathcal{H}_x,\mathcal{H}_y\right) = \mbox{inf}\{\|z - w\|^2\,\, | \left(z,w\right) \in \mathcal{H}_x \times \mathcal{H}_y\}</math><br />
<br />
Finding this distance is a convex problem which can be solved via a system of linear equations.<ref>Simard, P., LeCun, Y., & Denker, J. S. (1993). [http://papers.nips.cc/paper/656-efficient-pattern-recognition-using-a-new-transformation-distance.pdf Efficient pattern recognition using a new transformation distance.] In Advances in neural information processing systems (pp. 50-58).</ref> Minimizing the distance in this way allows <math>x, y \in \mathcal{D}</math> to move along their associated tangent spaces, and the distance is evaluated where <math>x</math> and <math>y</math> are closest. A nearest-neighbour classifier can then be built on this distance.<br />
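<br />
Since any point on a tangent plane can be written as <math>x + \mathcal{B}_x a</math>, minimizing the squared distance over the two coefficient vectors is an ordinary linear least-squares problem; a minimal sketch (illustrative, with the tangent bases passed in as matrices whose columns span each plane):<br />
<br />
<pre>
import numpy as np

def tangent_distance(x, Bx, y, By):
    """d(H_x, H_y): smallest squared distance between the two affine tangent planes."""
    # Points on the planes are x + Bx @ a and y + By @ b; minimize ||(x + Bx a) - (y + By b)||^2.
    A = np.hstack([Bx, -By])                     # unknowns stacked as [a; b]
    coef, *_ = np.linalg.lstsq(A, y - x, rcond=None)
    residual = (x - y) + A @ coef
    return float(residual @ residual)
</pre>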
<br />
==== CAE-Based Tangent Propagation ====<br />
<br />
Nearest-neighbour techniques work in theory, but are often impractical for large-scale datasets: the cost of classifying a test point in this way grows linearly with the number of training points. Neural networks, however, can quickly classify test points once they are trained. We would like the output <math>o</math> of the classifier to be insensitive to variations in the directions of the local chart around <math>x</math>. To this end, we add the following penalty to the objective function of the (supervised) network:<br />
<br />
:<math> \Omega\left(x\right) = \sum_{u \in \mathcal{B}_x} \left|\left| \frac{\partial o}{\partial x}\left(x\right) u \right|\right|^2 </math><br />
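<br />
For a differentiable classifier this penalty is a sum of squared directional derivatives along the tangents; a framework-agnostic sketch using finite differences (the step size is an illustrative choice, and <code>o</code> stands for the network's output function):<br />
<br />
<pre>
import numpy as np

def tangent_prop_penalty(o, x, B_x, h=1e-4):
    """Omega(x) = sum over tangents u of ||(do/dx)(x) u||^2, via central differences."""
    penalty = 0.0
    for k in range(B_x.shape[1]):
        u = B_x[:, k]
        directional = (o(x + h * u) - o(x - h * u)) / (2.0 * h)   # approximates (do/dx) u
        penalty += float(directional @ directional)
    return penalty
</pre>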
<br />
=== The Manifold Tangent Classifier (MTC) ===<br />
<br />
Finally, we are able to put all of the results together into a full algorithm for training a network. The steps follow below:<br />
<br />
# Train (unsupervised) a stack of <math>K\,</math> CAE+H layers as in section 2.2.2. Each layer is trained on the representation learned by the previous layer.<br />
# For each <math>x_i \in \mathcal{D}</math>, compute the Jacobian of the last layer representation <math>J^{(K)}(x_i) = \frac{\partial h^{(K)}}{\partial x}\left(x_i\right)</math> and its SVD. Note that <math>J^{(K)}\,</math> is the product of the Jacobians of each encoder. Store the leading <math>d_M\,</math> singular vectors in <math>\mathcal{B}_{x_i}</math>.<br />
# After the <math>K\,</math> CAE+H layers, add a sigmoidal output layer with a node for each class. Train the entire network for supervised classification, adding in the propagation penalty in 2.4.2. Note that for each <math>x_i, \mathcal{B}_{x_i}</math> contains the set of tangent vectors to use.<br />
<br />
== Related Work == <br />
<br />
There are a number of existing non-linear manifold learning algorithms (e.g. <ref>[http://web.mit.edu/cocosci/Papers/sci_reprint.pdf A Global Geometric Framework for Nonlinear Dimensionality Reduction] Tenenbaum et al., Science (2000)</ref>) that learn the tangent bundle for a set of training points (i.e. the main directions of variation around each point). One drawback of these existing approaches is that they are typically non-parametric and use local parameters to define the tangent plane around each datapoint. This potentially results in manifold learning algorithms that require an amount of training data that grows exponentially with manifold dimension and curvature. <br />
<br />
The semi-supervised embedding algorithm <ref>[http://ronan.collobert.com/pub/matos/2008_deep_icml.pdf Deep learning via semi-supervised embedding] Weston et al., ICML (2008) </ref> is also related in that it encourages the hidden states of a network to be invariant with respect to changes to neighbouring datapoints in the training set. The present work, however, initially aims for representations that are sensitive to such local variations, as explained above. <br />
<br />
== Results ==<br />
<br />
=== Datasets Considered ===<br />
<br />
The MTC was tested on the following datasets:<br />
<br />
*'''MNIST''': Set of 28 by 28 images of handwritten digits, and the goal is to predict the digit contained in the image.<br />
*'''Reuters Corpus Volume I''': Contains 800,000 real-world news stories. Used the 2000 most frequent words calculated on the whole dataset to create a bag-of-words representation.<br />
*'''CIFAR-10''': Dataset of 70,000 32 by 32 RGB real-world images. <br />
*'''Forest Cover Type''': Large-scale database of cartographic variables for prediction of forest cover types.<br />
<br />
=== Method ===<br />
<br />
To investigate the improvements made by CAE-learned tangents, the following method is employed: optimal hyper-parameters (e.g. <math>\gamma, \lambda\,,</math> etc.) are selected by cross-validation on a validation set disjoint from the training set. The quality of the features extracted by the CAE is evaluated by initializing a standard multi-layer perceptron with the same parameters as the trained CAE and fine-tuning it by backpropagation on the supervised task.<br />
<br />
=== Visualization of Learned Tangents === <br />
<br />
Figure 1 visualizes the tangents learned by CAE. The example is on the left, and 8 tangents are shown to the right. On the MNIST dataset, the tangents are small geometric transformations. For CIFAR-10, the tangents appear to be parts of the image. For Reuters, the tangents correspond to addition/removal of similar words, with the positive terms in green and the negative terms in red. We see that the tangents do not seem to change the class of the example (e.g. the tangents of the above "0" in MNIST all resemble zeroes).<br />
<br />
[[File:Figure_1_MTC.png|frame|center|Fig. 1: Tangents Extracted by CAE]]<br />
<br />
=== MTC in Semi-Supervised Setting ===<br />
<br />
The MTC method was evaluated on the MNIST dataset in a semi-supervised setting: the unsupervised feature extractor is trained on the full training set, and the supervised classifier is trained on only a restricted label set. Table 1 reports results for a single-layer perceptron initialized with CAE+H pretraining (abbreviated CAE) and for the same classifier with tangent propagation added (i.e. MTC). The performance is compared to methods that do not exploit the semi-supervised learning hypothesis (Support Vector Machines (SVM), Neural Networks (NN), Convolutional Neural Networks (CNN)); those methods perform poorly relative to MTC, especially when labeled data is scarce. <br />
<br />
{| class="wikitable"<br />
|+Table 1: Semi-Supervised classification error on MNIST test set<br />
|-<br />
|'''# Labeled'''<br />
|'''NN'''<br />
|'''SVM'''<br />
|'''CNN'''<br />
|'''CAE'''<br />
|'''MTC'''<br />
|-<br />
|100<br />
|25.81<br />
|23.44<br />
|22.98<br />
|13.47<br />
|'''12.03'''<br />
|-<br />
|600<br />
|11.44<br />
|8.85<br />
|7.68<br />
|6.3<br />
|'''5.13'''<br />
|-<br />
|1000<br />
|10.7<br />
|7.77<br />
|6.45<br />
|4.77<br />
|'''3.64'''<br />
|-<br />
|3000<br />
|6.04<br />
|4.21<br />
|3.35<br />
|3.22<br />
|'''2.57''' <br />
|}<br />
<br />
=== MTC in Full Classification Problems ===<br />
<br />
We consider using MTC to classify using the full MNIST dataset (i.e. the fully supervised problem), and compare with other methods. The CAE used for tangent discovery is a two-layer deep network with 2000 units per-layer pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained with tangent propagation, using <math>d_M = 15\,</math> tangents. The MTC produces state-of-the-art results, achieving a 0.81% error on the test set (as opposed to the previous state-of-the-art result of 0.95% error, achieved by Deep Boltzmann Machines). Table 2 summarizes this result. Note that MTC also beats out CNN, which utilizes prior knowledge about vision using convolutions and pooling.<br />
<br />
{| class="wikitable"<br />
|+Table 2: Class. error on MNIST Test Set with full Training Set<br />
|-<br />
|K-NN<br />
|NN<br />
|SVM<br />
|CAE<br />
|DBM<br />
|CNN<br />
|MTC<br />
|-<br />
|3.09%<br />
|1.60%<br />
|1.40%<br />
|1.04%<br />
|0.95%<br />
|0.95%<br />
|'''0.81'''%<br />
|}<br />
<br />
A 4-layer MTC was trained on the Forest CoverType dataset. The MTC produces the best performance on this classification task, beating out the previous best method which used a mixture of non-linear SVMs (denoted as distributed SVM).<br />
<br />
{| class="wikitable"<br />
|+Table 3: Class. error on Forest Data<br />
|-<br />
|SVM<br />
|Distributed SVM<br />
|MTC<br />
|-<br />
|4.11%<br />
|3.46%<br />
|'''3.13'''%<br />
|}<br />
<br />
== Conclusion ==<br />
<br />
This paper brings three common generic prior hypotheses together in a unified manner. It uses a semi-supervised manifold approach to learn local charts around training points in the data, and then uses the tangents generated by these local charts within the classifier. The tangents that are extracted appear to be meaningful decompositions of the training examples. When the tangents are combined with the classifier, state-of-the-art results are obtained on classification problems in a variety of domains.<br />
<br />
== Discussion ==<br />
<br />
* I thought about how it could be possible to use an element-wise rectified linear unit <math>R\left(x\right) = \mbox{max}\left(0,x\right)</math> in place of the sigmoidal function for encoding, as this type of functional form has seen success in other deep learning methods. However, I believe that this type of functional form would preclude <math>h</math> from being diffeomorphic, as the <math>x</math>-values that are negative could not possibly be reconstructed. Thus, the sigmoidal form should likely be retained, although it would be interesting to see how other invertible non-linearities would perform (e.g. hyperbolic tangent).<br />
<br />
* It would be interesting to investigate applying the method of tangent extraction to other unsupervised methods, and then create a classifier based on these tangents in the same way that it is done in this paper. Further work could be done to apply this approach to clustering algorithms, kernel PCA, E-M, etc. This is more of a suggestion than a concrete idea, however.<br />
<br />
* It is not exactly clear to me how a <math>h</math> could ever define a true diffeomorphism, since <math>h: \mathbb{R}^{d} \mapsto \mathbb{R}^{d_h}</math>, where <math>d \ne d_h</math>, in general. Clearly, if <math>d > d_h</math>, such a map could not be injective. However, they may be able to "manufacture" the injectivity of <math>h</math> using the fact that <math>\mathcal{D}</math> is a discrete set of points. I'm not sure that this approach defines a continuous manifold, but I'm also not sure if that really matters in this case.<br />
<br />
== Bibliography ==<br />
<references /></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=the_Manifold_Tangent_Classifier&diff=27359the Manifold Tangent Classifier2015-12-19T00:30:16Z<p>Derek: /* Discussion */</p>
<hr />
<div>== Introduction ==<br />
<br />
The goal in many machine learning problems is to extract information from data with minimal prior knowledge<ref name = "main"> Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., & Muller, X. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2011_1240.pdf The manifold tangent classifier.] In Advances in Neural Information Processing Systems (pp. 2294-2302). </ref> These algorithms are designed to work on numerous problems which they may not be specifically tailored towards, thus domain-specific knowledge is generally not incorporated into the models. However, some generic "prior" hypotheses are considered to aid in the general task of learning, and three very common ones are presented below:<br />
<br />
# The '''semi-supervised learning hypothesis''': This states that knowledge of the input distribution <math>p\left(x\right)</math> can aid in learning the output distribution <math>p\left(y|x\right)</math> .<ref>Lasserre, J., Bishop, C. M., & Minka, T. P. (2006, June). [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1640745 Principled hybrids of generative and discriminative models.] In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 87-94). IEEE.</ref> This hypothesis lends credence to not only the theory of strict semi-supervised learning, but also unsupervised pretraining as a method of feature extraction.<ref> Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). [http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.7.1527 A fast learning algorithm for deep belief nets.] Neural computation, 18(7), 1527-1554.</ref><ref>Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. (2010). [http://delivery.acm.org/10.1145/1760000/1756025/p625-erhan.pdf?ip=129.97.89.222&id=1756025&acc=PUBLIC&key=FD0067F557510FFB%2E9219CF56F73DCF78%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=561475515&CFTOKEN=96787671&__acm__=1447710319_1ea806f74c2b3b6959e97d9d0e03d533 Why does unsupervised pre-training help deep learning?.] The Journal of Machine Learning Research, 11, 625-660.</ref><br />
# The '''unsupervised manifold hypothesis''': This states that real-world data presented in high-dimensional spaces is likely to concentrate around a low-dimensional sub-manifold.<ref>Cayton, L. (2005). [http://www.vis.lbl.gov/~romano/mlgroup/papers/manifold-learning.pdf Algorithms for manifold learning.] Univ. of California at San Diego Tech. Rep, 1-17.</ref><br />
# The '''manifold hypothesis for classification''': This states that points of different classes are likely to concentrate along different sub-manifolds, separated by low-density regions of the input space.<ref name = "main"></ref><br />
<br />
The recently-proposed Contractive Auto-Encoder (CAE) algorithm has shown success in the task of unsupervised feature extraction,<ref name = "CAE">Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). [http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf Contractive auto-encoders: Explicit invariance during feature extraction.] In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 833-840).</ref> with its successful application in pre-training of Deep Neural Networks (DNN) an illustration of the merits of adopting '''Hypothesis 1'''. CAE also yields a mostly contractive mapping that is locally only sensitive to a few input directions, which implies that it models a lower-dimensional manifold (exploiting '''Hypothesis 2''') since the directions of sensitivity are in the tangent space of the manifold. <br />
<br />
This paper extends the previous work by exploiting the information about the tangent spaces in light of '''Hypothesis 3''': it extracts basis vectors for the local tangent space around each training point from the parameters of the CAE. Existing supervised classification algorithms that exploit tangent directions as domain-specific prior knowledge can then be applied to the tangent spaces generated by the CAE when fine-tuning the overall classification network. This approach seamlessly integrates all three of the above hypotheses and produced record-breaking results (as of 2011) on image classification.<br />
<br />
== Contractive Auto-Encoders (CAE) and Tangent Classification ==<br />
<br />
The problem is to find a non-linear feature extractor for a dataset <math>\mathcal{D} = \{x_1, \ldots, x_n\}</math>, where <math>x_i \in \mathbb{R}^d</math> are i.i.d. samples from an unknown distribution <math> p\left(x\right)</math>.<br />
<br />
=== Traditional Auto-Encoders === <br />
<br />
A traditional auto-encoder learns an '''encoder''' function <math>h: \mathbb{R}^d \rightarrow \mathbb{R}^{d_h}</math> along with a '''decoder''' function <math>g: \mathbb{R}^{d_h} \rightarrow \mathbb{R}^{d}</math>, with the reconstruction given by <math>r = g\left(h\left(x\right)\right) </math>. <math>h\,</math> maps the input <math>x\,</math> to the hidden representation space, and <math>g\,</math> reconstructs <math>x\,</math> from that representation. Where <math>L\left(x,g\left(h\left(x\right)\right)\right)</math> denotes the reconstruction error for a single example, the objective function minimized to learn the parameters <math>\theta\,</math> of the encoder/decoder is as follows:<br />
<br />
:<math> \mathcal{J}_{AE}\left(\theta\right) = \sum_{x\in\mathcal{D}}L\left(x,g\left(h\left(x\right)\right)\right) </math><br />
<br />
The form of the '''encoder''' is <math>h\left(x\right) = s\left(Wx + b_h\right)</math>, where <math>s\left(z\right) = \frac{1}{1 + e^{-z}}</math> is the element-wise logistic sigmoid. <math>W \in \mathbb{R}^{d_h \times d} </math> and <math>b_h \in \mathbb{R}^{d_h}</math> are the parameters (weight matrix and bias vector, respectively). The form of the '''decoder''' is <math>r = g\left(h\left(x\right)\right) = s_2\left(W^Th\left(x\right)+b_r\right)</math>, where <math>\,s_2 = s</math> or the identity. The weight matrix <math>W^T\,</math> is shared with the encoder, with the only new parameter being the bias vector <math>b_r \in \mathbb{R}^d</math>.<br />
<br />
The '''loss function''' can either be the squared error <math>L\left(x,r\right) = \|x - r\|^2</math> or the Bernoulli cross-entropy, given by: <br />
<br />
:<math> L\left(x, r\right) = -\sum_{i=1}^d \left[x_i \mbox{log}\left(r_i\right) + \left(1 - x_i\right)\mbox{log}\left(1 - r_i\right)\right]</math><br />
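<br />
For concreteness, the following is a minimal NumPy sketch of this tied-weight encoder/decoder and the cross-entropy reconstruction loss; the dimensions and random initialization are illustrative assumptions rather than values from the paper.<br />
<br />
<pre>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, d_h = 784, 256                           # illustrative input / hidden sizes
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((d_h, d))    # shared (tied) weight matrix
b_h = np.zeros(d_h)                         # encoder bias
b_r = np.zeros(d)                           # decoder bias

def encode(x):
    # h(x) = s(Wx + b_h)
    return sigmoid(W @ x + b_h)

def reconstruct(x):
    # r = g(h(x)) = s(W^T h(x) + b_r), with tied weights
    return sigmoid(W.T @ encode(x) + b_r)

def cross_entropy(x, r, eps=1e-12):
    # Bernoulli cross-entropy reconstruction error
    return -np.sum(x * np.log(r + eps) + (1 - x) * np.log(1 - r + eps))

x = rng.random(d)                           # a dummy input in [0, 1]^d
print(cross_entropy(x, reconstruct(x)))
</pre>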
<br />
=== First- and Higher-Order Contractive Auto-Encoders ===<br />
<br />
==== Additional Penalty on Jacobian ==== <br />
<br />
The Contractive Auto-Encoder (CAE), proposed by Rifai et al.<ref name = "CAE"></ref>, encourages robustness of <math>h\left(x\right)</math> to small variations in <math>x</math> by penalizing the Frobenius norm of the encoder's Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. The new objective function to be minimized is:<br />
<br />
:<math> \mathcal{J}_{CAE}\left(\theta\right) = \sum_{x\in\mathcal{D}}\left[L\left(x,g\left(h\left(x\right)\right)\right) + \lambda\|J\left(x\right)\|_F^2\right] </math><br />
<br />
where <math>\lambda</math> is a non-negative regularization parameter. We can compute the <math>j^{th}</math> row of the Jacobian of the sigmoidal encoder quite easily using the <math>j^{th}</math> row of <math>W</math>:<br />
<br />
:<math> J\left(x\right)_j = \frac{\partial h_j\left(x\right)}{\partial x} = h_j\left(x\right)\left(1 - h_j\left(x\right)\right)W_j</math><br />
<br />
==== Additional Penalty on Hessian ====<br />
<br />
It is also possible to penalize higher-order derivatives by approximating the Hessian (explicit computation of the Hessian is costly). It is sufficient to penalize the difference between <math>J\left(x\right)</math> and <math>J\left(x + \varepsilon\right)</math> where <math>\,\varepsilon </math> is small, as this represents the rate of change of the Jacobian. This yields the "CAE+H" variant, with objective function as follows:<br />
<br />
:<math> \mathcal{J}_{CAE+H}\left(\theta\right) = \mathcal{J}_{CAE}\left(\theta\right) + \gamma\sum_{x \in \mathcal{D}}\mathbb{E}_{\varepsilon\sim\mathcal{N}\left(0,\sigma^2I\right)} \left[\|J\left(x\right) - J\left(x + \varepsilon\right)\|^2\right] </math><br />
<br />
The expectation above, in practice, is taken over stochastic samples of the noise variable <math>\varepsilon\,</math> at each stochastic gradient descent step. <math>\gamma\,</math> is another regularization parameter. This formulation will be the one used within this paper.<br />
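<br />
In practice both penalties can be computed from the encoder Jacobian, which has the closed form given in the previous subsection; below is a rough NumPy sketch of the first-order penalty and of the Monte-Carlo estimate of the Hessian penalty (the sizes, noise level and number of noise samples are arbitrary assumptions).<br />
<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)
d, d_h, sigma = 784, 256, 0.1               # illustrative sizes and noise level
W = 0.01 * rng.standard_normal((d_h, d))
b_h = np.zeros(d_h)

def encode(x):
    return 1.0 / (1.0 + np.exp(-(W @ x + b_h)))

def jacobian(x):
    # dh/dx for the sigmoid encoder: row j is h_j(1 - h_j) W_j
    h = encode(x)
    return (h * (1.0 - h))[:, None] * W

def cae_h_penalties(x, n_samples=4):
    # first-order term ||J(x)||_F^2 and a Monte-Carlo estimate of
    # E_eps ||J(x) - J(x + eps)||_F^2 with eps ~ N(0, sigma^2 I)
    J = jacobian(x)
    first = np.sum(J ** 2)
    diffs = [np.sum((J - jacobian(x + sigma * rng.standard_normal(d))) ** 2)
             for _ in range(n_samples)]
    return first, np.mean(diffs)

x = rng.random(d)
print(cae_h_penalties(x))   # weighted by lambda and gamma in the full objective
</pre>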
<br />
=== Characterizing the Tangent Bundle Captured by a CAE ===<br />
<br />
Although the regularization term encourages insensitivity of <math>h(x)</math> in all input space directions, the pressure to form an accurate reconstruction counters this somewhat, and the result is that <math>h(x)</math> remains sensitive only to the few input directions necessary to distinguish nearby training points.<ref name = "CAE"></ref> Geometrically, the interpretation is that these directions span the local tangent space of the underlying manifold that characterizes the input data. <br />
<br />
==== Geometric Terms ====<br />
<br />
* '''Tangent Bundle''': The tangent bundle of a smooth manifold is the manifold along with the set of tangent planes taken at all points in it.<br />
* '''Chart''': A local Euclidean coordinate system equipped to a tangent plane. Each tangent plane has its own chart.<br />
* '''Atlas''': A collection of local charts.<br />
<br />
==== Conditions for Feature Mapping to Define an Atlas on a Manifold ====<br />
<br />
To obtain a proper atlas of charts, <math>h</math> must be a local diffeomorphism (locally smooth and invertible). Since the sigmoidal mapping is smooth, <math>\,h</math> is guaranteed to be smooth. To determine injectivity of <math>h\,</math>, consider the following, <math>\forall x_i, x_j \in \mathcal{D}</math>:<br />
<br />
:<math><br />
\begin{align}<br />
h(x_i) = h(x_j) &\Leftrightarrow s\left(Wx_i + b_h\right) = s\left(Wx_j + b_h\right) \\<br />
& \Leftrightarrow Wx_i + b_h = Wx_j + b_h \mbox{, since } s \mbox{ is invertible} \\<br />
& \Leftrightarrow W\Delta_{ij} = 0 \mbox{, where } \Delta_{ij} = x_i - x_j<br />
\end{align}<br />
</math><br />
<br />
Thus, as long as the rows <math>W_k\,</math> of <math>W\,</math> span a subspace containing every difference vector, i.e. <math>\forall i,j \,\,\exists \alpha \in \mathbb{R}^{d_h}</math> such that <math>\Delta_{ij} = \sum_{k=1}^{d_h}\alpha_k W_k</math>, then <math>h\left(x\right)</math> is injective on <math>\mathcal{D}</math> (since <math>W\Delta_{ij} = 0</math> together with <math>\Delta_{ij}</math> lying in the row space of <math>W</math> implies <math>\Delta_{ij} = 0\,</math>). Furthermore, if we restrict the codomain of <math>\,h</math> to <math>h\left(\mathcal{D}\right) \subset \left(0,1\right)^{d_h}</math>, the set of values obtained by applying <math>h\,</math> to the training set <math>\mathcal{D}</math>, then <math>\,h</math> is surjective by definition. Therefore, <math>\,h</math> is a bijection between <math>\mathcal{D}\,</math> and <math>h\left(\mathcal{D}\right)</math>, and so <math>h\,</math> is a local diffeomorphism around each point in the training set.<br />
<br />
==== Generating an Atlas from a Learned Feature Mapping ====<br />
<br />
We now need to determine how to generate local charts around each <math>x \in \mathcal{D}</math>. Since <math>h</math> must be sensitive to changes between <math>x_i</math> and one of its neighbours <math>x_j</math>, but insensitive to other changes, we expect this to be encoded in the spectrum of the Jacobian <math>J\left(x\right) = \frac{\partial h}{\partial x}\left(x\right)</math>. Thus, we define a local chart around <math>x</math> using the singular value decomposition of <math>\,J^T(x) = U(x)S(x)V^T(x)</math>. The tangent plane <math>\mathcal{H}_x</math> at <math>\,x</math> is then given by the span of the set of principal singular vectors <math>\mathcal{B}_x</math>, as long as the associated singular value is above a given small <math>\varepsilon\,</math>:<br />
<br />
:<math>\mathcal{B}_x = \{U_{:,k}(x) | S_{k,k}(x) > \varepsilon\} \mbox{ and } \mathcal{H}_x = \{x + v | v \in \mbox{span}\left(\mathcal{B}_x\right)\} </math><br />
<br />
where <math>U_{:,k}(x)\,</math> is the <math>k^{th}</math> column of <math>U\left(x\right)</math>. <br />
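<br />
A minimal sketch of extracting the chart basis <math>\mathcal{B}_x</math> by SVD of the transposed Jacobian; the toy Jacobian and the threshold below are purely illustrative.<br />
<br />
<pre>
import numpy as np

def tangent_basis(J, eps=1e-3):
    # J is the d_h x d encoder Jacobian at a training point x.
    # The left singular vectors of J^T live in input space; keep those
    # whose singular value exceeds the small threshold eps.
    U, S, Vt = np.linalg.svd(J.T, full_matrices=False)
    return U[:, S > eps]          # columns form B_x, the local chart basis

# toy example: a "Jacobian" with only a few dominant directions
rng = np.random.default_rng(0)
J = 1e-4 * rng.standard_normal((256, 784))
J[:3] += rng.standard_normal((3, 784))    # make three directions dominate
B_x = tangent_basis(J, eps=0.5)
print(B_x.shape)                           # (784, k) with k small
</pre>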
<br />
Then, we can define an atlas <math>\mathcal{A}</math> captured by <math>h\,</math>, based on the local linear approximation around each example:<br />
<br />
:<math> \mathcal{A} = \{\left(\mathcal{M}_x, \phi_x\right) | x\in\mathcal{D}, \phi_x\left(\tilde{x}\right) = \mathcal{B}_x\left(x - \tilde{x}\right)\}</math><br />
<br />
=== Exploiting Learned Directions for Classification ===<br />
<br />
We would like to use the local charts defined above as additional information for the task of classification. In doing so, we will adopt the '''manifold hypothesis for classification'''.<br />
<br />
==== CAE-Based Tangent Distance ====<br />
<br />
We start by defining the '''tangent distance''' between two points as the distance between their respective tangent hyperplanes <math>\mathcal{H}_x, \mathcal{H}_y</math> defined above, where the distance between hyperplanes is:<br />
<br />
:<math> d\left(\mathcal{H}_x,\mathcal{H}_y\right) = \mbox{inf}\{\|z - w\|^2\,\, | \left(z,w\right) \in \mathcal{H}_x \times \mathcal{H}_y\}</math><br />
<br />
Finding this distance is a convex problem that reduces to solving a system of linear equations.<ref>Simard, P., LeCun, Y., & Denker, J. S. (1993). [http://papers.nips.cc/paper/656-efficient-pattern-recognition-using-a-new-transformation-distance.pdf Efficient pattern recognition using a new transformation distance.] In Advances in neural information processing systems (pp. 50-58).</ref> Minimizing the distance in this way allows <math>x, y \in \mathcal{D}</math> to move along their associated tangent spaces, so that the distance is evaluated where <math>x</math> and <math>y</math> are closest. A nearest-neighbour classifier can then be built on this distance.<br />
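<br />
As a rough illustration (not necessarily the exact solver used in the cited work), the minimization can be posed as a linear least-squares problem over the tangent coefficients; the orthonormal bases below are stand-ins for the <math>\mathcal{B}_x</math> extracted earlier.<br />
<br />
<pre>
import numpy as np

def tangent_distance(x, y, B_x, B_y):
    # Minimize ||(x + B_x a) - (y + B_y b)||^2 over coefficients a, b,
    # i.e. the least-squares problem  [B_x, -B_y] [a; b] ~= y - x.
    A = np.hstack([B_x, -B_y])
    coeffs, *_ = np.linalg.lstsq(A, y - x, rcond=None)
    k = B_x.shape[1]
    residual = (x + B_x @ coeffs[:k]) - (y + B_y @ coeffs[k:])
    return np.sum(residual ** 2)

rng = np.random.default_rng(0)
d, k = 784, 3                                          # illustrative dimensions
x, y = rng.random(d), rng.random(d)
B_x = np.linalg.qr(rng.standard_normal((d, k)))[0]     # toy orthonormal tangent bases
B_y = np.linalg.qr(rng.standard_normal((d, k)))[0]
# the tangent distance is never larger than the plain squared distance
print(tangent_distance(x, y, B_x, B_y) <= np.sum((x - y) ** 2))
</pre>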
<br />
==== CAE-Based Tangent Propagation ====<br />
<br />
Nearest-neighbour techniques work in theory, but are often impractical for large-scale datasets: the cost of classifying a test point in this way grows linearly with the number of training points. Neural networks, however, can quickly classify test points once they are trained. We would like the output <math>o</math> of the classifier to be insensitive to variations along the directions of the local chart around <math>x</math>. To this end, we add the following penalty to the objective function of the (supervised) network:<br />
<br />
:<math> \Omega\left(x\right) = \sum_{u \in \mathcal{B}_x} \left|\left| \frac{\partial o}{\partial x}\left(x\right) u \right|\right|^2 </math><br />
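<br />
A small sketch of this penalty for a toy linear-softmax classifier, where the Jacobian-vector products <math>\frac{\partial o}{\partial x}\left(x\right) u</math> can be written in closed form; in a deep network these products would be obtained by backpropagation, and all sizes below are illustrative.<br />
<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)
d, n_classes, k = 784, 10, 3
V = 0.01 * rng.standard_normal((n_classes, d))   # toy linear classifier weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tangent_prop_penalty(x, B_x):
    # Omega(x) = sum over tangent vectors u of || (do/dx) u ||^2.
    # For o = softmax(Vx), the Jacobian is (diag(o) - o o^T) V, so the
    # Jacobian-vector product with u is (diag(o) - o o^T) (V u).
    o = softmax(V @ x)
    penalty = 0.0
    for u in B_x.T:                              # columns of B_x are the tangent vectors
        Ju = (np.diag(o) - np.outer(o, o)) @ (V @ u)
        penalty += np.sum(Ju ** 2)
    return penalty

x = rng.random(d)
B_x = np.linalg.qr(rng.standard_normal((d, k)))[0]   # stand-in tangent basis
print(tangent_prop_penalty(x, B_x))   # added to the supervised loss during fine-tuning
</pre>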
<br />
=== The Manifold Tangent Classifier (MTC) ===<br />
<br />
Finally, we are able to put all of the results together into a full algorithm for training a network. The steps follow below:<br />
<br />
# Train (unsupervised) a stack of <math>K\,</math> CAE+H layers as in section 2.2.2. Each layer is trained on the representation learned by the previous layer.<br />
# For each <math>x_i \in \mathcal{D}</math>, compute the Jacobian of the last layer representation <math>J^{(K)}(x_i) = \frac{\partial h^{(K)}}{\partial x}\left(x_i\right)</math> and its SVD. Note that <math>J^{(K)}\,</math> is the product of the Jacobians of each encoder. Store the leading <math>d_M\,</math> singular vectors in <math>\mathcal{B}_{x_i}</math>.<br />
# After the <math>K\,</math> CAE+H layers, add a sigmoidal output layer with a node for each class. Train the entire network for supervised classification, adding in the propagation penalty in 2.4.2. Note that for each <math>x_i, \mathcal{B}_{x_i}</math> contains the set of tangent vectors to use.<br />
<br />
== Related Work == <br />
<br />
There are a number of existing non-linear manifold learning algorithms (e.g. <ref>[http://web.mit.edu/cocosci/Papers/sci_reprint.pdf A Global Geometric Framework for Nonlinear Dimensionality Reduction] Tenenbaum et al., Science (2000)</ref>) that learn the tangent bundle for a set of training points (i.e. the main directions of variation around each point). One drawback of these existing approaches is that they are typically non-parametric and use local parameters to define the tangent plane around each datapoint. This potentially results in manifold learning algorithms that require an amount of training data that grows exponentially with manifold dimension and curvature. <br />
<br />
The semi-supervised embedding algorithm <ref>[http://ronan.collobert.com/pub/matos/2008_deep_icml.pdf Deep learning via semi-supervised embedding] Weston et al., ICML (2008) </ref> is also related in that it encourages the hidden states of a network to be invariant with respect to changes between neighbouring datapoints in the training set. The present work, however, initially aims for representations that are sensitive to such local variations, as explained above. <br />
<br />
== Results ==<br />
<br />
=== Datasets Considered ===<br />
<br />
The MTC was tested on the following datasets:<br />
<br />
*'''MNIST''': Set of 28 by 28 images of handwritten digits, and the goal is to predict the digit contained in the image.<br />
*'''Reuters Corpus Volume I''': Contains 800,000 real-world news stories. Used the 2000 most frequent words calculated on the whole dataset to create a bag-of-words representation.<br />
*'''CIFAR-10''': Dataset of 60,000 32 by 32 RGB real-world images. <br />
*'''Forest Cover Type''': Large-scale database of cartographic variables for prediction of forest cover types.<br />
<br />
=== Method ===<br />
<br />
To investigate the improvements made by CAE-learned tangents, the following method is employed: optimal hyper-parameters (e.g. <math>\gamma, \lambda\,,</math> etc.) were selected by cross-validation on a validation set disjoint from the training set. The quality of the features extracted by the CAE is evaluated by initializing a standard multi-layer perceptron network with the same parameters as the trained CAE and fine-tuning it by backpropagation on the supervised task.<br />
<br />
=== Visualization of Learned Tangents === <br />
<br />
Figure 1 visualizes the tangents learned by CAE. The example is on the left, and 8 tangents are shown to the right. On the MNIST dataset, the tangents are small geometric transformations. For CIFAR-10, the tangents appear to be parts of the image. For Reuters, the tangents correspond to addition/removal of similar words, with the positive terms in green and the negative terms in red. We see that the tangents do not seem to change the class of the example (e.g. the tangents of the above "0" in MNIST all resemble zeroes).<br />
<br />
[[File:Figure_1_MTC.png|frame|center|Fig. 1: Tangents Extracted by CAE]]<br />
<br />
=== MTC in Semi-Supervised Setting ===<br />
<br />
The MTC method was evaluated on the MNIST dataset in a semi-supervised setting: the unsupervised feature extractor is trained on the full training set, and the supervised classifier is only trained on a restricted label set. Table 1 reports results for a single-layer perceptron initialized with CAE+H pretraining (abbreviated CAE) and for the same classifier with tangent propagation added (i.e. the MTC). The performance is compared to other methods that do not exploit the semi-supervised learning hypothesis (Support Vector Machines (SVM), Neural Networks (NN), Convolutional Neural Networks (CNN)); those methods perform poorly relative to the MTC, especially when few labels are available. <br />
<br />
{| class="wikitable"<br />
|+Table 1: Semi-Supervised classification error on MNIST test set<br />
|-<br />
|'''# Labeled'''<br />
|'''NN'''<br />
|'''SVM'''<br />
|'''CNN'''<br />
|'''CAE'''<br />
|'''MTC'''<br />
|-<br />
|100<br />
|25.81<br />
|23.44<br />
|22.98<br />
|13.47<br />
|'''12.03'''<br />
|-<br />
|600<br />
|11.44<br />
|8.85<br />
|7.68<br />
|6.3<br />
|'''5.13'''<br />
|-<br />
|1000<br />
|10.7<br />
|7.77<br />
|6.45<br />
|4.77<br />
|'''3.64'''<br />
|-<br />
|3000<br />
|6.04<br />
|4.21<br />
|3.35<br />
|3.22<br />
|'''2.57''' <br />
|}<br />
<br />
=== MTC in Full Classification Problems ===<br />
<br />
We consider using MTC to classify using the full MNIST dataset (i.e. the fully supervised problem), and compare with other methods. The CAE used for tangent discovery is a two-layer deep network with 2000 units per-layer pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained with tangent propagation, using <math>d_M = 15\,</math> tangents. The MTC produces state-of-the-art results, achieving a 0.81% error on the test set (as opposed to the previous state-of-the-art result of 0.95% error, achieved by Deep Boltzmann Machines). Table 2 summarizes this result. Note that MTC also beats out CNN, which utilizes prior knowledge about vision using convolutions and pooling.<br />
<br />
{| class="wikitable"<br />
|+Table 2: Class. error on MNIST Test Set with full Training Set<br />
|-<br />
|K-NN<br />
|NN<br />
|SVM<br />
|CAE<br />
|DBM<br />
|CNN<br />
|MTC<br />
|-<br />
|3.09%<br />
|1.60%<br />
|1.40%<br />
|1.04%<br />
|0.95%<br />
|0.95%<br />
|'''0.81'''%<br />
|}<br />
<br />
A 4-layer MTC was trained on the Forest CoverType dataset. The MTC produces the best performance on this classification task, beating out the previous best method which used a mixture of non-linear SVMs (denoted as distributed SVM).<br />
<br />
{| class="wikitable"<br />
|+Table 3: Class. error on Forest Data<br />
|-<br />
|SVM<br />
|Distributed SVM<br />
|MTC<br />
|-<br />
|4.11%<br />
|3.46%<br />
|'''3.13'''%<br />
|}<br />
<br />
== Conclusion ==<br />
<br />
This paper brings three common generic prior hypotheses together in a unified model. It uses a semi-supervised manifold approach to learn local charts around training points in the data, and then uses the tangents generated by these local charts both to define a tangent distance between examples and to regularize the supervised classifier. The tangents that are generated appear to be meaningful decompositions of the training examples. When the tangents are combined with the classifier, state-of-the-art results are obtained on classification problems in a variety of domains.<br />
<br />
== Discussion ==<br />
<br />
* I thought about how it could be possible to use an element-wise rectified linear unit <math>R\left(x\right) = \mbox{max}\left(0,x\right)</math> in place of the sigmoidal function for encoding, as this type of functional form has seen success in other deep learning methods. However, I believe that this type of functional form would preclude <math>h</math> from being diffeomorphic, as the <math>x</math>-values that are negative could not possibly be reconstructed. Thus, the sigmoidal form should likely be retained, although it would be interesting to see how other invertible non-linearities would perform (e.g. hyperbolic tangent).<br />
<br />
* It would be interesting to investigate applying the method of tangent extraction to other unsupervised methods, and then create a classifier based on these tangents in the same way that it is done in this paper. Further work could be done to apply this approach to clustering algorithms, kernel PCA, E-M, etc. This is more of a suggestion than a concrete idea, however.<br />
<br />
* It is not exactly clear to me how a <math>h</math> could ever define a true diffeomorphism, since <math>h: \mathbb{R}^{d} \mapsto \mathbb{R}^{d_h}</math>, where <math>d \ne d_h</math>, in general. Clearly, if <math>d > d_h</math>, we would not expect such a map <math>h</math> to possibly be injective. However, they may be able to "manufacture" the injectivity of <math>h</math> using the fact that <math>\mathcal{D}</math> is a discrete set of points. I'm not sure that this approach defines a continuous manifold, but I'm also not sure if that really matters in this case.<br />
<br />
== Bibliography ==<br />
<references /></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_Number_of_Linear_Regions_of_Deep_Neural_Networks&diff=27065on the Number of Linear Regions of Deep Neural Networks2015-12-04T03:45:51Z<p>Derek: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
<br />
The paper seeks to answer the question of why deep neural networks perform so much better than shallow neural networks. It is not obvious that they should perform any better: for example, Funahashi (1989) showed that a neural network with just one hidden layer is a universal function approximator (given sufficiently many neurons), so the class of functions a deep neural network can approximate cannot be larger. Furthermore, having many layers can theoretically cause problems due to vanishing gradients.<br />
<br />
As both shallow and deep neural networks can approximate the same class of functions, another method of comparison is needed. For this we have to consider what such networks with piecewise linear activations actually do: they split the input space into linear regions. Deep neural networks produce more regions than shallow ones with the same number of neurons, which allows them to build more complex function approximations. Essentially, after the first layer partitions the input space into linear pieces, each subsequent layer re-uses its computation across those pieces, so the composition of layers identifies an exponential number of input regions. This re-use of the same computation across different regions of the input space is what the deep hierarchy provides.<br />
<br />
[[File:montifar1.png]]<br />
<br />
= Shallow Neural Networks =<br />
<br />
First, an upper bound on the number of regions a shallow neural network produces is derived. This gives not only a measure of the approximation complexity possible with a shallow neural network, but will also be used to obtain the number of regions for deep neural networks.<br />
<br />
The hidden layer of a shallow neural network with <math>n_0</math> inputs and <math>n_1</math> hidden units essentially computes <math>\mathbf{x} \mapsto g(\mathbf{W}\mathbf{x} + \mathbf{b})</math> with input <math>\mathbf{x}</math>, weight matrix <math>\mathbf{W}</math>, bias vector <math>\mathbf{b}</math>, and non-linearity <math>\, g</math>. If <math>g</math> has its kink at 0 (as for a rectifier) or an inflection point at 0 (as for a sigmoidal unit), this gives a distinguished behaviour where <math>\mathbf{W}_{i,:}\mathbf{x} + \mathbf{b}_i = 0</math>, which can act as a decision boundary and defines a hyperplane for each hidden unit <math>i</math>.<br />
<br />
Let us consider the set <math>H_i := \{\mathbf{x} \in \mathbb{R}^{n_0}: \mathbf{W}_{i,:}\mathbf{x} <br />
+ \mathbf{b}_i = 0\}</math> of all those hyperplanes (<math>i \in [n_1]</math>). This set splits the input space in several regions (formally defined as a connected component of <math>\mathbb{R}^{n_0} <br />
\setminus (\cup_i H_i)</math>).<br />
<br />
[[File:hyperplanes.png]]<br />
<br />
With <math>n_1</math> hyperplanes (in general position) there will be at most <math>\sum_{j=0}^{n_0} \binom{n_1}{j}</math> regions.<br />
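<br />
This bound is straightforward to evaluate; a small sketch follows (the example sizes are arbitrary).<br />
<br />
<pre>
from math import comb

def max_regions_shallow(n0, n1):
    # Maximal number of regions that n1 hyperplanes in general position
    # cut out of R^{n0}: sum_{j=0}^{n0} C(n1, j).
    return sum(comb(n1, j) for j in range(n0 + 1))

print(max_regions_shallow(2, 4))    # 11 regions for 4 lines in the plane
</pre>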
<br />
= Deep Neural Networks =<br />
<br />
A hidden layer <math>l</math> of a deep neural network computes a function <math>h_l</math> which maps a set <math>S_{l-1} \in \mathbb{R}^{n_{l-1}}</math> to another set <math>S_{l} \in <br />
\mathbb{R}^{n_l}</math>. In this mapping there might be subsets <math>\bar{R}_1, \dots, <br />
\bar{R}_k \subseteq S_{l-1}</math> that get mapped onto the same subset <math>R \subseteq <br />
S_l</math>, i.e. <math>h_l(\bar{R}_1) = \cdots = h_l(\bar{R}_k) = R</math>. The set of all these subsets is denoted with <math>P_R^l</math>.<br />
<br />
[[File:sets.png]]<br />
<br />
This allows us to define the number of separate input-space neighbourhoods mapped onto a common neighbourhood <math>R</math>. For each subset <math>\bar{R}_i</math> that maps to <math>R</math> we have to add up the number of subsets mapping to <math>\bar{R}_i</math>, giving the recursive formula <math>\mathcal{N}_R^l = \sum_{R' \in P_R^l} \mathcal{N}_{R'}^{l-1}</math> with <math>\mathcal{N}_R^0 = 1</math> for each region <math>R \subseteq \mathbb{R}^{n_0}</math> in the input space. Applying this formula to each distinct linear region computed by the last hidden layer, a set denoted by <math>P^L</math>, we get the maximal number of linear regions of the functions computed by an <math>L</math>-layer neural network with piecewise linear activations as <math>\mathcal{N} = \sum_{R \in P^L} \mathcal{N}_R^{L-1} \text{.}</math><br />
<br />
= Space Folding =<br />
<br />
An intuition for the process of mapping input-space neighbourhoods to common neighbourhoods can be given in terms of space folding. Each such mapping can be seen as folding the input space so that the input-space neighbourhoods are overlaid. Thus, each hidden layer of a deep neural network can be associated with a folding operator, and any function computed on the final folded space will be applied to all regions successively folded onto each other. Note that the foldings are encoded in the weight matrix <math>\mathbf{W}</math>, bias vector <math>\mathbf{b}</math> and activation function <math>g</math>. This allows for foldings that are not aligned with the coordinate axes and that need not preserve lengths.<br />
<br />
[[File:montifar2.png]]<br />
[[File:montifar3.png]]<br />
<br />
= Deep Rectifier Networks =<br />
<br />
To obtain a lower bound on the maximal number of linear regions computable by a deep rectifier network, a network is constructed so that the number of linear regions mapped onto each other is maximized. Each of the <math>n</math> units in a layer of rectifiers processes only one of the <math>n_0</math> inputs. This partitions the rectifier units into <math>n_0</math> subsets, each of cardinality <math>p = \lfloor n/n_0 \rfloor</math> (ignoring remaining units). For each subset <math>j</math> we select the <math>j</math>-th input with a row vector <math>\mathbf{w}</math> whose <math>j</math>-th entry is 1 and whose remaining entries are 0. The bias values are absorbed into the activation functions of the <math>p</math> units:<br />
<math>h_1(\mathbf{x}) = \max \{ 0, \mathbf{w}\mathbf{x} \}, \qquad h_i(\mathbf{x}) = \max \{ 0, 2\mathbf{w}\mathbf{x} - 2(i - 1) \}, \quad 1 < i \leq p</math><br />
Next, these activation functions are added with alternating signs (a calculation that can be absorbed into the connection weights to the next layer):<br />
<math>\tilde{h}_j(\mathbf{x}) = h_1(\mathbf{x}) - h_2(\mathbf{x}) + h_3(\mathbf{x}) - \cdots + {(-1)}^{p-1} h_p(\mathbf{x})</math><br />
This gives a function which folds the <math>p</math> segments <math>(-\infty, 0],\ [0, 1],\ [1, 2],\ \ldots,\ [p - 1, \infty)</math> onto the interval <math>[0, 1]</math>.<br />
<br />
[[File:constr.png]]<br />
<br />
Going from these <math>n_0</math> functions for subsets of rectifiers to the full <math>n_0</math> dimensional function <math>\tilde{h} = {[\tilde{h}_1, \tilde{h}_2, \ldots, <br />
\tilde{h}_{n_0}]}^{\top}</math> gives a total of <math>p^{n_0}</math> hypercubes mapped onto the same output.<br />
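<br />
A small numerical sketch of this folding map for a single input dimension; it implements the construction above with breakpoints at the integers and checks that each unit segment traces out the interval <math>[0, 1]</math>.<br />
<br />
<pre>
import numpy as np

def fold(x, p):
    # h~(x) = h_1(x) - h_2(x) + h_3(x) - ... with h_1(x) = max(0, x)
    # and h_i(x) = max(0, 2x - 2(i - 1)) for 1 < i <= p.
    out = np.maximum(0.0, x)
    for i in range(2, p + 1):
        out += (-1) ** (i - 1) * np.maximum(0.0, 2.0 * x - 2.0 * (i - 1))
    return out

p = 4
for seg_start in range(p):                  # the segments [0,1], [1,2], ..., [p-1,p]
    xs = seg_start + np.linspace(0.0, 1.0, 5)
    print(np.round(fold(xs, p), 3))         # each segment traces out [0, 1]
</pre>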
<br />
Counting the number of separate regions produced by the last layer and multiplying this by the number of regions mapped onto each other by the earlier layers, we get <math>\underbrace{\left( \prod_{i=1}^{L-1} {\left\lfloor \frac{n_i}{n_0} \right\rfloor}^{n_0} \right)}_{\text{mapped hypercubes}} \cdot \underbrace{\sum_{j=0}^{n_0} \binom{n_L}{j}}_{\text{last layer (shallow net)}}</math> as a lower bound on the maximal number of linear regions of functions computed by a deep rectifier network with <math>n_0</math> inputs and <math>L</math> hidden layers. We can also write this lower bound as <math>\Omega\!\left({\left(\frac{n}{n_0}\right)}^{(L-1)n_0} n^{n_0}\right)</math>, which makes it clear that this number grows exponentially with <math>L</math>, versus the polynomial scaling of a shallow model with <math>nL</math> hidden units.<br />
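<br />
A quick sketch comparing the shallow upper bound with this deep lower bound for some example widths and depths (the numbers are purely illustrative):<br />
<br />
<pre>
from math import comb, floor

def shallow_bound(n0, n_hidden):
    # maximal regions of a single hidden layer with n_hidden units
    return sum(comb(n_hidden, j) for j in range(n0 + 1))

def deep_lower_bound(n0, widths):
    # lower bound from the construction above: product of floor(n_i/n0)^{n0}
    # over the first L-1 layers, times the shallow bound for the last layer
    bound = 1
    for n_i in widths[:-1]:
        bound *= floor(n_i / n0) ** n0
    return bound * shallow_bound(n0, widths[-1])

n0, n, L = 2, 20, 4
print(shallow_bound(n0, n * L))          # shallow net with the same total number of units
print(deep_lower_bound(n0, [n] * L))     # deep net with L layers of width n
</pre>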
<br />
In fact, it is possible to obtain asymptotic bounds on the number of linear regions per parameter in the neural network models:<br />
<br />
* For a deep model, the asymptotic bound is exponential: <math>\Omega\left(\left(n/n_0\right)^{n_0(L-1)}\frac{n^{n_0-2}}{L}\right)</math><br />
* For a shallow model, the asymptotic bound is polynomial: <math>O(L^{n_0-1}n^{n_0-1})</math><br />
<br />
= Conclusion =<br />
<br />
The number of piecewise linear segments the input space can be split into grows exponentially with the number of layers of a deep neural network, whereas the growth is only polynomial with the number of neurons. This explains why deep neural networks perform so much better than shallow neural networks. The paper showed this result for deep rectifier networks and deep maxout networks, but the same analysis should be applicable to other types of deep neural networks.<br />
<br />
Furthermore, the paper provides a useful intuition in terms of space folding to think about deep neural networks.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=mULTIPLE_OBJECT_RECOGNITION_WITH_VISUAL_ATTENTION&diff=27063mULTIPLE OBJECT RECOGNITION WITH VISUAL ATTENTION2015-12-04T03:34:03Z<p>Derek: /* Introduction */</p>
<hr />
<div>= Introduction =<br />
<br />
Recognizing multiple objects in images has been one of the most important goals of computer vision. Previous work on classifying sequences of characters often employed a sliding-window detector with an individual character classifier. However, such systems often require components that are set up in a case-specific manner to determine possible object locations. In this paper an attention-based model for recognizing multiple objects in images is presented. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. It is shown that the proposed method is more accurate than state-of-the-art convolutional networks while using fewer parameters and less computation.<br />
One of the main drawbacks of convolutional networks (ConvNets) is their poor scalability with increasing input image size, so efficient implementations of these models have become necessary. In this work, the authors take inspiration from the way humans perform visual sequence recognition tasks such as reading: by continually moving the fovea to the next relevant object or character, recognizing the individual object, and adding the recognized object to an internal representation of the sequence. The proposed system is a deep recurrent neural network that at each step processes a multi-resolution crop of the input image, called a “glimpse”. The network uses information from the glimpse to update its internal representation of the input, and outputs the next glimpse location and possibly the next object in the sequence. The process continues until the model decides that there are no more objects to process.<br />
<br />
= Deep Recurrent Visual Attention Model:=<br />
<br />
For simplicity, the authors first describe how the model can be applied to classifying a single object, and later show how it can be extended to multiple objects. Processing an image <math>x</math> with an attention-based model is a sequential process with <math>N</math> steps, where each step consists of a glimpse. At each step <math>n</math>, the model receives a location <math>l_n</math> along with a glimpse observation <math>x_n</math> taken at location <math>l_n</math>. The model uses the observation to update its internal state and outputs the location <math>l_{n+1}</math> to process at the next time-step. A graphical representation of the proposed model is shown in Figure 1.<br />
<br />
[[File:0.PNG | center]]<br />
<br />
The above model can be broken down into a number of sub-components, each mapping some input into a vector output. In this paper the term “network” is used to describe these sub-components.<br />
<br />
Glimpse Network:<br />
<br />
The job of the glimpse network is to extract a set of useful features from a glimpse of the raw visual input at a given location. The glimpse network is a non-linear function that receives the current input image patch, or glimpse (<math>x_n</math>), and its location tuple (<math>l_n</math>) as input and outputs a feature vector that encodes both what is in the patch and where it was taken. <br />
There are two separate sub-networks in the structure of the glimpse network, each of which has its own input. The first, which extracts features of the image patch, takes the patch as input and consists of three convolutional hidden layers (without any pooling layers) followed by a fully connected layer. Separately, the location tuple is mapped through a fully connected hidden layer. The element-wise multiplication of the two output vectors produces the final glimpse feature vector <math>g_n</math>.<br />
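<br />
A much-simplified sketch of this what/where combination, with single fully connected layers standing in for the convolutional stack and all sizes chosen arbitrarily:<br />
<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)
patch_size, feat_dim = 12 * 12, 128          # illustrative glimpse and feature sizes
W_what = 0.01 * rng.standard_normal((feat_dim, patch_size))
W_where = 0.01 * rng.standard_normal((feat_dim, 2))

def relu(z):
    return np.maximum(0.0, z)

def glimpse_features(patch, location):
    # "what" pathway: features of the image patch (convolutional layers in the
    # paper; a single fully connected layer here for brevity)
    what = relu(W_what @ patch.ravel())
    # "where" pathway: features of the (x, y) location tuple
    where = relu(W_where @ location)
    # element-wise multiplication combines the two into g_n
    return what * where

g_n = glimpse_features(rng.random((12, 12)), np.array([0.1, -0.3]))
print(g_n.shape)   # (128,)
</pre>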
<br />
Recurrent Network:<br />
<br />
The recurrent network aggregates information extracted from the individual glimpses and combines it in a coherent manner that preserves spatial information. The glimpse feature vector <math>g_n</math> from the glimpse network is supplied as input to the recurrent network at each time step.<br />
The recurrent network consists of two recurrent layers, whose outputs are denoted <math>r_n^{(1)}</math> and <math>r_n^{(2)}</math>.<br />
<br />
Emission Network:<br />
<br />
The emission network takes the current state of recurrent network as input and makes a prediction on where to extract the next image patch for the glimpse network. It acts as a controller that directs attention based on the current internal states from the recurrent network. It consists of a fully connected hidden layer that maps the feature vector <math>r_n^{(2)}</math> from the top recurrent layer to a coordinate tuple <math>l_{n+1}</math>.<br />
<br />
Context Network:<br />
<br />
The context network provides the initial state for the recurrent network and its output is used by the emission network to predict the location of the first glimpse. The context network C(.) takes a down-sampled low-resolution version of the whole input image <math>I_{coarse}</math> and outputs a fixed-length vector <math>c_I</math>. The contextual information provides sensible hints on where the potentially interesting regions are in a given image. The context network employs three convolutional layers that map a coarse image <math>I_{coarse}</math> to a feature vector.<br />
<br />
Classification Network:<br />
<br />
The classification network outputs a prediction for the class label y based on the final feature vector <math>r_N^{(1)}</math> of the lower recurrent layer. The classification network has one fully connected hidden layer and a softmax output layer for the class y.<br />
<br />
In order to prevent the model from learning to classify from the contextual information alone, rather than by combining information from different glimpses, the context network and classification network are connected to different recurrent layers in the deep model. This helps the deep recurrent attention model learn to look at locations that are relevant for classifying objects of interest.<br />
<br />
= Learning Where and What=<br />
<br />
Given the class label <math>y</math> of an image <math>I</math>, learning can be formulated as a supervised classification problem with the cross-entropy objective function. The attention model predicts the class label conditioned on intermediate latent location variables <math>l</math> from each glimpse and extracts the corresponding patches. We can thus maximize the likelihood of the class label by marginalizing over the glimpse locations.<br />
<br />
[[File:2eq.PNG | center]]<br />
<br />
Using some simplifications, the practical algorithm to train the deep attention model can be expressed as:<br />
<br />
[[File:3.PNG | center]]<br />
<br />
where <math>\tilde{l^m}</math> is a sample of the location of glimpse <math>m</math>; that is, the glimpse location prediction is sampled from the model after each glimpse. In the above equation, the log-likelihood in the second term has an unbounded range, which can introduce high variance in the gradient estimator and occasionally induce an undesirably large gradient update that is backpropagated through the rest of the model. In this paper the term is therefore replaced with a 0/1 discrete indicator function <math>R</math>, and a baseline technique <math>b</math> is used to reduce the variance of the estimator. <br />
<br />
[[File:4eq.PNG | center]]<br />
<br />
So the gradient update can be expressed as following:<br />
<br />
[[File:5.PNG | center]]<br />
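<br />
As a rough sketch of the resulting variance-reduced location gradient for a single glimpse, assuming (purely for illustration) a Gaussian location policy with fixed standard deviation; none of the symbols below are taken from the paper's notation.<br />
<br />
<pre>
import numpy as np

def location_grad(mean, sampled_loc, predicted_class, true_class, baseline, std=0.1):
    # 0/1 reward: 1 if the final class prediction was correct, else 0
    R = 1.0 if predicted_class == true_class else 0.0
    # gradient of log N(sampled_loc; mean, std^2 I) with respect to the mean
    grad_log_pi = (sampled_loc - mean) / (std ** 2)
    # REINFORCE term with baseline b to reduce variance
    return (R - baseline) * grad_log_pi

g = location_grad(mean=np.array([0.2, -0.1]),
                  sampled_loc=np.array([0.25, -0.05]),
                  predicted_class=3, true_class=3, baseline=0.8)
print(g)    # pushed back through the emission network during training
</pre>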
<br />
In fact, with the 0/1 indicator function, the learning rule from the above equation is equivalent to the REINFORCE learning rule, with <math>R</math> playing the role of the reward.<br />
During inference, the feedforward location prediction can be used as a deterministic prediction of the location coordinates to extract the next input image patch for the model. Alternatively, the marginalized objective function suggests a procedure to estimate the expected class prediction by using samples of location sequences <math>\{\tilde{l_1^m},\dots,\tilde{l_N^m}\}</math> and averaging their predictions.<br />
<br />
[[File:6.PNG | center]]<br />
<br />
= Multi Object/Sequence Classification as a Visual Attention Task=<br />
<br />
The proposed attention model can easily be extended to solve classification tasks involving multiple objects. To train the recurrent network in this case, the multiple object labels for a given image need to be cast into an ordered sequence <math>\{y_1,\dots,y_S\}</math>. Assuming there are <math>S</math> targets in an image, the objective function for the sequential prediction is:<br />
<br />
[[File:7.PNG | center]]<br />
<br />
= Experiments:=<br />
<br />
To show the effectiveness of the deep recurrent attention model (DRAM), multi-object classification tasks are investigated on two different datasets: MNIST and multi-digit SVHN.<br />
<br />
MNIST Dataset Results:<br />
<br />
Two main evaluations of the method are done using the MNIST dataset:<br />
<br />
1) Learning to find digits<br />
<br />
2) Learning to do addition (the model has to find where each digit is and add the digits up; the task is to predict the sum of the two digits in the image, posed as a classification problem)<br />
<br />
The results for both experiments are shown in table 1 and table 2. As stated in the tables, the DRAM model with a context network significantly outperforms the other models.<br />
<br />
[[File:8.PNG | center]]<br />
<br />
SVHN Dataset Results:<br />
<br />
The publicly available multi-digit Street View House Numbers (SVHN) dataset consists of images of digits taken from pictures of house fronts. This experiment is more challenging, and a model is trained to classify all the digits in an image sequentially. Two different models are implemented in this experiment.<br />
First, the label sequence ordering is chosen to go from left to right, following the natural ordering of the house number; in this case there is a performance gap between the state-of-the-art deep ConvNet and a single DRAM that “reads” from left to right. Therefore, a second recurrent attention model is trained to “read” the house numbers from right to left as a backward model. The forward and backward models share the same weights for their glimpse networks but have different weights for their recurrent and emission networks. The model performance is shown in table 3:<br />
<br />
[[File:9.PNG | center]]<br />
<br />
As shown in the table, the proposed deep recurrent attention model (DRAM) outperforms the state-of-the-art deep ConvNets on the standard SVHN sequence recognition task.<br />
<br />
= Discussion and Conclusion:=<br />
<br />
The recurrent attention models process only a selected subset of the input, and so have a lower computational cost than a ConvNet that looks over an entire image. They can also naturally work on images of different sizes with the same computational cost, independent of the input dimensionality. Moreover, the attention-based model is less prone to over-fitting than ConvNets, likely because of the stochasticity in the glimpse policy during training.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=on_the_difficulty_of_training_recurrent_neural_networks&diff=27061on the difficulty of training recurrent neural networks2015-12-04T03:22:57Z<p>Derek: /* The Temporal Order Problem */</p>
<hr />
<div>= Introduction =<br />
<br />
Training Recurrent Neural Networks (RNNs) is difficult; one of the most prominent problems in training RNNs has been the vanishing and exploding gradient problem,<ref name="yoshua1993">Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pages 1183–1188. IEEE, 1993.</ref> which prevents neural networks from learning and fitting the data. In this paper the authors propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.<br />
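<br />
The clipping strategy itself is simple to state; a minimal sketch follows (the threshold value is an arbitrary choice).<br />
<br />
<pre>
import numpy as np

def clip_gradient(grad, threshold=1.0):
    # If the gradient norm exceeds the threshold, rescale the gradient so
    # its norm equals the threshold; otherwise leave it unchanged.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])                  # norm 5
print(clip_gradient(g, threshold=1.0))    # rescaled to norm 1: [0.6, 0.8]
</pre>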
<br />
= Background =<br />
<br />
[[Image:rnn_2.png|frame|center|400px|alt=| Recurrent Neural Network Unrolled in time <ref name="pascanu">Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.</ref>]]<br />
A generic recurrent neural network has the form:<br />
<br />
<math>x_{t} = F(\mathbf{x}_{t -1}, \mathbf{u}_{t}, \theta)</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{x}_{t}</math> is the state at time <math>t</math></span><br />
* <span><math>\mathbf{u}_{t}</math> is the input at time <math>t</math></span><br />
* <span><math>\theta\,</math> are the parameters</span><br />
* <span><math>F()\,</math> is the function that represents a neuron</span><br />
<br />
In the theoretical sections the authors make use of a specific parametrization:<br />
<br />
<math>\mathbf{x}_{t} = \mathbf{W}_{rec} \sigma(\mathbf{x}_{t - 1}) + \mathbf{W}_{in} \mathbf{u}_{t} + b</math><br />
<br />
Where:<br />
<br />
* <span><math>\mathbf{W}_{rec}</math> is the RNN weights matrix</span><br />
* <span><math>\sigma()\,</math> is an element wise function</span><br />
* <span><math>b\,</math> is the bias</span><br />
* <span><math>\mathbf{W}_{in}</math> is the input weights matrix</span><br />
<br />
The following are the gradient equations for the Back-Propagation Through Time (BPTT) algorithm; the authors rewrote the equations in order to highlight the exploding gradients problem:<br />
<br />
<math>\frac{\partial \varepsilon}{\partial \theta} = <br />
\sum_{1 \leq t \leq T} \frac{\partial \varepsilon_t}{\partial \theta}</math><br />
<br />
<math>\frac{\partial \varepsilon_{t}}{\partial \theta} = \sum_{1 \leq k \leq t} \left( \frac{\partial \varepsilon_{t}}{\partial x_{t}} \frac{\partial x_{t}}{\partial x_{k}} \frac{\partial^{+} x_{k}}{\partial \theta} \right)</math><br />
<br />
<math>\frac{\partial x_{t}}{\partial x_{k}} = \prod_{k < i \leq t} \frac{\partial x_{i}}{\partial x_{i - 1}} = \prod_{k < i \leq t} \mathbf{W}^{T}_{rec} \textit{diag}(\sigma^{\prime}(x_{i - 1}))</math><br />
<br />
Where:<br />
<br />
* <span><math>\varepsilon_{t}</math> is the error obtained from output at time <math>t</math></span><br />
* <span><math>\frac{\partial^{+} \mathbf{x}_{k}}{\partial \theta}</math> is the immediate partial derivative of state <math>\mathbf{x}_{k}</math></span>. For the parametrization above, <math>\frac{\partial^+ \mathbf{x}_k}{\partial \mathbf{W}_{rec}} = \sigma(\mathbf{x}_{k-1})</math>.<br />
<br />
The authors of this paper also distinguish between ''long-term'' and ''short-term'' contributions to the gradient with respect to <math>\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}</math>. The contribution is ''long-term'' if <math>k \ll t</math>, and ''short-term'' otherwise.<br />
<br />
== Exploding and Vanishing Gradients ==<br />
=== The Mechanics ===<br />
<br />
It's known that <math> |\sigma^'(x)| </math> is bounded. Let <math>\left|\left|diag(\sigma^'(x_k))\right|\right| \leq \gamma \in R</math>.<br />
<br />
The paper first proves that <math> \lambda_1 < \frac{1}{\gamma} </math>, where <math> \lambda_1 </math> is the largest singular value of <math> \bold{W}_{rec} </math>, is a sufficient condition for the vanishing gradient problem to occur. The Jacobian matrix <math> \frac{\partial x_{k+1}}{\partial x_k} </math> is given by <math> \bold{ W}_{rec}^{T}diag(\sigma^'(x_k)) </math>. The 2-norm of this Jacobian is bounded by the product of the norms of the two matrices, which leads to <math> \forall k, ||\frac{\partial{x_{k+1}}}{\partial x_k}|| \leq ||\bold{W}_{rec}^T||\,||diag(\sigma^'(x_k))|| < \frac{1}{\gamma}\gamma = 1</math><br />
<br />
Let <math>\eta \in R</math> be such that <math>\forall k, ||\frac{\partial {x_{k+1}}}{\partial x_k}|| \leq \eta < 1</math>. By induction over <math>i</math>, we can show that <math>||\frac{\partial \varepsilon_t}{\partial x_t}(\prod_{i=k}^{t-1}{\frac{\partial x_{i+1}}{\partial x_i}})|| \leq \eta^{t-k}||\frac{\partial \varepsilon_t}{\partial x_t}||</math>. Since <math> \eta < 1 </math>, this bound decays exponentially as <math> t-k </math> grows, so the long-term contributions to the gradient vanish.<br />
<br />
Inverting this argument gives a necessary condition for exploding gradients: the largest singular value <math>\lambda_1 </math> must be larger than <math> \frac{1}{\gamma}</math> (otherwise the long-term components would vanish rather than explode).<br />
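<br />
As a toy numerical illustration of these factors, the following sketch forms the product of Jacobians at the fixed point <math>x = 0</math> of a tanh RNN with zero bias and no input, so that each factor reduces to <math>\mathbf{W}^{T}_{rec}</math>; the sizes and singular-value scales are arbitrary choices.<br />
<br />
<pre>
import numpy as np

rng = np.random.default_rng(0)
n, steps = 50, 100                      # hidden units and t - k (arbitrary sizes)

def product_norm(sv):
    # Scale a random recurrent matrix so its largest singular value is sv, then
    # take the product of Jacobians at x = 0 with a tanh non-linearity, zero
    # bias and no input: tanh'(0) = 1, so each factor is exactly W_rec^T.
    W = rng.standard_normal((n, n))
    W *= sv / np.linalg.norm(W, 2)
    M = np.eye(n)
    for _ in range(steps):
        M = W.T @ M
    return np.linalg.norm(M, 2)

print(product_norm(0.9))   # largest singular value < 1/gamma = 1: guaranteed to vanish
print(product_norm(4.0))   # largest singular value well above 1: typically explodes
</pre>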
<br />
=== From a dynamical systems perspective ===<br />
<br />
Drawing from a dynamical systems perspective similar to <ref name="yoshua1993"></ref><ref name="doya1993">Kenji Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on neural networks, 1:75–80, 1993.</ref>, dynamical systems theory states that as “<math>\theta</math> changes the asymptotic behaviour changes smoothly almost everywhere except for certain crucial points where drastic changes occur”.<ref name="pascanu"></ref> Crossing these bifurcation points has the potential to cause gradients to explode.<ref name="doya1993"></ref><br />
<br />
The authors of this paper argue, however, that crossing these bifurcation points does not guarantee a sudden change in gradients. Their idea is that a change to the model parameters can alter the attractor landscape in such a way that the basin of attraction corresponding to the current model state is unaltered. For example, a change to the model parameters might eliminate a basin of attraction in a portion of the model's state space that is very far from its current state. In this case, the bifurcation will have no effect on the asymptotic behaviour of the model, and there will accordingly be no gradient explosion. On the other hand, if a change to the model parameters substantially alters the final basin of attraction given the current state, then there will be a considerable effect on the asymptotic behaviour of the model, and the gradients will accordingly explode. <br />
<br />
[[Image:dynamic_perspective.png|frame|center|400px|alt=|Bifurcation diagram of Single Hidden Unit RNN <ref name="pascanu"></ref>]]<br />
<br />
The figure above depicts a bifurcation diagram for a single-unit RNN, where the x-axis is the parameter <math>b</math> (bias), the y-axis is the asymptotic state <math>x_{\infty}</math> (i.e. the equilibrium activation value of the unit), and the plot line traces the movement of the final point attractor <math>x_{\infty}</math> as <math>b</math> changes. The figure shows the presence of two attractors, one emerging at <math>b_1</math> and another disappearing at <math>b_2</math>, as the value of <math>b</math> is decreased. Note that only one attractor exists when the value of <math>b</math> is outside of the interval between <math>b_1</math> and <math>b_2</math>, and that when two attractors exist, the attractor state towards which the unit ultimately gravitates is determined by its initial starting state. The boundary between these two basins of attraction is denoted by the dashed line; starting states on opposite sides of this boundary gravitate towards different attractor states. The blue filled circles indicate a bifurcation point at which a small change to the value of <math>b</math> can have a drastic effect on the attractor landscape over the unit's state space; in short, the landscape shifts to include a single attractor state for a low value of <math>x</math>. The unfilled green circles represent Pascanu’s (2013) extension of Doya’s hypothesis: if the model state lies in the boundary region at time <math>0</math>, a small change in <math>b</math> results in a sudden large change in <math>x_{t}</math>.<br />
<br />
Overall, these remarks indicate that, when treated as a dynamical system, the behaviour of an RNN can be analyzed with respect to both changes to the parameter values that determine an attractor landscape over its state space (assuming a fixed starting state), and changes to the starting state (assuming a fixed attractor landscape). <br />
<br />
=== From a geometric perspective ===<br />
<br />
Aside from the dynamical systems perspective on exploding and vanishing gradients, the authors also consider a geometric perspective, in which a simple RNN with a single hidden unit is analyzed.<br />
<br />
[[Image:geometric_perspective.png|frame|center|400px|alt=|Error Loss surface of single hidden unit RNN <ref name="pascanu"></ref>]]<br />
<br />
Reusing the RNN state equation the authors defined:<br />
<br />
<math>x_{t} = W_{rec} \sigma(x_{t - 1}) + W_{in} u_{t} + b</math><br />
<br />
Assuming no input, a linear activation (i.e. <math>\sigma</math> is the identity), <math>b = 0</math>, and initial state <math>x_{0}</math>, the equation simplifies to:<br />
<br />
<math>x_{t} = W_{rec}^{t} x_{0}</math><br />
<br />
Differentiating this equation to first and second order with respect to the recurrent weight <math>W_{rec}</math> (a scalar in this single-unit case) gives:<br />
<br />
<math>\frac{\partial x_{t}}{\partial W_{rec}} = t W_{rec}^{t - 1} x_{0}</math><br />
<br />
<math>\frac{\partial^{2} x_{t}}{\partial W_{rec}^{2}} = t (t - 1) W_{rec}^{t - 2} x_{0}</math><br />
<br />
This implies that if the first-order derivative explodes, so does the second-order derivative. In the general case, when the gradients explode they do so along some direction '''v'''. If this bound is tight, it is hypothesized that ''when gradients explode so does the curvature along'' '''v''', ''leading to a wall in the error surface'', like the one seen above. If both the gradient and the leading eigenvector of the curvature are aligned with the exploding direction '''v''', it follows that the error surface has a steep wall perpendicular to '''v''' (and consequently to the gradient). This means that when stochastic gradient descent (SGD) reaches this wall of the loss surface and attempts to step into it, the update will be deflected away, possibly hindering the learning process (see the figure above). Note that this explanation assumes that the valley bordered by the steep cliff in the loss surface is wide enough relative to the clipping threshold applied to the gradient; otherwise, the deflection caused by an SGD update would still hinder learning, even when clipping is used. The practical effectiveness of clipping provides some evidence in support of this assumption.<br />
<br />
= Dealing with Exploding and Vanishing Gradient =<br />
<br />
== Related Work ==<br />
<br />
* <span>'''L1/L2 Regularization''': Helps with exploding gradients, but limits the RNN to a single point attractor at the origin and prevents the model from acting as a generative model or exhibiting long-term memory traces.</span><br />
* <span>'''Teacher Forcing''': Proposed by <ref name="doya1993"></ref>, this approach pushes the model across a bifurcation boundary when it does not exhibit asymptotic behaviour towards the desired target (i.e. the desired target values are fed back into the model during training). This assumes the user knows what the desired behaviour should look like, or how to initialize the model so as to reduce exploding gradients.</span><br />
* <span>'''LSTM''': The Long Short-Term Memory architecture <ref name="graves2009">Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.</ref><ref name="Hochreiter">Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</ref> is an attempt to deal with the vanishing gradient problem by using a linear unit that feeds back to itself with a weight of <math>1</math>. This solution, however, does not deal with the exploding gradient problem.</span><br />
* <span>'''Hessian-Free optimizer with structural damping''': Proposed by <ref>Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.</ref> to address the vanishing and exploding gradient problems. <ref name="pascanu"></ref> reasons that this approach helps with the vanishing gradient problem because the high dimensionality of the space gives rise to a high probability that the long-term components are orthogonal to the short-term components. Additionally, for the exploding gradient problem, the curvature is taken into account.</span><br />
* <span>'''Echo State Networks''': <ref>Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.</ref> avoid the exploding and vanishing gradient problems by not learning the input and recurrent weights; these are instead drawn from hand-crafted distributions that prevent information from being lost, and the spectral radius of the recurrent weight matrix is usually kept smaller than 1.</span><br />
<br />
== Scaling Down the Gradients ==<br />
<br />
[[Image:gradient_clipping.png|frame|center|400px]]<br />
<br />
The intuition behind this gradient clipping algorithm is simple: compute the norm of the gradients, and if it is larger than a set threshold, rescale the gradients by a constant equal to the threshold divided by the norm of the gradients. <ref name="pascanu"></ref> suggests setting the threshold between half and ten times the average gradient norm observed over a sufficiently large number of updates.<br />
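<br />
A minimal NumPy sketch of this rescaling step is shown below; the list-of-arrays interface and the threshold value in the usage example are illustrative assumptions, not taken from the paper's implementation.<br />
<pre>
import numpy as np

def clip_gradients(grads, threshold):
    """Rescale a list of gradient arrays so that their joint 2-norm is at most `threshold`."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm          # shrink all gradients by the same factor
        grads = [g * scale for g in grads]
    return grads

# example: gradients for W_rec, W_in, b with an (illustrative) threshold of 1.0
rng = np.random.default_rng(1)
grads = [rng.standard_normal((50, 50)), rng.standard_normal((50, 10)), rng.standard_normal(50)]
clipped = clip_gradients(grads, threshold=1.0)
</pre>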
<br />
== Vanishing Gradient Regularization ==<br />
<br />
The vanishing gradient regularizer is as follows:<br />
<br />
<math>\Omega = \sum_{k} \Omega_{k} = \sum_{k} \left( \frac{\left\| \frac{\partial \varepsilon}{\partial x_{k + 1}} \frac{\partial x_{k + 1}}{\partial x_{k}} \right\|}{\left\| \frac{\partial \varepsilon}{\partial x_{k + 1}} \right\|} - 1 \right)^{2}</math><br />
<br />
The vanishing gradient problem occurs when, at time <math>t</math>, the inputs <math>u</math> may be irrelevant and noisy and the network starts to learn to ignore them; this is not desirable, as the model may end up not learning anything. The authors note that the sensitivity to the inputs <math>u_{k} \dots u_{t}</math> can be preserved by increasing the norm of <math>\frac{\partial x_t}{\partial x_{k}}</math>. Forcing this simply by keeping the error large would prevent the model from converging, so the authors argue that a regularizer is a more natural choice. The regularizer is a soft constraint that forces the Jacobian matrices <math>\frac{\partial x_{k + 1}}{\partial x_{k}}</math> to preserve norm in the direction of the error <math>\frac{\partial \varepsilon}{\partial x_{k + 1}}</math>.<br />
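<br />
As a concrete illustration, the sketch below evaluates <math>\Omega</math> for a vanilla tanh RNN in which the error is measured only at the final state. The network form, the toy interface, and the use of the standard (untransposed) Jacobian convention are assumptions of the sketch, not the paper's implementation; the norms involved are the same quantities either way, since the product <math>\frac{\partial \varepsilon}{\partial x_{k + 1}} \frac{\partial x_{k + 1}}{\partial x_{k}}</math> is just the error backpropagated one step. In training, <math>\Omega</math> would be added to the loss with a regularization weight (the experiments below use a weight of 4).<br />
<pre>
import numpy as np

def omega_regularizer(W_rec, states, g_final):
    """Omega = sum_k (||dE/dx_{k+1} * dx_{k+1}/dx_k|| / ||dE/dx_{k+1}|| - 1)^2
    for a toy RNN x_{k+1} = W_rec tanh(x_k) + W_in u_{k+1} + b.
    states: hidden states x_0 .. x_{T-1}; g_final: dE/dx_T."""
    g_next = g_final
    omega = 0.0
    for x_k in reversed(states):
        # standard Jacobian dx_{k+1}/dx_k = W_rec diag(tanh'(x_k)) (the paper writes its transpose)
        J = W_rec * (1.0 - np.tanh(x_k) ** 2)
        g_k = J.T @ g_next                          # backpropagated error dE/dx_k
        ratio = np.linalg.norm(g_k) / (np.linalg.norm(g_next) + 1e-12)
        omega += (ratio - 1.0) ** 2
        g_next = g_k
    return omega
</pre>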
<br />
= Experimental Results =<br />
<br />
== The Temporal Order Problem ==<br />
<br />
The authors repeated the temporal order problem as the prototypical pathological problem for validating the clipping and regularization methods devised. The temporal order problem involves generating a long sequence of discrete symbols in which an <math>A</math> or a <math>B</math> symbol is placed near the beginning and another near the middle of the sequence. The task is to correctly classify, at the end of the sequence, the order in which the two symbols appeared.<br />
<br />
Three different RNN initializations were used for the experiment:<br />
<br />
* <span>'''sigmoid unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
* <span>'''basic tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.1)</math></span><br />
* <span>'''smart tanh unit network''': <math>W_{rec}, W_{in}, W_{out} \sim \mathcal{N}(0, 0.01)</math></span><br />
<br />
For each of the three networks, three different optimizer configurations were used:<br />
<br />
* <span>'''MSGD''': Mini-batch Stochastic Gradient Descent</span><br />
* <span>'''MSGD-C''': MSGD with Gradient Clipping</span><br />
* <span>'''MSGD-CR''': MSGD-C with Regularization</span><br />
<br />
Additional model parameters include:<br />
<br />
* <span>Spectral Radius of 0.95</span><br />
* <span><math>b = b_{out} = 0</math></span><br />
* <span>50 hidden neurons</span><br />
* <span>constant learning rate of 0.01</span><br />
* <span>clipping threshold of 1 (only for MSGD-C and MSGD-CR)</span><br />
* <span>regularization weight of 4 (MSGD-CR)</span><br />
<br />
The experiment was performed 5 times. From the figure below we can observe the importance of gradient clipping and the regularizer: in all cases the combination of the two methods yielded the best results, regardless of which unit network was used. Furthermore, this experiment provides empirical evidence that exploding gradients correlate with tasks that require long memory traces: as the sequence length of the problem increases, clipping and regularization become more important. This is largely because increased memory requires a larger spectral radius, which in turn increases the likelihood of gradient explosion.<br />
<br />
[[Image:experimental_results.png|frame|center|400px]]<br />
<br />
== Other Pathological Problems ==<br />
<br />
The authors repeated other pathological problems from <ref name="Hochreiter"></ref>. The results are listed below:<br />
<br />
[[Image:experimental_results_2.png|image]]<br />
<br />
The authors did not discuss the experimental results in detail here.<br />
<br />
= Summary =<br />
<br />
The paper explores two different perspectives, dynamical-systems and geometric, in explaining the exploding and vanishing gradient problems that arise when training RNNs. The authors devise methods to mitigate these problems by introducing gradient clipping and a vanishing-gradient regularizer. Their experimental results show that, in all cases except for the Penn Treebank dataset, clipping and the regularizer improved on the state of the art for RNNs on the respective tasks.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26961learning Fast Approximations of Sparse Coding2015-11-27T18:01:47Z<p>Derek: /* Berkeley Image Database */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which, in empirical testing, is demonstrated to be roughly 10 times more efficient than the previous state-of-the-art approximation. The main contribution of this paper is a highly efficient learning-based method that computes good approximations of optimal sparse codes in a fixed amount of time.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix of normalized basis vectors with respect to which the coordinates of <math> \, Z </math> are defined. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
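<br />
For concreteness, the loss can be written directly; this minimal NumPy sketch uses our own variable names and assumes <math> W_d </math> is stored as an <math> n \times m </math> array.<br />
<pre>
import numpy as np

def sparse_coding_loss(X, Z, W_d, alpha):
    """E_{W_d}(X, Z) = 0.5 * ||X - W_d Z||_2^2 + alpha * ||Z||_1."""
    residual = X - W_d @ Z
    return 0.5 * np.dot(residual, residual) + alpha * np.sum(np.abs(Z))
</pre>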
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
Here baseline iterative shrinkage algorithms for finding sparse codes are introduced and explained. The ISTA and FISTA methods update the whole code vector in parallel, while the more efficient Coordinate Descent method (CoD) updates the components one at a time and carefully selects which component to update at each step.<br />
Both methods refine the initial guess through a form of mutual inhibition between code components, and component-wise shrinkage.<br />
<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper bound on the largest eigenvalue of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) \, max(|V_i| - \theta_i, \, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
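<br />
A minimal NumPy sketch of this iteration is given below. The fixed iteration count, the computation of <math> \, L </math> as the largest eigenvalue of <math> W_d^TW_d </math>, and the variable names are our assumptions rather than details from the paper.<br />
<pre>
import numpy as np

def shrink(V, theta):
    """Component-wise shrinkage h_theta(V)_i = sign(V_i) * max(|V_i| - theta_i, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def ista(X, W_d, alpha, n_iter=100):
    """Approximate Z* = argmin_Z 0.5*||X - W_d Z||^2 + alpha*||Z||_1 with ISTA."""
    m = W_d.shape[1]
    L = np.linalg.eigvalsh(W_d.T @ W_d).max()    # upper bound on eigenvalues of W_d^T W_d
    W_e = W_d.T / L                              # filter matrix
    S = np.eye(m) - (W_d.T @ W_d) / L            # mutual-inhibition matrix
    theta = np.full(m, alpha / L)                # sparsity thresholds
    Z = np.zeros(m)
    for _ in range(n_iter):
        Z = shrink(W_e @ X + S @ Z, theta)       # Z^{(k+1)} = h_theta(W_e X + S Z^{(k)})
    return Z
</pre>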
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term captures the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent (CoD) adopts this mentality and, as a result, yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
The CoD algorithm is presented below:<br />
<br />
<blockquote><br />
<math>\textbf{function} \, \textbf{CoD}\left(X, Z, W_d, S, \alpha\right)</math><br />
: <math>\textbf{Require:} \,S = I - W_d^T W_d</math><br />
: <math>\textbf{Initialize:} \,Z = 0; B = W_d^TX</math><br />
: <math> \textbf{repeat}</math><br />
:: <math>\bar{Z} = h_{\alpha}\left(B\right)</math><br />
:: <math> \,k = \mbox{ index of largest component of} \left|Z - \bar{Z}\right|</math><br />
:: <math> \forall j \in \left[1, m\right]: B_j = B_j + S_{jk}\left(\bar{Z}_k - Z_k\right)</math><br />
:: <math> Z_k = \bar{Z}_k</math><br />
: <math>\textbf{until}\,\text{change in}\,Z\,\text{is below a threshold}</math> <br />
: <math> Z = h_{\alpha}\left(B\right)</math><br />
<math> \textbf{end} \, \textbf{function} </math><br />
</blockquote><br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and so, also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. This algorithm has a similar feedback concept to ISTA, but it can be expressed as a linear feedback operation with a very sparse matrix (since only one component is updated at a time). Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
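<br />
A direct NumPy transcription of the pseudocode above might look like the following; the convergence tolerance and the iteration cap are our additions, and the dictionary columns are assumed to be normalized as stated earlier.<br />
<pre>
import numpy as np

def shrink(V, alpha):
    return np.sign(V) * np.maximum(np.abs(V) - alpha, 0.0)

def cod(X, W_d, alpha, tol=1e-6, max_iter=1000):
    """Coordinate Descent sparse coding, following the pseudocode above."""
    m = W_d.shape[1]
    S = np.eye(m) - W_d.T @ W_d                  # mutual-inhibition matrix
    B = W_d.T @ X
    Z = np.zeros(m)
    for _ in range(max_iter):
        Z_bar = shrink(B, alpha)
        k = np.argmax(np.abs(Z - Z_bar))         # coordinate with the largest proposed change
        if np.abs(Z[k] - Z_bar[k]) < tol:        # change in Z is below the threshold
            break
        B += S[:, k] * (Z_bar[k] - Z[k])         # B_j += S_{jk} (Z_bar_k - Z_k) for all j
        Z[k] = Z_bar[k]
    return shrink(B, alpha)
</pre>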
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \in \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
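<br />
In code, the LISTA encoder is simply the ISTA recursion with <math> \, W_e </math>, <math> \, S </math>, and <math> \, \theta </math> treated as free parameters and the loop truncated at ''T'' steps. The sketch below shows only the forward pass; parameter initialization and the gradient-descent training loop are omitted, and the names are ours.<br />
<pre>
import numpy as np

def shrink(V, theta):
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def lista_forward(X, W_e, S, theta, T):
    """T truncated ISTA-style steps with learned parameters W_e, S, theta."""
    Z = np.zeros(S.shape[0])
    for _ in range(T):
        Z = shrink(W_e @ X + S @ Z, theta)   # same recursion as (**), parameters now learned
    return Z

# Training (not shown): initialize W_e, S, theta from a dictionary W_d as in ISTA,
# then minimize ||lista_forward(X, ...) - Z*||^2 by stochastic gradient descent,
# where Z* is the optimal code computed by Coordinate Descent.
</pre>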
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
The algorithm for LCoD can be summarized as <br />
<br />
<br />
[[File:Q12.png]]<br />
<br />
<br />
A main advantage of the system proposed in this paper is speed, so it is necessary to take note of the asymptotic complexity of the above algorithm: only <math>\, O(m)</math> operations are required for each step of the bprop procedure, and each iteration only requires <math>\, O(m)</math> space, as almost all of the stored variables are scalar, with the exception of <math>\, B(T)</math>. (Recall that m refers to the number of dimensions in the new feature space with the sparse representations.)<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their CoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their CoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, and extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations. <br />
<br />
A complete feature vector consisted of 25 such vectors concatenated, extracted from all 16 × 16 patches shifted by 3 pixels on the input. The features were extracted for all digits using CoD with exact inference, CoD with a fixed number of iterations, and LCoD. Additionally, a version of CoD (denoted CoD’) used inference with a fixed number of iterations during training of the filters, and used the same number of iterations during test (same complexity as LCoD). A logistic regression classifier was trained on the features thereby obtained.<br />
<br />
Classification errors on the test set are shown in the following figures. While the error rate decreases with the number of iterations for all methods, the error rate of LCoD with 10 iterations is very close to the optimal (differences in error rates of less than 0.1% are insignificant on MNIST).<br />
<br />
[[File:T1.png]]<br />
<br />
MNIST results with 784-D sparse codes.<br />
<br />
[[File:T2.png]]<br />
<br />
MNIST results with 25 256-D sparse codes extracted from 16 × 16 patches every 3 pixels.<br />
<br />
= Conclusions =<br />
<br />
The idea of time-unfolding an inference algorithm in order to construct a fixed-depth network for sparse coding is introduced in this paper. In sparse coding, inference algorithms are iterative and converge to a fixed point; here it is proposed to unroll an inference algorithm for a fixed number of iterations in order to define an approximator network. The main result of this paper is the demonstration that the number of iterations required to reach a given code prediction error can be heavily reduced - by a factor of about 20 - by learning the filter and mutual-inhibition matrices of truncated FISTA and CoD. In other words, not much data-specific mutual inhibition is required to handle the phenomenon of "explaining away" superfluous parts of the code vector.<br />
<br />
= References =<br />
<br />
* Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm with application to wavelet-based image deblurring. In ICASSP’09, pp. 693–696, 2009.<br />
* Chen, S.S., Donoho, D.L., and Saunders, M.A. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.<br />
* Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. on Pure and Applied Mathematics, 57:1413–1457, 2004.<br />
* Donoho, D.L. and Elad, M. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. PNAS, 100(5):2197–2202, 2003.<br />
* Elad, M. and Aharon, M. Image denoising via learned dictionaries and sparse representation. In CVPR’06, 2006.<br />
* Hale, E.T., Yin, W., and Zhang, Y. Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM J. on Optimization, 19:1107, 2008.<br />
* Hoyer, P.O. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457–1469, 2004.<br />
* Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV’09. IEEE, 2009.<br />
* Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. Technical Report CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU, 2008.<br />
* Lee, H., Battle, A., Raina, R., and Ng, A.Y. Efficient sparse coding algorithms. In NIPS’06, 2006.<br />
* Lee, H., Chaitanya, E., and Ng, A.Y. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems, 2007.<br />
* Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning. ACM New York, 2009.<br />
* Li, Y. and Osher, S. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging, 3(3):487–503, 2009.<br />
* Mairal, J., Elad, M., and Sapiro, G. Sparse representation for color image restoration. IEEE T. Image Processing, 17(1):53–69, January 2008.<br />
* Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In ICML’09, 2009.<br />
* Olshausen, B.A. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.<br />
* Ranzato, M., Huang, F.-J., Boureau, Y.-L., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR’07. IEEE, 2007a.<br />
* Ranzato, M.-A., Boureau, Y.-L., Chopra, S., and LeCun, Y. A unified energy-based framework for unsupervised learning. In AI-Stats’07, 2007b.<br />
* Rozell, C.J., Johnson, D.H., Baraniuk, R.G., and Olshausen, B.A. Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20:2526–2563, 2008.<br />
* Vonesch, C. and Unser, M. A fast iterative thresholding algorithm for wavelet-regularized deconvolution. In IEEE ISBI, 2007.<br />
* Wu, T.T. and Lange, K. Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat, 2(1):224–244, 2008.<br />
* Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR’09, 2009.<br />
* Yu, K., Zhang, T., and Gong, Y. Nonlinear learning using local coordinate coding. In NIPS’09, 2009.</div>Derek
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.The main contribution of this paper is a highly efficient learning-based method that computes good approximations of optimal sparse codes in a fixed amount of time.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
Here baseline iterative shrinkage algorithms for finding sparse codes are introduced and explained. The ISTA and FISTA methods update the whole code vector in parallel, while the more efficient Coordinate Descent method (CoD) updates the components one at a time and carefully selects which component to update at each step.<br />
Both methods refine the initial guess through a form of mutual inhibition between code component, and component-wise shrinkage.<br />
<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent (CoD) adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
The CoD algorithm is presented below:<br />
<br />
<blockquote><br />
<math>\textbf{function} \, \textbf{CoD}\left(X, Z, W_d, S, \alpha\right)</math><br />
: <math>\textbf{Require:} \,S = I - W_d^T W_d</math><br />
: <math>\textbf{Initialize:} \,Z = 0; B = W_d^TX</math><br />
: <math> \textbf{repeat}</math><br />
:: <math>\bar{Z} = h_{\alpha}\left(B\right)</math><br />
:: <math> \,k = \mbox{ index of largest component of} \left|Z - \bar{Z}\right|</math><br />
:: <math> \forall j \in \left[1, m\right]: B_j = B_j + S_{jk}\left(\bar{Z}_k - Z_k\right)</math><br />
:: <math> Z_k = \bar{Z}_k</math><br />
: <math>\textbf{until}\,\text{change in}\,Z\,\text{is below a threshold}</math> <br />
: <math> Z = h_{\alpha}\left(B\right)</math><br />
<math> \textbf{end} \, \textbf{function} </math><br />
</blockquote><br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. This algorithm has a similar feedback concept to ISTA, but can it can expressed as a linear feedback operation with a very sparse matrix (since only one component is updated at a time). Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
The algorithm for LCoD can be summarized as <br />
<br />
<br />
[[File:Q12.png]]<br />
<br />
<br />
A main advantage of the system proposed in this paper is speed, so it is necessary to take note of the asymptotic complexity of the above algorithm: only <math>\, O(m)</math> operations are required for each step of the bprop procedure, and each iteration only requires <math>\, O(m)</math> space; as almost all of the stored variables are scalar, with the exception of <math>\, B(T)</math>. (Recall that m refers to the number of dimensions in the new feature space with the sparse representations.)<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their CoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations. <br />
<br />
A complete feature vector consisted of 25 concatenated such vectors, extracted<br />
from all 16 × 16 patches shifted by 3 pixels on the input.<br />
The features were extracted for all digits using<br />
CoD with exact inference, CoD with a fixed number of<br />
iterations, and LCoD. Additionally a version of CoD<br />
(denoted CoD’) used inference with a fixed number<br />
of iterations during training of the filters, and used<br />
the same number of iterations during test (same complexity<br />
as LCoD). A logistic regression classifier was<br />
trained on the features thereby obtained.<br />
<br />
Classification errors on the test set are shown in the following figures . While the error rate decreases with the<br />
number of iterations for all methods, the error rate<br />
of LCoD with 10 iterations is very close to the optimal<br />
(differences in error rates of less than 0.1% are<br />
insignificant on MNIST)<br />
<br />
[[File:T1.png]]<br />
<br />
MNIST results with 784-D sparse codes<br />
<br />
MNIST results with 25 256-D sparse codes extracted<br />
from 16 × 16 patches every 3 pixels<br />
<br />
<br />
[[File:T2.png]]<br />
<br />
= Conclusions =<br />
<br />
The idea of time unfolding an inference algorithm in order to construct a fixed-depth network in application to sparse coding is introduced in this paper. In sparse coding, inference algorithms are iterative and converge to a fixed point. In this paper it is proposed to unroll an inference algorithm for a fixed number of iterations in order to define an approximator network.The main result of this paper is the demonstration that the number of iterations required to reach a given code prediction error can be heavily reduced - by a factor of about 20 - when learning the filters and mutual inhibition matrices FISTA and CoD, when truncated. In other words, not much data-specific mutual inhibition is required to handle the phenomenon of "explaining away" superfluous parts of the code vector.<br />
<br />
</div>Derek
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.The main contribution of this paper is a highly efficient learning-based method that computes good approximations of optimal sparse codes in a fixed amount of time.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
Here baseline iterative shrinkage algorithms for finding sparse codes are introduced and explained. The ISTA and FISTA methods update the whole code vector in parallel, while the more efficient Coordinate Descent method (CoD) updates the components one at a time and carefully selects which component to update at each step.<br />
Both methods refine the initial guess through a form of mutual inhibition between code component, and component-wise shrinkage.<br />
<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent (CoD) adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
The CoD algorithm is presented below:<br />
<br />
<blockquote><br />
<math>\textbf{function} \, \textbf{CoD}\left(X, Z, W_d, S, \alpha\right)</math><br />
: <math>\textbf{Require:} \,S = I - W_d^T W_d</math><br />
: <math>\textbf{Initialize:} \,Z = 0; B = W_d^TX</math><br />
: <math> \textbf{repeat}</math><br />
:: <math>\bar{Z} = h_{\alpha}\left(B\right)</math><br />
:: <math> \,k = \mbox{ index of largest component of} \left|Z - \bar{Z}\right|</math><br />
:: <math> \forall j \in \left[1, m\right]: B_j = B_j + S_{jk}\left(\bar{Z}_k - Z_k\right)</math><br />
:: <math> Z_k = \bar{Z}_k</math><br />
: <math>\textbf{until}\,\text{change in}\,Z\,\text{is below a threshold}</math> <br />
: <math> Z = h_{\alpha}\left(B\right)</math><br />
<math> \textbf{end} \, \textbf{function} </math><br />
</blockquote><br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. This algorithm has a similar feedback concept to ISTA, but can it can expressed as a linear feedback operation with a very sparse matrix (since only one component is updated at a time). Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
The algorithm for LCoD can be summarized as <br />
<br />
<br />
[[File:Q12.png]]<br />
<br />
<br />
A main advantage of the system proposed in this paper is speed, so it is necessary to take note of the asymptotic complexity of the above algorithm: only <math>\, O(m)</math> operations are required for each step of the bprop procedure, and each iteration only requires <math>\, O(m)</math> space; as almost all of the stored variables are scalar, with the exception of <math>\, B(T)</math>. (Recall that m refers to the number of dimensions in the new feature space with the sparse representations.)<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations. <br />
<br />
A complete feature vector consisted of 25 concatenated such vectors, extracted<br />
from all 16 × 16 patches shifted by 3 pixels on the input.<br />
The features were extracted for all digits using<br />
CoD with exact inference, CoD with a fixed number of<br />
iterations, and LCoD. Additionally a version of CoD<br />
(denoted CoD’) used inference with a fixed number<br />
of iterations during training of the filters, and used<br />
the same number of iterations during test (same complexity<br />
as LCoD). A logistic regression classifier was<br />
trained on the features thereby obtained.<br />
<br />
Classification errors on the test set are shown in the following figures . While the error rate decreases with the<br />
number of iterations for all methods, the error rate<br />
of LCoD with 10 iterations is very close to the optimal<br />
(differences in error rates of less than 0.1% are<br />
insignificant on MNIST)<br />
<br />
[[File:T1.png]]<br />
<br />
MNIST results with 784-D sparse codes<br />
<br />
MNIST results with 25 256-D sparse codes extracted<br />
from 16 × 16 patches every 3 pixels<br />
<br />
<br />
[[File:T2.png]]<br />
<br />
= Conclusions =<br />
<br />
The idea of time unfolding an inference algorithm in order to construct a fixed-depth network in application to sparse coding is introduced in this paper. In sparse coding, inference algorithms are iterative and converge to a fixed point. In this paper it is proposed to unroll an inference algorithm for a fixed number of iterations in order to define an approximator network.The main result of this paper is the demonstration that the number of iterations required to reach a given code prediction error can be heavily reduced - by a factor of about 20 - when learning the filters and mutual inhibition matrices FISTA and CoD, when truncated. In other words, not much data-specific mutual inhibition is required to handle the phenomenon of "explaining away" superfluous parts of the code vector.<br />
<br />
= References =<br />
References<br />
Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding<br />
algorithm with application to waveletbased<br />
image deblurring. ICASSP’09, pp. 693–696, 2009.<br />
Chen, S.S., Donoho, D.L., and Saunders, M.A. Atomic<br />
decomposition by basis pursuit. SIAM review, 43(1):<br />
129–159, 2001.<br />
<br />
Daubechies, I, Defrise, M., and De Mol, C. An iterative<br />
thresholding algorithm for linear inverse problems with a<br />
sparsity constraint. Comm. on Pure and Applied Mathematics,<br />
57:1413–1457, 2004.<br />
<br />
Donoho, D.L. and Elad, M. Optimally sparse representation<br />
in general (nonorthogonal) dictionaries via ℓ<br />
1 minimization.<br />
PNAS, 100(5):2197–2202, 2003.<br />
<br />
Elad, M. and Aharon, M. Image denoising via learned dictionaries<br />
and sparse representation. In CVPR’06, 2006.<br />
Hale, E.T., Yin, W., and Zhang, Y. Fixed-point continuation<br />
for l1-minimization: Methodology and convergence.<br />
SIAM J. on Optimization, 19:1107, 2008.<br />
Hoyer, P. O. Non-negative matrix factorization with<br />
sparseness constraints. JMLR, 5:1457–1469, 2004.<br />
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun,<br />
Y. What is the best multi-stage architecture for object<br />
recognition? In ICCV’09. IEEE, 2009.<br />
<br />
Kavukcuoglu, Koray, Ranzato, Marc’Aurelio, and LeCun,<br />
Yann. Fast inference in sparse coding algorithms<br />
with applications to object recognition. Technical Report<br />
CBLL-TR-2008-12-01, Computational and Biological<br />
Learning Lab, Courant Institute, NYU, 2008.<br />
<br />
Lee, H., Battle, A., Raina, R., and Ng, A.Y. Efficient<br />
sparse coding algorithms. In NIPS’06, 2006.<br />
<br />
Lee, H., Chaitanya, E., and Ng, A. Y. Sparse deep belief<br />
net model for visual area v2. In Advances in Neural<br />
Information Processing Systems, 2007.<br />
<br />
Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Convolutional<br />
deep belief networks for scalable unsupervised<br />
learning of hierarchical representations. In International<br />
Conference on Machine Learning. ACM New York, 2009.<br />
Li, Y. and Osher, S. Coordinate descent optimization for<br />
l1 minimization with application to compressed sensing;<br />
a greedy algorithm. Inverse Problems and Imaging, 3<br />
(3):487–503, 2009.<br />
<br />
Mairal, J., Elad, M., and Sapiro, G. Sparse representation<br />
for color image restoration. IEEE T. Image Processing,<br />
17(1):53–69, January 2008.<br />
<br />
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online<br />
dictionary learning for sparse coding. In ICML’09, 2009.<br />
Olshausen, B.A. and Field, D. Emergence of simple-cell<br />
receptive field properties by learning a sparse code for<br />
natural images. Nature, 381(6583):607–609, 1996.<br />
<br />
Ranzato, M., Huang, F.-J., Boureau, Y.-L., and LeCun,<br />
Y. Unsupervised learning of invariant feature hierarchies<br />
with applications to object recognition. In CVPR’07.<br />
IEEE, 2007a.<br />
<br />
Ranzato, M.-A., Boureau, Y.-L., Chopra, S., and LeCun,<br />
Y. A unified energy-based framework for unsupervised<br />
learning. In AI-Stats’07, 2007b.<br />
<br />
Rozell, C.J., Johnson, D.H, Baraniuk, R.G., and Olshausen,<br />
B.A. Sparse coding via thresholding and local<br />
competition in neural circuits. Neural Computation, 20:<br />
2526–2563, 2008.<br />
<br />
Vonesch, C. and Unser, M. A fast iterative thresholding algorithm<br />
for wavelet-regularized deconvolution. In IEEE<br />
ISBI, 2007.<br />
<br />
Wu, T.T. and Lange, K. Coordinate descent algorithms<br />
for lasso penalized regression. Ann. Appl. Stat, 2(1):<br />
224–244, 2008.<br />
<br />
Yang, Jianchao, Yu, Kai, Gong, Yihong, and Huang,<br />
Thomas. Linear spatial pyramid matching using sparse<br />
coding for image classification. In CVPR’09, 2009.<br />
Yu, Kai, Zhang, Tong, and Gong, Yihong. Nonlinear learning<br />
using local coordinate coding. In NIPS’09, 2009.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26841learning Fast Approximations of Sparse Coding2015-11-22T20:45:40Z<p>Derek: /* Conclusion */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
The algorithm for LCoD can be summarized as <br />
<br />
<br />
[[File:Q12.png]]<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations. <br />
<br />
A complete feature vector consisted of 25 concatenated such vectors, extracted<br />
from all 16 × 16 patches shifted by 3 pixels on the input.<br />
The features were extracted for all digits using<br />
CoD with exact inference, CoD with a fixed number of<br />
iterations, and LCoD. Additionally a version of CoD<br />
(denoted CoD’) used inference with a fixed number<br />
of iterations during training of the filters, and used<br />
the same number of iterations during test (same complexity<br />
as LCoD). A logistic regression classifier was<br />
trained on the features thereby obtained.<br />
<br />
Classification errors on the test set are shown in the following figures . While the error rate decreases with the<br />
number of iterations for all methods, the error rate<br />
of LCoD with 10 iterations is very close to the optimal<br />
(differences in error rates of less than 0.1% are<br />
insignificant on MNIST)<br />
<br />
[[File:T1.png]]<br />
<br />
MNIST results with 784-D sparse codes<br />
<br />
MNIST results with 25 256-D sparse codes extracted<br />
from 16 × 16 patches every 3 pixels<br />
<br />
<br />
[[File:T2.png]]<br />
<br />
= References =<br />
References<br />
Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding<br />
algorithm with application to waveletbased<br />
image deblurring. ICASSP’09, pp. 693–696, 2009.<br />
Chen, S.S., Donoho, D.L., and Saunders, M.A. Atomic<br />
decomposition by basis pursuit. SIAM review, 43(1):<br />
129–159, 2001.<br />
<br />
Daubechies, I, Defrise, M., and De Mol, C. An iterative<br />
thresholding algorithm for linear inverse problems with a<br />
sparsity constraint. Comm. on Pure and Applied Mathematics,<br />
57:1413–1457, 2004.<br />
<br />
Donoho, D.L. and Elad, M. Optimally sparse representation<br />
in general (nonorthogonal) dictionaries via ℓ<br />
1 minimization.<br />
PNAS, 100(5):2197–2202, 2003.<br />
<br />
Elad, M. and Aharon, M. Image denoising via learned dictionaries<br />
and sparse representation. In CVPR’06, 2006.<br />
Hale, E.T., Yin, W., and Zhang, Y. Fixed-point continuation<br />
for l1-minimization: Methodology and convergence.<br />
SIAM J. on Optimization, 19:1107, 2008.<br />
Hoyer, P. O. Non-negative matrix factorization with<br />
sparseness constraints. JMLR, 5:1457–1469, 2004.<br />
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun,<br />
Y. What is the best multi-stage architecture for object<br />
recognition? In ICCV’09. IEEE, 2009.<br />
<br />
Kavukcuoglu, Koray, Ranzato, Marc’Aurelio, and LeCun,<br />
Yann. Fast inference in sparse coding algorithms<br />
with applications to object recognition. Technical Report<br />
CBLL-TR-2008-12-01, Computational and Biological<br />
Learning Lab, Courant Institute, NYU, 2008.<br />
<br />
Lee, H., Battle, A., Raina, R., and Ng, A.Y. Efficient<br />
sparse coding algorithms. In NIPS’06, 2006.<br />
<br />
Lee, H., Chaitanya, E., and Ng, A. Y. Sparse deep belief<br />
net model for visual area v2. In Advances in Neural<br />
Information Processing Systems, 2007.<br />
<br />
Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Convolutional<br />
deep belief networks for scalable unsupervised<br />
learning of hierarchical representations. In International<br />
Conference on Machine Learning. ACM New York, 2009.<br />
Li, Y. and Osher, S. Coordinate descent optimization for<br />
l1 minimization with application to compressed sensing;<br />
a greedy algorithm. Inverse Problems and Imaging, 3<br />
(3):487–503, 2009.<br />
<br />
Mairal, J., Elad, M., and Sapiro, G. Sparse representation<br />
for color image restoration. IEEE T. Image Processing,<br />
17(1):53–69, January 2008.<br />
<br />
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online<br />
dictionary learning for sparse coding. In ICML’09, 2009.<br />
Olshausen, B.A. and Field, D. Emergence of simple-cell<br />
receptive field properties by learning a sparse code for<br />
natural images. Nature, 381(6583):607–609, 1996.<br />
<br />
Ranzato, M., Huang, F.-J., Boureau, Y.-L., and LeCun,<br />
Y. Unsupervised learning of invariant feature hierarchies<br />
with applications to object recognition. In CVPR’07.<br />
IEEE, 2007a.<br />
<br />
Ranzato, M.-A., Boureau, Y.-L., Chopra, S., and LeCun,<br />
Y. A unified energy-based framework for unsupervised<br />
learning. In AI-Stats’07, 2007b.<br />
<br />
Rozell, C.J., Johnson, D.H, Baraniuk, R.G., and Olshausen,<br />
B.A. Sparse coding via thresholding and local<br />
competition in neural circuits. Neural Computation, 20:<br />
2526–2563, 2008.<br />
<br />
Vonesch, C. and Unser, M. A fast iterative thresholding algorithm<br />
for wavelet-regularized deconvolution. In IEEE<br />
ISBI, 2007.<br />
<br />
Wu, T.T. and Lange, K. Coordinate descent algorithms<br />
for lasso penalized regression. Ann. Appl. Stat, 2(1):<br />
224–244, 2008.<br />
<br />
Yang, Jianchao, Yu, Kai, Gong, Yihong, and Huang,<br />
Thomas. Linear spatial pyramid matching using sparse<br />
coding for image classification. In CVPR’09, 2009.<br />
Yu, Kai, Zhang, Tong, and Gong, Yihong. Nonlinear learning<br />
using local coordinate coding. In NIPS’09, 2009.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26809learning Fast Approximations of Sparse Coding2015-11-22T09:50:09Z<p>Derek: /* Background */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which utilizes these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will always represent the corresponding two components of the code as nearly equal. This inability to select only one of the components and suppress the redundant other clearly limits the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
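<br />
As a toy illustration of this limitation (the dimensions and names below are arbitrary), the following sketch builds a single-layer encoder with the shrinkage non-linearity and duplicates one filter row; the two corresponding code components are then forced to take the same value, so the network cannot keep one and suppress the other.<br />
<pre>
def single_layer_encoder(params, x):
    # Simplest baseline encoder: one linear filter bank followed by the
    # shrinkage non-linearity, with no interaction between code components.
    W, theta = params
    return shrink(W @ x, theta)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 8))
W[1] = W[0]                        # two identical rows in the weight matrix
theta = np.full(5, 0.1)
z = single_layer_encoder((W, theta), rng.standard_normal(8))
assert np.isclose(z[0], z[1])      # both components are activated equally
</pre>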
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing a fixed number of steps of ISTA or of Coordinate Descent. The choice of ISTA versus Coordinate Descent gives two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
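<br />
A minimal sketch of the LISTA forward pass is shown below: it simply unrolls the ISTA recursion (**) for ''T'' steps, treating <math> \, W_e </math>, <math> \, S </math>, and <math> \, \theta </math> as free parameters to be learned. The initialization helper assumes the dictionary is stored as an ''n'' x ''m'' matrix with one column per code component, and the SGD / back-propagation-through-time updates are left to any standard autodiff framework.<br />
<pre>
def lista_forward(x, params, T=3):
    # Unrolled LISTA encoder: T shrinkage steps sharing the same learned
    # parameters W_e, S, theta (a "time-unfolded" recurrent network).
    W_e, S, theta = params
    b = W_e @ x
    z = shrink(b, theta)
    for _ in range(T - 1):
        z = shrink(b + S @ z, theta)
    return z

def init_lista_params(W_d, alpha, L):
    # Start the learnable parameters at their analytic ISTA values.
    W_e = W_d.T / L
    S = np.eye(W_d.shape[1]) - (W_d.T @ W_d) / L
    theta = np.full(W_d.shape[1], alpha / L)
    return W_e, S, theta
</pre>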
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity), iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to the LISTA procedure, except for the technicality that sub-gradients are propagated, which results from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" denotes ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing the LCoD matrices with their Coordinate Descent values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" denotes ''m'' = 400. Open circles indicate that the LCoD matrices were initialized with their Coordinate Descent values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images, producing 784-dimensional codes, and extracted 16x16-pixel patches, producing codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derek
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Performance =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26807learning Fast Approximations of Sparse Coding2015-11-22T09:40:28Z<p>Derek: /* MNIST Digits */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error across all procedures, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26806learning Fast Approximations of Sparse Coding2015-11-22T09:39:16Z<p>Derek: /* Background */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26805learning Fast Approximations of Sparse Coding2015-11-22T09:37:57Z<p>Derek: /* Learned ISTA & Learned Coordinate Descent */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper bound on the largest eigenvalue of <math> W_d^TW_d </math>, and <math>\, h_{\theta}(\cdot) </math> is the shrinkage (soft-thresholding) function with components <math> \, h_{\theta}(V)_i = \operatorname{sign}(V_i)\max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
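<br />
A minimal sketch of this procedure is given below (hypothetical code using numpy; details such as computing <math> \, L </math> from the spectral norm are assumptions, not taken from the paper):<br />
<br />
<pre>
import numpy as np

def soft_threshold(V, theta):
    # Component-wise shrinkage: h_theta(V)_i = sign(V_i) * max(|V_i| - theta_i, 0).
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def ista(X, W_d, alpha, n_iter=100):
    """Plain ISTA for the sparse coding loss; W_d is assumed to be (n, m)."""
    n, m = W_d.shape
    L = np.linalg.norm(W_d, 2) ** 2           # largest eigenvalue of W_d^T W_d (a valid bound L)
    W_e = W_d.T / L                           # filter matrix
    S = np.eye(m) - (W_d.T @ W_d) / L         # mutual-inhibition matrix
    theta = alpha / L                         # sparsity threshold, identical for all components
    B = W_e @ X
    Z = np.zeros(m)
    for _ in range(n_iter):
        Z = soft_threshold(B + S @ Z, theta)  # the update rule (**)
    return Z
</pre>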
<br />
Depending on a few implementation choices, the per-iteration time complexity of using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
=== Fast ISTA ===<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda \left( h_{\theta}(Z^{(k-1)}) - h_{\theta}(Z^{(k - 2)}) \right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the shrinkage outputs of the preceding two iterations. This second, momentum-like term is proportional to the rate at which the approximated code is changing, which accelerates convergence.<br />
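<br />
For reference, the sketch below shows one standard way of realizing this momentum idea: the usual FISTA extrapolation scheme of Beck and Teboulle. This is a common formulation rather than a literal transcription of the simplified update written above, and it reuses the numpy import and the soft_threshold helper from the ISTA sketch:<br />
<br />
<pre>
def fista(X, W_d, alpha, n_iter=100):
    """Fast ISTA: ISTA with a momentum (extrapolation) term for faster convergence."""
    n, m = W_d.shape
    L = np.linalg.norm(W_d, 2) ** 2
    W_e, S, theta = W_d.T / L, np.eye(m) - (W_d.T @ W_d) / L, alpha / L
    B = W_e @ X
    Z_prev = np.zeros(m)
    Y = Z_prev.copy()                                 # extrapolated point carrying the momentum
    t = 1.0
    for _ in range(n_iter):
        Z = soft_threshold(B + S @ Y, theta)          # shrinkage step taken at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = Z + ((t - 1.0) / t_next) * (Z - Z_prev)   # momentum: a multiple of the change in the code
        Z_prev, t = Z, t_next
    return Z_prev
</pre>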
<br />
== Coordinate Descent ==<br />
<br />
Instead of updating all the entries of the code in parallel, we might consider strategically selecting a single component to update at each iteration. Coordinate Descent takes this approach and, as a result, yields a better approximation than the parallel ISTA methods in the same order of time. In fact, prior to this work, Coordinate Descent was widely regarded as the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, produces the largest change relative to the current code. This search step takes <math> \, O(m) </math> operations, and so, also accounting for each component-wise optimization performed (which follows a process similar to that of the parallel case), <math> \, O(m) </math> such updates require <math> \, O(m^2) </math> steps, comparable to one iteration of ISTA. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, for a cost of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, when both are run for approximately the same amount of time, Coordinate Descent out-performs the ISTA methods in its approximation to the optimal code. A sketch of one common formulation of the procedure follows.<br />
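<br />
Below is a minimal sketch of such a Coordinate Descent procedure (hypothetical code; it assumes unit-norm dictionary columns so that each single-coordinate subproblem is solved exactly by soft-thresholding with threshold <math> \, \alpha </math>, and it reuses numpy and the soft_threshold helper from the ISTA sketch above):<br />
<br />
<pre>
def coordinate_descent(X, W_d, alpha, n_steps=100):
    """Greedy Coordinate Descent for the sparse coding loss (illustrative sketch)."""
    n, m = W_d.shape
    S = np.eye(m) - W_d.T @ W_d                # mutual-inhibition matrix (zero diagonal for unit-norm columns)
    B = W_d.T @ X
    Z = np.zeros(m)
    for _ in range(n_steps):
        Z_bar = soft_threshold(B, alpha)       # per-coordinate minimizers with all other entries held fixed
        k = int(np.argmax(np.abs(Z_bar - Z)))  # O(m) search: coordinate whose update changes the code most
        B += (Z_bar[k] - Z[k]) * S[:, k]       # O(m) incremental update of B after changing coordinate k
        Z[k] = Z_bar[k]
    return soft_threshold(B, alpha)
</pre>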
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, in the training phase we train a neural network which takes an original input <math> \, X \in \mathbb{R}^n </math> and predicts its optimal code with respect to the previously-estimated dictionary. The training set consists of the original inputs <math> \, X </math>, paired with their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared error between the network's predictions and these estimated sparse codes. The size of the network is chosen so that applying it remains feasible for online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straightforward approach to this task would be to use a single-layer feed-forward network. However, since we additionally require the network output to be sparse, special consideration must be given to the activation function. The authors consider three candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}(\cdot) </math> used for ISTA. The three choices perform comparably in the authors' empirical testing, so they opt for <math> \, h_{\theta}(\cdot) </math> to maintain a strong basis for comparison with the previous methods. A minimal sketch of such an encoder and its training loop is given below.<br />
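<br />
The sketch below illustrates this simple configuration together with the training procedure from the previous subsection (hypothetical code: it builds target codes with the coordinate_descent sketch above, keeps the threshold vector fixed rather than learning it, and derives the sub-gradient of the shrinkage non-linearity by hand; none of these details are taken verbatim from the paper):<br />
<br />
<pre>
def train_simple_encoder(inputs, W_d, alpha, lr=0.1, n_epochs=5, seed=0):
    """Fit a single-layer encoder Z ~= h_theta(W X) to Coordinate Descent codes by SGD."""
    n, m = W_d.shape
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((m, n))               # learned filter matrix
    theta = alpha * np.ones(m)                           # fixed sparsity thresholds in this sketch
    for _ in range(n_epochs):
        for X in inputs:
            Z_star = coordinate_descent(X, W_d, alpha)   # target sparse code
            pre = W @ X
            Z_hat = soft_threshold(pre, theta)           # encoder prediction
            err = Z_hat - Z_star                         # gradient of 0.5 * ||Z_hat - Z_star||^2 w.r.t. Z_hat
            dpre = err * (np.abs(pre) > theta)           # sub-gradient through the shrinkage non-linearity
            W -= lr * np.outer(dpre, X)                  # stochastic gradient step on the squared error
    return W, theta
</pre>
<br />
In practice one would iterate over mini-batches and might also adapt <math> \, \theta </math>, but this already captures the regression-onto-codes training setup described above.<br />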
<br />
Despite the appeal of its simplicity, this network configuration is unable to model "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will always assign nearly equal values to the two corresponding components of the code. This inability to select one of the components and suppress the other, redundant one is a clear limitation on the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
To understand the rationale behind this approach, we must first recognize a few relevant quantities that are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, the encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math> from data, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of time steps ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. Because <math> \, S </math> is shared across the <math> T </math> time steps, we use back-propagation through time to compute the error gradient for the descent updates.<br />
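<br />
A minimal sketch of the resulting LISTA forward pass is shown below (hypothetical code reusing the soft_threshold helper above; the learned <math> \, W_e </math>, <math> \, S </math>, and <math> \, \theta </math> would be obtained from the training just described):<br />
<br />
<pre>
def lista_encode(X, W_e, S, theta, T=3):
    """LISTA forward pass: T unrolled ISTA-style steps with learned parameters."""
    B = W_e @ X                                # learned filter applied once to the input
    Z = soft_threshold(B, theta)               # initial code estimate
    for _ in range(T):
        Z = soft_threshold(B + S @ Z, theta)   # unrolled update (**); S and theta shared across steps
    return Z
</pre>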
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to the procedure for LISTA, except for the technicality that sub-gradients are propagated, owing to the search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, on sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>; the reported error is the squared distance between the estimated code and the optimal code as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images, producing 784-dimensional codes, and extracted 16x16-pixel patches, producing codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26803learning Fast Approximations of Sparse Coding2015-11-22T09:33:28Z<p>Derek: /* Iterative Shrinkage & Thresholding (ISTA) */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> \, L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26802learning Fast Approximations of Sparse Coding2015-11-22T09:32:49Z<p>Derek: /* Iterative Shrinkage & Thresholding (ISTA) */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, ''L'' is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images, producing 784-dimensional codes, and 16x16-pixel patches extracted from the images, producing codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26801learning Fast Approximations of Sparse Coding2015-11-22T09:31:40Z<p>Derek: /* Empirical Results */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible inputs from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This shortcoming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> (with <math> m > n </math>) which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose normalized columns are the basis vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = \arg\min_Z E_{W_d}(X, Z) </math>. <br />
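<br />
To make the objective concrete, here is a minimal NumPy sketch of the energy above for a single input; the function name is illustrative and <math> \, W_d </math> is assumed to be the <math> n \times m </math> dictionary with normalized columns.<br />
<br />
<pre>
import numpy as np

def sparse_coding_energy(X, Z, W_d, alpha):
    # E_{W_d}(X, Z) = 0.5 * ||X - W_d Z||_2^2 + alpha * ||Z||_1
    reconstruction_error = 0.5 * np.sum((X - W_d @ Z) ** 2)
    sparsity_penalty = alpha * np.sum(np.abs(Z))
    return reconstruction_error + sparsity_penalty
</pre>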
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
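<br />
As a rough illustration of this outer loop (not the authors' exact recipe), one might alternate coding with a stochastic gradient step on the dictionary; the step size, the column renormalization, and the assumption that <math> \, Z^* </math> has already been computed for the current sample are all choices made for the sketch.<br />
<br />
<pre>
import numpy as np

def dictionary_sgd_step(X, Z_star, W_d, lr=0.01):
    # Gradient of 0.5 * ||X - W_d Z*||^2 with respect to W_d is -(X - W_d Z*) Z*^T
    residual = X - W_d @ Z_star
    W_d = W_d + lr * np.outer(residual, Z_star)
    # Re-normalize the dictionary columns after the update
    return W_d / np.linalg.norm(W_d, axis=0, keepdims=True)
</pre>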
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we take a gradient step that shifts the current code in the direction which most reduces the reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper bound on the largest eigenvalue of <math> W_d^TW_d </math>, and <math>\, h_{\theta}(\cdot) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \operatorname{sign}(V_i)\,\max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
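<br />
For concreteness, a minimal NumPy sketch of this recursion, run for a fixed number of iterations rather than to convergence, might look as follows; taking <math> \, L </math> to be the largest eigenvalue of <math> \, W_d^TW_d </math> and using a scalar threshold are choices made for the sketch.<br />
<br />
<pre>
import numpy as np

def shrink(V, theta):
    # h_theta(V)_i = sign(V_i) * max(|V_i| - theta_i, 0)
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def ista(X, W_d, alpha, n_iter=100):
    L = np.linalg.eigvalsh(W_d.T @ W_d).max()      # bound on the spectrum of W_d^T W_d
    W_e = W_d.T / L                                # filter matrix
    S = np.eye(W_d.shape[1]) - (W_d.T @ W_d) / L   # mutual-inhibition matrix
    theta = alpha / L                              # typical threshold setting
    Z = np.zeros(W_d.shape[1])
    B = W_e @ X                                    # fixed across iterations
    for _ in range(n_iter):
        Z = shrink(B + S @ Z, theta)               # update rule (**)
    return Z
</pre>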
<br />
=== Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda \left(h_{\theta}(Z^{(k-1)}) - h_{\theta}(Z^{(k - 2)})\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the shrinkage outputs of the preceding two iterations. This second momentum term reflects the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of updating all the entries of the code in parallel, we might consider strategically selecting a single component to update at each iteration. Coordinate Descent adopts this strategy and, as a result, yields a better approximation than the parallel ISTA methods in a comparable amount of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change relative to the current code. This search step takes <math> \, O(m) </math> operations, so, also accounting for the component-wise optimization performed at each step (which follows a process similar to the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, when both are run for an approximately equal amount of time, Coordinate Descent will outperform the ISTA methods in its approximation to an optimal code.<br />
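<br />
The following is one straightforward NumPy rendering of this greedy step for a dictionary with unit-norm columns; it recomputes the residual in full each iteration, giving the <math> \, O(nm) </math> per-iteration variant, and is not necessarily the exact bookkeeping used by the authors.<br />
<br />
<pre>
import numpy as np

def greedy_coordinate_descent(X, W_d, alpha, n_iter=50):
    # Each iteration optimizes the single coordinate whose update changes the code the most
    Z = np.zeros(W_d.shape[1])
    for _ in range(n_iter):
        residual = X - W_d @ Z
        corr = W_d.T @ residual + Z                # single-coordinate optima share this term
        candidates = np.sign(corr) * np.maximum(np.abs(corr) - alpha, 0.0)
        i = np.argmax(np.abs(candidates - Z))      # coordinate inducing the largest change
        Z[i] = candidates[i]
    return Z
</pre>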
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, in the training phase, we train a neural network which takes an original input <math> \, X \in \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set consists of the original inputs <math> \, X </math>, with their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared error between the network's predictions and these estimated sparse codes. The size of the network is chosen with consideration of its feasibility for online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straightforward approach to this task would be to use a single-layer feed-forward network. However, since the network output is additionally required to be sparse, special consideration must be given to the activation function. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
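<br />
A minimal sketch of this baseline encoder and one stochastic-gradient step on the squared-error code-prediction loss is given below; the target code <math> \, Z^* </math> is assumed to have been precomputed with Coordinate Descent, and treating the sub-gradient of the shrinkage non-linearity as zero at its kink is an implementation choice for the sketch rather than something specified in the paper.<br />
<br />
<pre>
import numpy as np

def shrink(V, theta):
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def encoder_sgd_step(X, Z_star, W_e, theta, lr=0.01):
    # Predict Z_hat = h_theta(W_e X) and take one SGD step on 0.5 * ||Z_hat - Z_star||^2
    pre = W_e @ X
    Z_hat = shrink(pre, theta)
    err = Z_hat - Z_star
    active = (np.abs(pre) > theta).astype(float)      # sub-gradient of the shrinkage
    W_e = W_e - lr * np.outer(err * active, X)        # gradient w.r.t. W_e is (err * active) X^T
    theta = theta + lr * err * active * np.sign(pre)  # d h_i / d theta_i = -sign(pre_i) when active
    return W_e, theta, Z_hat
</pre>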
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
To understand the rationale behind this approach, first note that a few relevant quantities are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, the encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated a fixed number of times ''T''. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> are learned by stochastic gradient descent on the previously-described squared-error loss for sparse code prediction. Since <math> \, S </math> is shared across the <math> T </math> time steps, back-propagation through time is used to compute the required gradients.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity), iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to the LISTA procedure, except for the technicality that sub-gradients must be propagated, because each step searches for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that constructing encoders from a fixed number of iterations of ISTA or Coordinate Descent reaches a pre-specified error rate with significantly less runtime than the pre-existing procedures.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>; error was measured as the squared distance between the predicted code and the optimal code computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably outperforms FISTA. Furthermore, 18 iterations of FISTA are required to achieve the error rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is roughly 20 times faster than FISTA at a given approximation quality. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately large number of iterations, LCoD achieves better accuracy than Coordinate Descent. When the number of iterations is larger, Coordinate Descent outperforms LCoD, but this can be reversed by initializing the matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center><br />
<br />
== MNIST Digits ==<br />
<br />
Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:LCOD.png&diff=26799File:LCOD.png2015-11-22T08:03:04Z<p>Derek: </p>
<hr />
<div></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26798learning Fast Approximations of Sparse Coding2015-11-22T08:02:48Z<p>Derek: /* Empirical Results */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was drawn from in assessing whether improved error-rates in code prediction yields superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent. <br />
<br />
Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400 ]]<br />
</center><br />
<br />
Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training. <br />
<br />
<center><br />
[[File:LCOD.png |frame | center |Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes ''m'' = 100, and "4x" for ''m'' = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training ]]<br />
</center></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26797learning Fast Approximations of Sparse Coding2015-11-22T07:54:17Z<p>Derek: /* Berkeley Image Database */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder takes the form defined by the recursive ISTA update (**), iterated a fixed number of times ''T''. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> are learned by stochastic gradient descent, minimizing the previously-described squared-error loss for sparse code prediction. Since <math> \, S </math> is shared across the <math> T </math> time steps, back-propagation through time is used to compute the error gradient.<br />
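<br />
A minimal sketch of the corresponding LISTA encoder (again in PyTorch; an illustrative assumption rather than the authors' implementation) simply unrolls the ISTA update (**) for ''T'' steps with learned, shared parameters, and can be trained with the same squared-error loss and SGD loop as the single-layer sketch above; the backward pass through the shared <math> \, S </math> is exactly back-propagation through time:<br />
<br />
<pre>
import torch

class LISTA(torch.nn.Module):
    # Unrolled ISTA: Z^(0) = 0;  Z^(k+1) = h_theta(We x + S Z^(k)),  k = 0, ..., T-1.
    # We, S and theta are learned and shared across all T steps.
    def __init__(self, n, m, T=3):
        super().__init__()
        self.T = T
        self.We = torch.nn.Parameter(0.1 * torch.randn(m, n))
        self.S = torch.nn.Parameter(0.1 * torch.randn(m, m))
        self.theta = torch.nn.Parameter(0.1 * torch.ones(m))

    def forward(self, x):                       # x: (batch, n)
        b = x @ self.We.t()                     # (batch, m)
        z = torch.zeros_like(b)
        for _ in range(self.T):
            c = b + z @ self.S.t()
            z = torch.sign(c) * torch.clamp(c.abs() - self.theta, min=0.0)   # shrinkage h_theta
        return z
</pre>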
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to LISTA, except for the technicality that sub-gradients must be propagated through the search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, with error measured as the squared distance between the estimated code and the optimal code computed by Coordinate Descent. Figure 1 shows that, for a small number of iterations, LISTA notably outperforms FISTA. Moreover, 18 iterations of FISTA are required to reach the error produced by 1 iteration of LISTA when ''m'' = 100, and 35 iterations when ''m'' = 400, leading the authors to conclude that LISTA is roughly 20 times faster than FISTA at a given approximation quality. <br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. Squared error in code prediction plotted against the number of iterations. In the legend, "1x" denotes ''m'' = 100 and "4x" denotes ''m'' = 400. ]]<br />
</center></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:LISTA2.png&diff=26796File:LISTA2.png2015-11-22T07:40:53Z<p>Derek: </p>
<hr />
<div></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26795learning Fast Approximations of Sparse Coding2015-11-22T07:40:12Z<p>Derek: /* Berkeley Image Database */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose normalized columns are the vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = \arg\min_Z E_{W_d}(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
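<br />
As a concrete illustration of this objective (a Python/NumPy sketch, not code from the original paper; the orientation of <math> \, W_d \in \mathbb{R}^{n \times m} </math> follows the reconstruction term above, and all names are illustrative):<br />
<br />
<pre>
import numpy as np

def sparse_coding_energy(X, Z, Wd, alpha):
    # E_{Wd}(X, Z) = 1/2 * ||X - Wd Z||_2^2 + alpha * ||Z||_1
    # X: (n,) input; Z: (m,) candidate code; Wd: (n, m) dictionary with normalized columns.
    residual = X - Wd @ Z
    return 0.5 * residual @ residual + alpha * np.abs(Z).sum()
</pre>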
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper bound on the largest eigenvalue of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \operatorname{sign}(V_i)\,\max(|V_i| - \theta_i,\, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
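<br />
A direct transcription of this update into Python/NumPy (an illustrative sketch rather than the authors' code; the variable names and iteration count are assumptions, and <math> \, W_d </math> is taken to be <math> n \times m </math> as above):<br />
<br />
<pre>
import numpy as np

def shrink(V, theta):
    # h_theta(V)_i = sign(V_i) * max(|V_i| - theta_i, 0)
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def ista(X, Wd, alpha, n_iter=100):
    # Plain ISTA for a single input X (shape (n,)) and dictionary Wd (shape (n, m)).
    L = np.linalg.norm(Wd.T @ Wd, 2)              # largest eigenvalue of Wd^T Wd (any upper bound works)
    We = Wd.T / L                                 # filter matrix
    S = np.eye(Wd.shape[1]) - (Wd.T @ Wd) / L     # mutual-inhibition matrix
    theta = alpha / L                             # thresholds (broadcast over all components)
    B = We @ X
    Z = np.zeros(Wd.shape[1])
    for _ in range(n_iter):
        Z = shrink(B + S @ Z, theta)              # the update (**)
    return Z
</pre>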
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}\left(Z^{(k)}\right) + \lambda \left(h_{\theta}\left(Z^{(k-1)}\right) - h_{\theta}\left(Z^{(k - 2)}\right)\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the shrinkage outputs of the preceding two iterations. This second term reflects the rate at which the approximated code is changing.<br />
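<br />
The update written above is a simplified paraphrase of the momentum idea. In the standard formulation of Fast ISTA (FISTA, due to Beck and Teboulle), the gradient-plus-shrinkage step is instead applied at a point extrapolated from the two most recent code iterates. The following Python/NumPy sketch shows that standard version, reusing the same quantities as the ISTA sketch above (an illustration under those assumptions, not the authors' code):<br />
<br />
<pre>
import numpy as np

shrink = lambda V, th: np.sign(V) * np.maximum(np.abs(V) - th, 0.0)

def fista(X, Wd, alpha, n_iter=100):
    # Fast ISTA: the ISTA step is applied at a point Y extrapolated from the two
    # most recent code iterates, which is what supplies the momentum.
    m = Wd.shape[1]
    L = np.linalg.norm(Wd.T @ Wd, 2)
    We, S, theta = Wd.T / L, np.eye(m) - (Wd.T @ Wd) / L, alpha / L
    B = We @ X
    Z, Y, t = np.zeros(m), np.zeros(m), 1.0
    for _ in range(n_iter):
        Z_new = shrink(B + S @ Y, theta)                     # gradient + shrinkage step at Y
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = Z_new + ((t - 1.0) / t_new) * (Z_new - Z)        # momentum from the last two iterates
        Z, t = Z_new, t_new
    return Z
</pre>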
<br />
== Coordinate Descent ==<br />
<br />
Instead of updating all the entries of the code in parallel, we might consider strategically selecting a single component to update at each iteration. Coordinate Descent adopts this strategy and, as a result, yields a better approximation than the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely regarded as the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, produces the largest change relative to the current code. This search step takes <math> \, O(m) </math> operations, so, also accounting for each component-wise optimization performed (which follows a process similar to the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, repeating the update process <math> \, O(n) </math> times gives a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, when both are run for approximately the same amount of time, Coordinate Descent outperforms the ISTA methods in its approximation to the optimal code.<br />
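<br />
Although the exact update is not reproduced in this summary, the following Python/NumPy sketch shows one standard greedy coordinate-descent formulation of this idea (the definitions of the auxiliary quantities below are a reconstruction that assumes normalized dictionary columns, not text taken from this summary):<br />
<br />
<pre>
import numpy as np

def coordinate_descent(X, Wd, alpha, n_iter=100):
    # Greedy coordinate descent for sparse coding: keep running "potentials" B, and at each
    # step commit only the single component whose shrinkage value would change the most.
    m = Wd.shape[1]
    S = np.eye(m) - Wd.T @ Wd                 # mutual-inhibition matrix (columns of Wd assumed normalized)
    B = Wd.T @ X
    Z = np.zeros(m)
    for _ in range(n_iter):
        Zbar = np.sign(B) * np.maximum(np.abs(B) - alpha, 0.0)   # tentative shrinkage of every component
        k = np.argmax(np.abs(Zbar - Z))                          # component inducing the largest change
        B += S[:, k] * (Zbar[k] - Z[k])                          # O(m) update of the potentials
        Z[k] = Zbar[k]
    return np.sign(B) * np.maximum(np.abs(B) - alpha, 0.0)       # final shrinkage of all components
</pre>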
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
Seeking to improve further upon the efficiency of Coordinate Descent, the authors propose using feed-forward networks for real-time sparse code inference. In the training phase, we train a neural network which takes an original input <math> \, X \in \mathbb{R}^n </math> and predicts its optimal code with respect to the previously-estimated dictionary. The training set consists of the inputs <math> \, X </math> together with their sparse codes estimated via Coordinate Descent as the target values. The network weights are learned by stochastic gradient descent, minimizing the average squared error between the network's predictions and these estimated codes. The size of the network is chosen so that applying it remains feasible for online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straightforward approach to this task is a single-layer feed-forward network. However, since the network output is additionally required to be sparse, the choice of activation function needs special consideration. The authors consider three candidates: a double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three perform comparably in the authors' empirical testing, so they opt for <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this configuration is unable to model "explaining-away", a conditional-independence structure relevant to this task. In practice, this means that if the learned weight matrix contains two highly similar rows, the network will always assign nearly equal values to the two corresponding code components. Its inability to select one component and suppress the other, redundant one limits how sparse its encodings can be. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
To understand the rationale behind this approach, first note a few quantities that are fixed automatically when executing ISTA or Coordinate Descent. The recursive updates iterated in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Instead of using these fully-determined parameter forms and iterating the procedure until convergence, the encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math> from data, and then execute only a fixed number of steps of one of these procedures, reducing the total computational cost. Setting these parameters adaptively from the available data allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder takes the form defined by the recursive ISTA update (**), iterated a fixed number of times ''T''. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> are learned by stochastic gradient descent, minimizing the previously-described squared-error loss for sparse code prediction. Since <math> \, S </math> is shared across the <math> T </math> time steps, back-propagation through time is used to compute the error gradient.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to LISTA, except for the technicality that sub-gradients must be propagated through the search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent.<br />
<br />
<center><br />
[[File:LISTA2.png |frame | center |Figure 1: Comparison of LISTA and FISTA. ]]<br />
</center></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:LISTA.png&diff=26794File:LISTA.png2015-11-22T07:38:48Z<p>Derek: </p>
<hr />
<div></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26793learning Fast Approximations of Sparse Coding2015-11-22T07:38:25Z<p>Derek: /* Berkeley Image Database */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose normalized columns are the vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}\left(Z^{(k)}\right) + \lambda \left(h_{\theta}\left(Z^{(k-1)}\right) - h_{\theta}\left(Z^{(k - 2)}\right)\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the shrinkage outputs of the preceding two iterations. This second term reflects the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
To understand the rationale behind this approach, first note a few quantities that are fixed automatically when executing ISTA or Coordinate Descent. The recursive updates iterated in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Instead of using these fully-determined parameter forms and iterating the procedure until convergence, the encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math> from data, and then execute only a fixed number of steps of one of these procedures, reducing the total computational cost. Setting these parameters adaptively from the available data allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to LISTA, except for the technicality that sub-gradients must be propagated through the search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent.<br />
<br />
<center><br />
[[File:LISTA.png |frame | center |Figure 1: Comparison of LISTA and FISTA. ]]<br />
</center></div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26792learning Fast Approximations of Sparse Coding2015-11-22T07:33:34Z<p>Derek: /* Berkeley Image Database */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose normalized columns are the vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}\left(Z^{(k)}\right) + \lambda \left(h_{\theta}\left(Z^{(k-1)}\right) - h_{\theta}\left(Z^{(k - 2)}\right)\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the shrinkage outputs of the preceding two iterations. This second term reflects the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
To understand the rationale behind this approach, first note a few quantities that are fixed automatically when executing ISTA or Coordinate Descent. The recursive updates iterated in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Instead of using these fully-determined parameter forms and iterating the procedure until convergence, the encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math> from data, and then execute only a fixed number of steps of one of these procedures, reducing the total computational cost. Setting these parameters adaptively from the available data allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to LISTA, except for the technicality that sub-gradients must be propagated through the search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes ''m'' = 100 and ''m'' = 400, using a sparsity penalty of <math> \alpha = 0.5 </math>, and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26791learning Fast Approximations of Sparse Coding2015-11-22T07:24:25Z<p>Derek: /* Learned ISTA & Learned Coordinate Descent */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose normalized columns are the vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}\left(Z^{(k)}\right) + \lambda \left(h_{\theta}\left(Z^{(k-1)}\right) - h_{\theta}\left(Z^{(k - 2)}\right)\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the shrinkage outputs of the preceding two iterations. This second term reflects the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
To understand the rationale behind this approach, first note a few quantities that are fixed automatically when executing ISTA or Coordinate Descent. The recursive updates iterated in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Instead of using these fully-determined parameter forms and iterating the procedure until convergence, the encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math> from data, and then execute only a fixed number of steps of one of these procedures, reducing the total computational cost. Setting these parameters adaptively from the available data allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times ''T''. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and ''S'' are learned analogously to LISTA, except for the technicality that sub-gradients must be propagated through the search for the component inducing the largest update in the code ''Z''.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Using a sparsity penalty of <math> \alpha = 0.5 </math>, the authors tested performance on dictionaries of sizes ''m'' = 100 and ''m'' = 400.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26790learning Fast Approximations of Sparse Coding2015-11-22T07:18:51Z<p>Derek: /* Learned ISTA & Learned Coordinate Descent */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix whose normalized columns are the vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
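<br />
A small NumPy sketch of this objective (illustrative names only):<br />
<pre>
import numpy as np

def sparse_coding_loss(x, z, W_d, alpha):
    # E_{W_d}(X, Z) = 0.5 * ||X - W_d Z||_2^2 + alpha * ||Z||_1
    reconstruction = 0.5 * np.sum((x - W_d @ z) ** 2)
    sparsity = alpha * np.sum(np.abs(z))
    return reconstruction + sparsity
</pre>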
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
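<br />
As a rough sketch, one such stochastic gradient step on the dictionary (with the code held fixed at its current estimate, and re-normalization of the dictionary elements applied afterwards as a common practice) might look like this; the names are illustrative assumptions:<br />
<pre>
import numpy as np

def dictionary_sgd_step(W_d, x, z_star, lr=0.01):
    # gradient of 0.5 * ||x - W_d z*||^2 w.r.t. W_d is -(x - W_d z*) z*^T,
    # so gradient descent adds a multiple of the outer product below
    residual = x - W_d @ z_star
    W_d = W_d + lr * np.outer(residual, z_star)
    # keep the dictionary elements (columns) normalized
    W_d /= np.maximum(np.linalg.norm(W_d, axis=0, keepdims=True), 1e-12)
    return W_d
</pre>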
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we take a gradient step on the reconstruction error, shifting the current code toward a better reconstruction, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \operatorname{sign}(V_i)\max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
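<br />
A sketch of ISTA for a single input in NumPy (illustrative names; here ''L'' is taken to be the largest eigenvalue of <math> W_d^TW_d </math>, computed from the spectral norm of the dictionary):<br />
<pre>
import numpy as np

def shrink(v, theta):
    # h_theta(v)_i = sign(v_i) * max(|v_i| - theta_i, 0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(x, W_d, alpha, n_iter=100):
    L = np.linalg.norm(W_d, 2) ** 2              # bound on eigenvalues of W_d^T W_d
    m = W_d.shape[1]
    W_e = W_d.T / L                              # filter matrix
    S = np.eye(m) - W_d.T @ W_d / L              # mutual-inhibition matrix
    theta = alpha / L                            # sparsity thresholds
    b = W_e @ x
    z = np.zeros(m)
    for _ in range(n_iter):
        z = shrink(b + S @ z, theta)             # update rule (**)
    return z
</pre>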
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda \left( h_{\theta}(Z^{(k-1)}) - h_{\theta}(Z^{(k-2)}) \right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term reflects the rate at which the approximated code is changing.<br />
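<br />
For concreteness, here is a sketch of one standard form of the accelerated update (NumPy; the precise momentum schedule used in the paper's implementation may differ from this common variant):<br />
<pre>
import numpy as np

def fast_ista(x, W_d, alpha, n_iter=100):
    # shrinkage steps are taken from an extrapolated point that adds momentum
    # based on the two most recent iterates
    L = np.linalg.norm(W_d, 2) ** 2
    m = W_d.shape[1]
    z_prev = np.zeros(m)
    y = np.zeros(m)
    t = 1.0
    for _ in range(n_iter):
        grad = W_d.T @ (W_d @ y - x)
        step = y - grad / L
        z = np.sign(step) * np.maximum(np.abs(step) - alpha / L, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z + ((t - 1.0) / t_next) * (z - z_prev)
        z_prev, t = z, t_next
    return z_prev
</pre>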
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, so, also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, when both are run for approximately equal amounts of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
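<br />
A sketch of one greedy coordinate-descent scheme of this kind (NumPy; illustrative names, and assuming unit-norm dictionary columns so that each coordinate-wise minimization reduces to a shrinkage):<br />
<pre>
import numpy as np

def shrink(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def coordinate_descent(x, W_d, alpha, n_iter=100):
    m = W_d.shape[1]
    S = np.eye(m) - W_d.T @ W_d          # mutual-inhibition matrix (no 1/L factor here)
    b = W_d.T @ x                        # running pre-activation vector
    z = np.zeros(m)
    for _ in range(n_iter):
        z_bar = shrink(b, alpha)
        k = np.argmax(np.abs(z_bar - z))        # coordinate with the largest change
        b = b + S[:, k] * (z_bar[k] - z[k])     # O(m) bookkeeping update
        z[k] = z_bar[k]
    return shrink(b, alpha)
</pre>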
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will train a neural network which takes an original input <math> \, X \in \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original inputs <math> \, X </math> paired with their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared error between the network's predictions and these estimated sparse codes. The size of the network is chosen with its feasibility for online processing in mind.<br />
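<br />
A sketch of this training objective (NumPy; ''encoder'' stands for whichever predictor is being trained, and the names are illustrative assumptions):<br />
<pre>
import numpy as np

def code_prediction_loss(encoder, inputs, target_codes):
    # mean squared error between the predicted codes and the codes estimated
    # by Coordinate Descent for the same inputs (the training targets)
    errors = [np.sum((encoder(x) - z_star) ** 2)
              for x, z_star in zip(inputs, target_codes)]
    return float(np.mean(errors))
</pre>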
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straightforward approach to this task would be to use a single-layer feed-forward network. However, since the network output is additionally required to be sparse, the choice of activation function needs special consideration. The authors consider three candidates: a double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three options perform comparably in the authors' empirical testing, and so they opt for <math> \, h_{\theta}( ) </math> to maintain a direct basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to capture "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will always represent the two corresponding components of the code as nearly equal. This inability to select only one of the components and suppress the other, redundant one limits the network's ability to produce truly sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times <math> \, T </math>. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> are learned analogously to the LISTA procedure, except for the technicality that sub-gradients must be propagated, since each step involves a non-smooth search for the component inducing the largest update in the code <math> \, Z </math>.<br />
<br />
= Empirical Results =<br />
<br />
Two sets of experiments were undertaken: <br />
<br />
* The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.<br />
* The MNIST digits dataset was used to assess whether improved error rates in code prediction yield superior performance in recognition tasks. <br />
<br />
Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.<br />
<br />
== Berkeley Image Database ==<br />
<br />
From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Using a sparsity penalty of <math> \alpha = 0.5 </math>, the authors tested performance on dictionaries of sizes ''m'' = 100 and ''m'' = 400.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26786learning Fast Approximations of Sparse Coding2015-11-22T05:41:25Z<p>Derek: /* Learned ISTA & Learned Coordinate Descent */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{k-1} - h_{\theta}^{k - 2}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times <math> T </math>. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code <math> \, Z </math>.<br />
<br />
= Empirical Results =</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26785learning Fast Approximations of Sparse Coding2015-11-22T05:32:47Z<p>Derek: </p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{k-1} - h_{\theta}^{k - 2}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times <math> T </math>. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code <math> \, Z </math>.<br />
<br />
= Empirical Results =</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26784learning Fast Approximations of Sparse Coding2015-11-22T05:31:09Z<p>Derek: /* Learning ISTA & Learning Coordinate Descent */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{k-1} - h_{\theta}^{k - 2}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learned ISTA & Learned Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. <br />
<br />
In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times <math> T </math>. We learn the parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that <math> \, S </math> is shared across the <math> T </math> time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.<br />
<br />
Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters <math> \, W_e </math>, <math> \, \theta </math>, and <math> \, S </math> are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code <math> \, Z </math>.</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26783learning Fast Approximations of Sparse Coding2015-11-22T05:04:36Z<p>Derek: /* Iterative Shrinkage & Thresholding (ISTA) */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{k-1} - h_{\theta}^{k - 2}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learning ISTA & Learning Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders.<br />
<br />
Before understanding the rationale behind this approach, we must recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn <math> \, \theta </math>, <math> \, W_e </math>, <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. <br />
<br />
In Learning ISTA (LISTA), the encoder structure takes the form defined by (**).</div>Derekhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26782learning Fast Approximations of Sparse Coding2015-11-22T05:03:54Z<p>Derek: /* Iterative Shrinkage & Thresholding (ISTA) */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space. <br />
<br />
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.<br />
<br />
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \epsilon \mathbb{R}^n </math>, we seek a new representation <math> Z \epsilon \mathbb{R}^m </math> which satisfies the previously-stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \epsilon \mathbb{R}^{m x n} </math>, the matrix of normalized vectors that the coordinates of <math> \, Z </math> are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 </math>, <br />
<br />
where <math> \, \alpha </math> is the specified sparsity penalty. Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = argmin_Z E(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{k} - X)) = h_{\theta}(W_eX + SZ^{k}) </math> (**),<br />
:: where <math> W_e = \frac{1}{L}W_d^T </math> is the filter matrix, and <math> S = I - \frac{1}{L}W_d^TW_d </math> is the mutual-inhibition matrix.<br />
<br />
Here, <math> L </math> is an upper-bound on the size of the eigenvalues of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = sign(V_i) </math> <math> \, max(|V_i| - \theta_i, </math> <math> \, 0) </math>, where <math> \theta \epsilon \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. Thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, with <math> \, k </math> being the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{k-1} - h_{\theta}^{k - 2}) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.<br />
<br />
== Coordinate Descent ==<br />
<br />
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes <math> \, O(m) </math> operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires <math> \, O(m^2) </math> steps. Alternatively, we could repeat the update process <math> \, O(n) </math> times instead, to achieve a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.<br />
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input <math> \, X \epsilon \mathbb{R}^n </math> and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original <math> \, X </math> as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.<br />
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learning ISTA & Learning Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders.<br />
<br />
To understand the rationale behind this approach, we must first recognize a few relevant values which are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms of parameters including a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. <br />
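<br />
A rough sketch of this idea for the ISTA form (**), with learned parameters and a fixed number of unrolled iterations (the parameterization here is a simplification for illustration, not the authors' exact construction):<br />
<pre>
import numpy as np

def lista_encode(x, W_e, S, theta, n_steps=3):
    """Truncated, learned-ISTA-style encoder: iterate (**) a fixed number of times.

    x       : input in R^n
    W_e     : learned m-by-n filter matrix
    S       : learned m-by-m mutual-inhibition matrix
    theta   : learned per-component thresholds
    n_steps : fixed number of iterations (instead of running ISTA to convergence)
    """
    def shrink(v):
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    b = W_e @ x        # computed once per input
    z = shrink(b)      # first step of (**), with Z^(0) = 0
    for _ in range(n_steps - 1):
        z = shrink(b + S @ z)
    return z
</pre>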
<br />
In Learning ISTA (LISTA), the encoder structure takes the form defined by (**).</div>Derek
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=learning_Fast_Approximations_of_Sparse_Coding&diff=26776learning Fast Approximations of Sparse Coding2015-11-22T04:54:58Z<p>Derek: /* Review of Sparse Coding */</p>
<hr />
<div>= Background =<br />
<br />
In contrast to dimensionality-reduction approaches to feature extraction, sparse coding is an unsupervised method that constructs a novel representation of the data by mapping it linearly into a higher-dimensional space. The transformation is chosen so that, in the new feature space, each vector can be largely reconstructed from a small subset of the features. This amounts to case-specific feature extraction: for a given input, we seek a small subset of features that carries the majority of the weight in its representation, with the remaining weights being negligibly small. The result is a procedure that can flexibly represent unseen instances of the input space. <br />
<br />
The larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible inputs from the original space. However, basic linear algebra tells us that input vectors no longer have a unique representation in the higher-dimensional feature space. This shortcoming is alleviated by the fact that we want to assign the majority of the influence to only a subset of the new features. We implement this goal through the notion of sparsity; namely, we penalize large weight values.<br />
<br />
Unfortunately, some implementation issues prevent the use of sparse coding in certain contexts. Because a new optimization must be solved for each case the system is presented with, the procedure is often infeasible for online processing tasks, where new data must be handled in real time. Several approximation algorithms have been proposed to improve processing speed, but these methods are limited in their ability to account for relevant conditional-independence structure in the data. To resolve these limitations, the authors introduce a feed-forward architecture that adapts some of these approximation schemes, yielding a new procedure demonstrated in empirical testing to be roughly 10 times more efficient than the previous state-of-the-art approximation.<br />
<br />
= Review of Sparse Coding =<br />
<br />
For an input <math> X \in \mathbb{R}^n </math>, we seek a new representation <math> Z \in \mathbb{R}^m </math> which satisfies the previously stated objective. In order to find an optimal code <math> \, Z </math> of <math> \, X </math>, we also require a dictionary <math> W_d \in \mathbb{R}^{n \times m} </math>, the matrix of normalized basis vectors with respect to which the coordinates of <math> \, Z </math> are defined. Given a training set, we estimate the optimal sparse code for each training case, in pursuit of the dictionary matrix to be used in coding novel input.<br />
<br />
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:<br />
<br />
:: <math> E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 \qquad (1) </math>, for some chosen sparsity penalty <math> \alpha </math>.<br />
<br />
Using this, the optimal code for input <math> X </math> is naturally defined as <math> \, Z^* = \arg\min_Z E_{W_d}(X, Z) </math>. <br />
<br />
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss <math> E_{W_d}(X, Z^*) </math> across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.<br />
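<br />
The loss in (1) can be made concrete with a short NumPy sketch. This is only an illustrative evaluation of the objective; the function and variable names are choices made here, and the dictionary-learning loop itself (alternating code inference and gradient steps on <math> W_d </math>) is omitted.<br />
<pre>
import numpy as np

def sparse_coding_loss(X, Z, W_d, alpha):
    """Loss (1): squared reconstruction error plus an L1 sparsity penalty.

    X     : (n,)  input vector
    Z     : (m,)  candidate code
    W_d   : (n, m) dictionary with (approximately) unit-norm columns
    alpha : sparsity penalty weight
    """
    residual = X - W_d @ Z
    return 0.5 * residual @ residual + alpha * np.abs(Z).sum()
</pre>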
<br />
=Pre-existing Approximations: Iterative Shrinkage Algorithms=<br />
<br />
==Iterative Shrinkage & Thresholding (ISTA)==<br />
<br />
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all of the code components in parallel. The idea is that, at each iteration, we take a gradient step that shifts the current code so as to reduce the reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing <math> \, Z^{(0)} = 0 </math>, we have the recursive update rule:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}\left(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)\right) </math><br />
<br />
Here, <math> L </math> is an upper bound on the largest eigenvalue of <math> W_d^TW_d </math>, and <math>\, h_{\theta}( ) </math> is the shrinkage function with components <math> \, h_{\theta}(V)_i = \mathrm{sign}(V_i)\,\max(|V_i| - \theta_i, 0) </math>, where <math> \theta \in \mathbb{R}^m </math> consists of the sparsity thresholds for the components of the code. The thresholds are typically set to <math> \theta_i =\frac{\alpha}{L} </math>.<br />
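<br />
As a concrete illustration, the ISTA iteration above can be written in a few lines of NumPy. This is a minimal sketch, not the authors' code: the names, the fixed iteration count used as a stopping rule, and the spectral-norm computation of <math> L </math> are choices made here for clarity.<br />
<pre>
import numpy as np

def soft_threshold(V, theta):
    """Component-wise shrinkage: h_theta(V)_i = sign(V_i) * max(|V_i| - theta_i, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def ista(X, W_d, alpha, n_iter=100):
    """Plain ISTA: a gradient step on the reconstruction error followed by shrinkage."""
    L = np.linalg.norm(W_d, 2) ** 2        # largest eigenvalue of W_d^T W_d
    theta = alpha / L                      # typical threshold setting
    Z = np.zeros(W_d.shape[1])
    for _ in range(n_iter):
        Z = soft_threshold(Z - (1.0 / L) * W_d.T @ (W_d @ Z - X), theta)
    return Z
</pre>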
<br />
=== Time Complexity & Fast ISTA ===<br />
<br />
Depending on a few implementation choices, the per-iteration time complexity of using ISTA to construct a code for a new input is <math> \, O(m^2) </math>, <math> \, O(nm) </math>, or <math> \, O(km) </math>, where <math> \, k </math> is the average sparsity across samples and iterations.<br />
<br />
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:<br />
<br />
:: <math> Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda \left(h_{\theta}(Z^{(k-1)}) - h_{\theta}(Z^{(k - 2)})\right) </math><br />
<br />
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference between the shrinkage outputs of the preceding two iterations. This second term acts as a momentum term, reflecting the rate at which the approximated code is changing.<br />
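<br />
For reference, the momentum idea is usually implemented via the standard FISTA recursion of Beck & Teboulle (2009), in which the shrinkage step is taken from an extrapolated point. The sketch below follows that standard form rather than the simplified expression written above, and all names are illustrative.<br />
<pre>
import numpy as np

def fista(X, W_d, alpha, n_iter=100):
    """Standard FISTA: an ISTA step taken at an extrapolated point Y, plus a momentum schedule t_k."""
    L = np.linalg.norm(W_d, 2) ** 2
    theta = alpha / L
    m = W_d.shape[1]
    Z_prev = np.zeros(m)
    Y = np.zeros(m)
    t = 1.0
    for _ in range(n_iter):
        G = Y - (1.0 / L) * W_d.T @ (W_d @ Y - X)            # gradient step from the extrapolated point
        Z = np.sign(G) * np.maximum(np.abs(G) - theta, 0.0)  # shrinkage
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0    # momentum schedule
        Y = Z + ((t - 1.0) / t_next) * (Z - Z_prev)          # extrapolation using the last two codes
        Z_prev, t = Z, t_next
    return Z_prev
</pre>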
<br />
== Coordinate Descent ==<br />
<br />
Instead of updating all of the entries of the code in parallel, we might instead strategically select a single component to update at each iteration. Coordinate Descent adopts this strategy and, as a result, yields a better approximation than the parallel ISTA methods in a comparable amount of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.<br />
<br />
In each iteration of the procedure, we search for the entry of the current code which, if updated to minimize the loss while all other components are held fixed, changes the most relative to its current value. This search step takes <math> \, O(m) </math> operations, so once the component-wise optimizations (which follow a process similar to the parallel case) are also accounted for, each iteration requires <math> \, O(m^2) </math> steps. Alternatively, the update process could be repeated <math> \, O(n) </math> times instead, giving a per-iteration complexity of <math> \, O(nm) </math>, which is again comparable to ISTA. Either way, when both are run for roughly the same amount of time, Coordinate Descent out-performs the ISTA methods in the quality of its approximation to the optimal code.<br />
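<br />
A sketch of one common form of this coordinate-descent procedure is given below. It is not the authors' exact algorithm: the scaling of the filter responses and of the mutual-inhibition matrix, as well as the fixed iteration count, are simplifications made here.<br />
<pre>
import numpy as np

def soft_threshold(V, theta):
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def coordinate_descent_code(X, W_d, alpha, n_iter=100):
    """Greedy coordinate descent: repeatedly update only the code component whose
    one-dimensional optimal update changes the most, propagating that change
    through a mutual-inhibition matrix S."""
    m = W_d.shape[1]
    B = W_d.T @ X                        # filter responses
    S = np.eye(m) - W_d.T @ W_d          # mutual-inhibition matrix
    Z = np.zeros(m)
    for _ in range(n_iter):
        Z_bar = soft_threshold(B, alpha)
        k = np.argmax(np.abs(Z_bar - Z))            # component with the largest change
        B = B + S[:, k] * (Z_bar[k] - Z[k])         # O(m) update of the filter responses
        Z[k] = Z_bar[k]
    return soft_threshold(B, alpha)
</pre>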
<br />
= Encoders for Sparse Code Approximation =<br />
<br />
In seeking an approach that further improves upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. For the training phase, we learn a neural network that takes the original input <math> \, X \in \mathbb{R}^n </math> and predicts its optimal code with respect to the previously estimated dictionary. The training set consists of the original inputs <math> \, X </math>, with their sparse codes estimated via Coordinate Descent as the target values. The network weights are learned by stochastic gradient descent, minimizing the average squared error between the network's predictions and these estimated sparse codes. The size of the network is chosen with its feasibility for online processing in mind.<br />
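<br />
Concretely, the encoder's training pairs could be assembled as in the short sketch below, which reuses the illustrative coordinate_descent_code function from the previous section; the names and the batching are assumptions made here.<br />
<pre>
import numpy as np

def build_code_targets(X_train, W_d, alpha, n_iter=100):
    """Build encoder training targets: for each input x_i, its code Z_i*
    estimated by the coordinate_descent_code sketch above.

    X_train : (N, n) array of inputs
    W_d     : (n, m) pre-learned dictionary
    """
    return np.stack([coordinate_descent_code(x, W_d, alpha, n_iter) for x in X_train])
</pre>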
<br />
==A Simplistic Architecture and its Limitations==<br />
<br />
The most straightforward approach to this task is to use a single-layer feed-forward network. However, since the network output is additionally required to be sparse, special consideration must be given to the activation function. The authors consider three candidates: a double tanh, a learned non-linearity, and the shrinkage function <math> \, h_{\theta}( ) </math> used for ISTA. The three perform comparably in the authors' empirical testing, so they opt for <math> \, h_{\theta}( ) </math> to maintain a strong basis for comparison with the previous methods.<br />
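<br />
A minimal sketch of such a single-layer encoder, trained by plain SGD on the squared error to the Coordinate Descent codes, is shown below. The initialization, learning rate, and sub-gradient treatment of the shrinkage are choices made here, not details taken from the paper.<br />
<pre>
import numpy as np

def soft_threshold(V, theta):
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def train_single_layer_encoder(X_train, Z_targets, lr=0.01, epochs=10):
    """Single-layer encoder Z_hat = h_theta(W_e X); W_e and theta are learned by SGD
    on the squared error to the Coordinate Descent code targets."""
    n = X_train.shape[1]
    m = Z_targets.shape[1]
    rng = np.random.default_rng(0)
    W_e = 0.01 * rng.standard_normal((m, n))
    theta = 0.1 * np.ones(m)
    for _ in range(epochs):
        for x, z_star in zip(X_train, Z_targets):
            v = W_e @ x
            z_hat = soft_threshold(v, theta)
            err = z_hat - z_star                          # dL/dz_hat for L = 0.5*||z_hat - z*||^2
            active = (np.abs(v) > theta).astype(float)    # sub-gradient of the shrinkage w.r.t. v
            grad_pre = err * active                       # dL/dv
            W_e -= lr * np.outer(grad_pre, x)             # dL/dW_e
            theta -= lr * (-np.sign(v) * grad_pre)        # dL/dtheta
            theta = np.maximum(theta, 1e-6)               # keep thresholds positive (simplification)
    return W_e, theta
</pre>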
<br />
Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, if the learned weight matrix contains two highly similar rows, the network will always assign nearly equal values to the two corresponding code components. Its inability to select just one of these components and suppress the other, redundant one is a clear limitation on the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.<br />
<br />
== Learning ISTA & Learning Coordinate Descent ==<br />
<br />
To address explaining-away phenomena, interaction terms between the code components are introduced. To implement these terms, which control redundancy among the components, the authors design feed-forward networks structured to be analogous to executing a pre-determined number of steps of ISTA or of Coordinate Descent. Basing the network on ISTA versus Coordinate Descent yields two distinct encoders.<br />
<br />
To understand the rationale behind this approach, we must first recognize a few relevant quantities that are automatically fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated in both procedures can be re-expressed in terms of a threshold vector <math> \, \theta </math>, a filter matrix <math> \, W_e </math>, and a mutual-inhibition matrix <math> \, S </math>, where these terms are defined differently for the two procedures. Instead of using these fully determined parameter forms and iterating the procedure until convergence, the encoder-driven approach proposes to learn <math> \, \theta </math>, <math> \, W_e </math>, and <math> \, S </math>, and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. <br />
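<br />
The sketch below shows what the forward pass of the resulting unrolled encoder can look like for the ISTA-based variant. The exact recurrence and parameterization here are one common presentation and may differ in detail from the authors'; the training loop (backpropagation through the unrolled steps against the Coordinate Descent targets) is omitted.<br />
<pre>
import numpy as np

def soft_threshold(V, theta):
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def lista_forward(X, W_e, S, theta, T=3):
    """Unrolled ISTA-style encoder: T feed-forward steps whose parameters
    (filter matrix W_e, mutual-inhibition matrix S, thresholds theta)
    are learned rather than fixed by the dictionary."""
    B = W_e @ X                   # filter responses
    Z = soft_threshold(B, theta)  # initial code estimate
    for _ in range(T):
        Z = soft_threshold(B + S @ Z, theta)  # interaction term S Z suppresses redundant components
    return Z
</pre>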
<br />
These procedures, known as Learning ISTA (LISTA) and Learning Coordinate Descent (LCOD), therefore unroll a fixed number of iterations into a feed-forward encoder whose parameters are trained to best predict the optimal sparse codes.</div>Derek