supervised Dictionary Learning

From statwiki
Revision as of 08:45, 30 August 2017 by Conversion script (talk | contribs) (Conversion script moved page Supervised Dictionary Learning to supervised Dictionary Learning: Converting page titles to lowercase)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

This paper proposes a novel discriminative formulation for sparse representation of images using learned dictionaries.

Introduction

Sparse models originated from two different communities under two different names, one by neurologists mainly by the salient work done by Olshausen in <ref name="Olshausen1996">B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, vol. 381, pp. 607-609, 1996.</ref> as sparse coding (SC), and second by researchers in the field of signal processing as independent component analysis (ICA) (for example see <ref name="ICABook">A. Hyvärinen, J. Karhunen, and E. Oja. Independent component analysis. New York: John Wiley and Sons, 2001.</ref> for a comprehensive overview of ICA). Although SC and ICA originated from two different problems (the former as the model of simple cells in visual cortex and the latter as a solution for decomposing the independent sources of some mixed signals), they merged, eventually, into similar technique (with a somewhat different description). Unlike principal component analysis (PCA) decompositions, these models are in general overcomplete, i.e. the number of basis elements are in general greater than the dimension of the data. A paper by Lewicki et al. that details how to learn overcomplete representations can be found here. Recent research has shown that sparsity helps to capture higher-order correlation in data. For example, [3, 4] of the original paper by Mairal et al. on supervised dictionary learning discussed how sparse decompositions, in conjunction with predefined dictionaries, were applied to face and signal recognition.
On the other hand, the representation of a signal using a learned dictionary instead of predefined operators (such as wavelets in signal and image processing or local binary patterns (LBP) in texture classification) has led to state-of-the-art results in various applications such as denoising <ref name="Elad2006">M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. IP, vol. 54, no. 12, 2006.</ref> and texture classification <ref name="VZ2009">M. Varma and A. Zisserman. A statistical approach to material classification using image patch exemplars. IEEE Trans. PAMI, vol. 31, no. 11, pp. 2032-2047, 2009.</ref>.
It is well known that sparsity captures the higher order statistics of data. For example, in comparing PCA and ICA, while PCA can only capture up to the second order statistics of the data and hence is appropriate for Gaussian models, ICA can capture higher order statistics of the data. Whitening data is a preprocessing step in ICA and ICA is, hence, appropriate for supergaussian models (such as data with Laplacian distributions) <ref name="ICABook"/>.
The previous work in the literature on sparse representation is done on either predefined (fixed) operators or learned dictionaries for reconstructive, discriminative, or generative models in various applications such as signal and face recognition <ref>K. Huang and S. Aviyente. Sparse representation for signal classification. In NIPS, 2006.</ref><ref>J.Wright, A.Y. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. PAMI, vol.31, no. 2, pp. 210-227, 2009.</ref><ref>R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML, 2007.</ref><ref>J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Learning discriminative dictionaries for local image analysis. In CVPR, 2008.</ref><ref>M. Ranzato and M. Szummer. Semi-supervised learning of compact document representations with deep networks. In ICML, 2008.</ref>.
In this paper, the authors extend these approaches by proposing a framework for learning simultaneously a single shared dictionary as well as sparse models (for all classes) in a mixed generative and discriminative formulation. Although, this joint generative/discriminative framework have been also reported in probabilistic approaches and in neural networks, but not in sparse dictionary learning.

Sparse Representation and Dictionary Learning

Sparse representation can be exploited in two different ways. First, by representing a signal as a linear combination of predefined bases such as wavelets <ref name="MallatSparseBook">S. Mallat. A wavelet tour of signal processing, the sparse way. Burlington: Academic Press, 3rd ed., 2009.</ref>. Second, by using a dictionary of primitive elements learned from the signal and decomposing the signal into these primitive elements <ref name="Julesz1981">B. Juelsz. Textons, the elements of texture-perception, and their interactions. Nature, vol. 290, pp. 91–97, 1981.</ref><ref name="Olshausen1996"/><ref name="VZ2009"/>. There are two steps in the latter approach, i.e., learning the dictionary and computation of (sparse) coefficients for representing the signal using the elements of dictionary.
In classical sparse coding tasks, the goal is to reconstruct a signal [math]\displaystyle{ \,x \in R^n }[/math] by using a fixed dictionary [math]\displaystyle{ D=[d_{1},...,d_{k}]\in \mathbb{R}^{n\times k} }[/math]. Regarding this dictionary [math]\displaystyle{ \,D }[/math], [math]\displaystyle{ \,k \gt n }[/math] so as to make the dictionary overcomplete. Using [math]\displaystyle{ \,l_1 }[/math] regularization (details of which provided in Y. Albert Park's lecture notes regarding it can be found here), a signal [math]\displaystyle{ \mathbf{x}\in \mathbb{R}^{n} }[/math] can be reconstructed using sparse coefficients [math]\displaystyle{ \mathbf{\alpha} }[/math] using

[math]\displaystyle{ \mathcal{R^{\star}}(\mathbf{x},D)=\underset{\mathbf{\alpha}\in \mathbb{R}^{k}}{\min}\left \| \mathbf{x}-D\mathbf{\alpha} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha} \right \|_{1}, \;\;\;(1) }[/math]


The [math]\displaystyle{ \,l_{1} }[/math] penalty yields a sparse solution for coefficients [math]\displaystyle{ \,\mathbf{\alpha} }[/math]. Other sparsity penalties such as the [math]\displaystyle{ \,l_0 }[/math] penalty can be used as well. However, since the [math]\displaystyle{ \,l_1 }[/math] penalty uses a proper norm, its formulation of sparse coding is a convex problem. As a result, the [math]\displaystyle{ \,l_{1} }[/math] penalty makes the optimization tractable for most algorithms. It should be noted that, on the other hand, sparse coding done using the [math]\displaystyle{ \,l_0 }[/math] penalty is an NP-hard problem and, as a result, it must typically be approximated by using a greedy algorithm. Furthermore, the [math]\displaystyle{ \,l_{1} }[/math] penalty has proven in practice to be very stable, in that the resulting decomposition would only be affected slightly should the input signal [math]\displaystyle{ \,x }[/math] be perturbed. Due to its many benefits, the [math]\displaystyle{ \,l_{1} }[/math] penalty is typically used as default for reconstructing signals in sparse coding tasks.
In (1), the dictionary is fixed and the goal is to find the sparse coefficients [math]\displaystyle{ \mathbf{\alpha} }[/math] such that [math]\displaystyle{ \mathbf{x} }[/math] can be reconstructed using the bases in the dictionary. The dictionary can be learned using m training data [math]\displaystyle{ (\mathbf{x}_{i})_{i=1}^{m} }[/math] in [math]\displaystyle{ \mathbb{R}^{n} }[/math]. Hence, (1) can be modified as follows:

[math]\displaystyle{ \underset{D,\mathbf{\alpha}}{\min}\sum_{i=1}^{m}\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha}_{i} \right \|_{1}. \;\;\;(2) }[/math]


Since the reconstruction errors [math]\displaystyle{ \left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2} }[/math] in (2) are invariant to scaling dictionary [math]\displaystyle{ \mathbf{\mathit{D}} }[/math] by a scalar and coefficients [math]\displaystyle{ \mathbf{\alpha}_{i} }[/math] by its inverse, the [math]\displaystyle{ \ell_{2} }[/math] norm of the columns of [math]\displaystyle{ \mathbf{\mathit{D}} }[/math] should be constrained <ref name="Elad2006"/>. This reconstructive framework is called REC in this paper.

Supervised Dictionary Learning

In this paper, initially a binary classification task is considered and then the proposed approach is extended to multiclass problems. In two-class problem, using signals [math]\displaystyle{ (\mathbf{x}_{i})_{i=1}^{m} }[/math] and their corresponding binary labels [math]\displaystyle{ (y_{i}\in\left \{ -1,+1 \right \})_{i=1}^{m} }[/math], the dictionary [math]\displaystyle{ \mathbf{\mathit{D}} }[/math] adapted to the classification task and a function [math]\displaystyle{ \,f }[/math] that has positive values for all signals in class [math]\displaystyle{ \,1 }[/math] and negative values for all signals in class [math]\displaystyle{ \,0 }[/math] are learned. Both linear and bilinear models are considered in this paper. In linear (L) model

[math]\displaystyle{ f(\mathbf{x},\mathbf{\alpha},\mathbf{\theta})=\mathbf{w}^{T}\mathbf{\alpha}+b, \;\;\;(3) }[/math]


where [math]\displaystyle{ \mathbf{\theta}=\left \{\mathbf{w}\in\mathbb{R}^{k},b\in\mathbb{R} \right \} }[/math]; whereas in bilinear (BL) model

[math]\displaystyle{ f(\mathbf{x},\mathbf{\alpha},\mathbf{\theta})=\mathbf{x}^{T}\mathbf{W}\mathbf{\alpha}+b, \;\;\;(4) }[/math]


where [math]\displaystyle{ \mathbf{\theta}=\left \{\mathbf{W}\in\mathbb{R}^{n\times k},b\in\mathbb{R} \right \} }[/math].


It should be noted that the bilinear model contains more parameters than the linear model, and it is thus a richer model.


We shall now take a look at how the linear and bilinear models can be interpreted. As explained in detail in Mairal et al. 's paper on supervised dictionary learning, the linear model has a probabilistic interpretation. This is true because the mixed approach (discussed below) is a classical trade-off between generative learning and discriminative learning, in which generative components are often added to discriminative frameworks in order to make the learned model more robust to factors such as noise and occlusions. Also, as explained in detail in Mairal et al. 's paper, the bilinear model has a kernel interpretation. This is true because, given any two signals [math]\displaystyle{ \,x_1 }[/math] and [math]\displaystyle{ \,x_2 }[/math] having coefficients [math]\displaystyle{ \,\alpha_1 }[/math] and [math]\displaystyle{ \,\alpha_2 }[/math], if the kernel [math]\displaystyle{ \,K(x_1, x_2) = \alpha_1^T \alpha_2 x_1^T x_2 }[/math] is used in a logistic regression classifier, then it can shown that the resulting decision function would have the same form as the function [math]\displaystyle{ \,f }[/math] in [math]\displaystyle{ \,(4) }[/math].


The supervised dictionary learning (SDL) can be performed in three different approaches, i.e., reconstructive, generative, and discriminative approaches.

A classical approach to obtaining [math]\displaystyle{ \,\alpha }[/math] for either the linear model or the bilinear model is to first adapt the dictionary [math]\displaystyle{ \,D }[/math] to the data. This is done by solving

[math]\displaystyle{ \underset{\mathbf{D,\alpha}}{\min}\sum_{i=1}^{m} ||x_i - D \alpha_i||_{2}^{2} + \lambda_1 ||\alpha_i||_1 }[/math]

. Note that since the reconstruction errors [math]\displaystyle{ \,||x_i - D \alpha_i||_{2}^{2} }[/math] are invariant to simultaneously scaling [math]\displaystyle{ \,D }[/math] by a scalar and scaling [math]\displaystyle{ \,\alpha_i }[/math] by its inverse, it is necessary to constrain the [math]\displaystyle{ \,l_2 }[/math] norm of the columns of [math]\displaystyle{ \,D }[/math]. In fact, this is a classical constraint in the area of sparse coding.

Reconstructive Approach

In reconstructive (REC) approach <ref name="Elad2006"/>, the dictionary [math]\displaystyle{ \mathbf{\mathit{D}} }[/math] and the coefficients [math]\displaystyle{ \mathbf{\alpha}_{i} }[/math] are learned using (2). The parameters [math]\displaystyle{ \mathbf{\theta} }[/math] are learned afterwords by solving

[math]\displaystyle{ \underset{\mathbf{\theta}}{\min}\sum_{i=1}^{m}\mathcal{C}(y_{i}f(\mathbf{x}_{i},\mathbf{\alpha}_{i},\mathbf{\theta}))+\lambda_{2}\left \| \mathbf{\theta} \right \|_{2}^{2}, \;\;\;(5) }[/math]


where [math]\displaystyle{ \mathcal{C} }[/math] is the logistic loss function, i.e., [math]\displaystyle{ \mathcal{C}(x)=log(1+e^{-x}) }[/math] and [math]\displaystyle{ \,\lambda_{2} }[/math] is a regularization parameter to prevent overfitting. Note that the resulting sparse codes [math]\displaystyle{ \,\alpha_i }[/math]'s, one for each signal [math]\displaystyle{ \,x_i }[/math], can be used a posteriori in a regular classifier such as logistic regression.

Generative Approach

The supervised dictionary learning using generative approach (DSL-G) learns jointly [math]\displaystyle{ \mathbf{\mathit{D}} }[/math], [math]\displaystyle{ \mathbf{\theta} }[/math], and [math]\displaystyle{ \mathbf{\alpha} }[/math] by solving

[math]\displaystyle{ \underset{D,\mathbf{\theta},\mathbf{\alpha}}{\min}(\sum_{i=1}^{m}\mathcal{C}(y_{i}f(\mathbf{x}_{i},\mathbf{\alpha}_{i},\mathbf{\theta}))+\lambda_{0}\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha}_{i} \right \|_{1})+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}, \;\;\;(6) }[/math]


where [math]\displaystyle{ \,\lambda_{0} }[/math] controls the importance of the reconstruction term. The classification procedure involves supervised sparse coding

[math]\displaystyle{ \underset{y\in\left \{ -1;+1 \right \}}{\min}\mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,y) , \;\;\;(7) }[/math]


with

[math]\displaystyle{ \mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})=\underset{\mathbf{\alpha}}{\min} \mathcal{S}(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}), \;\;\;(8) }[/math]


where, [math]\displaystyle{ \mathcal{S}(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})=\mathcal{C}(y_{i}f(\mathbf{x}_{i},\mathbf{\alpha}_{i},\mathbf{\theta}))+\lambda_{0}\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha}_{i} \right \|_{1} }[/math].

The learning procedure in (6) minimizes the sum of the costs for the pairs [math]\displaystyle{ (\mathbf{x}_{i},y_{i})_{i=1}^{m} }[/math] and corresponds to a generative model.

Discriminative Approach

Although in (7), the different costs [math]\displaystyle{ \mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,y) }[/math] of a given signal are compared for each class [math]\displaystyle{ \,y= -1, +1 }[/math], a more discriminative approach is to make the value of [math]\displaystyle{ \mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,-y_{i}) }[/math] greater than [math]\displaystyle{ \mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,y_{i}) }[/math], which is the purpose of the logistic loss function [math]\displaystyle{ \mathcal{C} }[/math]. This leads to

[math]\displaystyle{ \underset{D,\mathbf{\theta}}{\min}\sum_{i=1}^{m}(\mathcal{C}(\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,-y_{i})-\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})))+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}. \;\;\;(9) }[/math]


However, a mixed approach of generative formulation given in (6) and its discriminative version in (9) is easier to be solved. Hence, in this paper a generative /discriminative model is proposed for sparse signal representation and classification from the learned dictionary [math]\displaystyle{ \mathbf{\mathit{D}} }[/math] and model [math]\displaystyle{ \mathbf{\theta} }[/math] as follows

[math]\displaystyle{ \underset{D,\mathbf{\theta}}{\min}\sum_{i=1}^{m}(\mu \mathcal{C}(\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,-y_{i})-\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}))+(1-\mu)\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}))+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}, \;\;\;(10) }[/math]


where [math]\displaystyle{ \,\mu }[/math] controls the trade-off between reconstruction and discrimination terms. Hereafter, this mixed model is referred to as supervised dictionary learning-discriminative (SDL-D) model. The same as before, constraint is imposed on [math]\displaystyle{ \mathbf{\mathit{D}} }[/math] such that [math]\displaystyle{ \forall j, \left \| \mathbf{d}_{j}\leq 1 \right \|_{2} }[/math], i.e. we constrain the norm of the columns of [math]\displaystyle{ \,D }[/math] to be at most 1.

Multiclass Extension

The extension of all these formulations to multiclass problems is straightforward and can be done using softmax discriminative cost functions [math]\displaystyle{ \mathcal{C}_{i}(x_{1},...,x_{p})=\log(\sum_{j=1}^{p}e^{x_{j}-x_{i}}) }[/math], which are multiclass versions of the logistic function and by learning one model [math]\displaystyle{ \mathbf{\theta}_{i} }[/math] per class. Compared to the earlier approach that learns one dictionary for each class, this model has the advantage of letting multiple classes share some features, and it also uses the coefficients of the sparse representations (the [math]\displaystyle{ \,\alpha_i }[/math]'s) within the classification procedure itself. Other possible approaches include the one-vs-all and one-vs-one techniques, though there is still no hard rule regarding how to select the best approach given the context of the problem.

Optimization Procedure

In classical dictionary learning techniques, a dictionary [math]\displaystyle{ \,D \in \mathbb{R}^{n\times k} }[/math] is learned so that it adapts to a training set. This problem is shown in [math]\displaystyle{ \,(5) }[/math] above, and it is an optimization problem with respect to the dictionary [math]\displaystyle{ \,D }[/math] and the coefficients [math]\displaystyle{ \,\alpha }[/math]. Even though this problem is not jointly convex simultaneously with respect to both [math]\displaystyle{ \,D }[/math] and [math]\displaystyle{ \,\alpha }[/math], it is convex with respect to either unknown when the other unknown is held fixed. As a result of this property of the optimization problem, block coordinate descent gives good results, although the resulting optima may not necessarily be global optima. Block coordinate descent can be applied to the generative approach that is discussed above, where it consists of iterating between supervised sparse coding (in which both [math]\displaystyle{ \,D }[/math] and [math]\displaystyle{ \,\theta }[/math] are held fixed while [math]\displaystyle{ \,\alpha }[/math] is being optimized) and supervised dictionary update (in which [math]\displaystyle{ \,\alpha }[/math] is held fixed while both [math]\displaystyle{ \,D }[/math] and [math]\displaystyle{ \,\theta }[/math] are being updated). However, the discriminative version of the problem, which is SDL-D discussed above, is more problematic than the generative approach because, in order to find a local minimum for this difficult non-convex optimization problem, it is often necessary to use a continuation method that starts from the generative case (which is [math]\displaystyle{ \,(6) }[/math] given above) and ends with the discriminative case (which is [math]\displaystyle{ \,(9) }[/math] given above).


In the following, the algorithm for supervised dictionary learning is presented

Input: [math]\displaystyle{ \,n }[/math] (signal dimension); [math]\displaystyle{ (\,\mathbf{x}_{i},y_{i})_{i=1}^{m} }[/math] (training signals); [math]\displaystyle{ \,k }[/math] (size of the dictionary); [math]\displaystyle{ \,\lambda_{0}, \lambda_{1}, \lambda_{2} }[/math] (parameters); [math]\displaystyle{ 0\leq\mu_{1}\leq\mu_{2}\leq...\leq\mu_{m}\leq1 }[/math] (increasing sequence).
Output: [math]\displaystyle{ D\in \mathbb{R}^{n\times k} }[/math] (dictionary); [math]\displaystyle{ \mathbf{\theta} }[/math] (model).
Initialization: Set [math]\displaystyle{ \mathbf{\mathit{D}} }[/math] to a random Gaussian matrix with normalized columns. Set [math]\displaystyle{ \mathbf{\theta} }[/math] to zero.
Loop: For [math]\displaystyle{ \,\mu=\mu_{1},...,\mu_{m}, }[/math]
Loop: Repeat until convergence (or a fixed number of iterations),
Supervised sparse coding: Solve, for all [math]\displaystyle{ \,i=1,..., m }[/math]

[math]\displaystyle{ \left\{\begin{matrix} \mathbf{\alpha}_{i,-}^{\star} =\arg\min_{\mathbf{\alpha}}\mathcal{S}^{\star }(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,-1)\\ \mathbf{\alpha}_{i,+}^{\star} =\arg\min_{\mathbf{\alpha}}\mathcal{S}^{\star }(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,+1) \end{matrix}\right.. \;\;\;(11) }[/math]


Dictionary and model update: Solve

[math]\displaystyle{ \underset{D,\mathbf{\theta}}{\min}(\sum_{i=1}^{m}\mu \mathcal{C}(\mathcal{S}(\mathbf{\alpha}_{i,-}^{\star},\mathbf{x}_{i},D,\mathbf{\theta} ,-y_{i})-\mathcal{S}(\mathbf{\alpha}_{i,+}^{\star},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}))+(1-\mu)\mathcal{S}(\mathbf{\alpha}_{i,y_{i}}^{\star},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}) \;\;\mathbf{s.t.}\; \forall j,\left \| \mathbf{d}_{j} \right \|_{2}\leq 1. \;\;\;(12) }[/math]


However, it should be noted that, with the exception when [math]\displaystyle{ \,\mu }[/math] is close to [math]\displaystyle{ \,0 }[/math], the problem of updating [math]\displaystyle{ \,D }[/math] and [math]\displaystyle{ \,\theta }[/math] in [math]\displaystyle{ \,(12) }[/math] is in general not convex. Nevertheless, a local minimum can be obtained by using projected gradient descent. Experimental results have also shown that a local minimum found in this way is usually good enough in terms of classification performance. Details regarding dictionary update can be found under Section 4.2 of Mairal et al.'s paper on supervised dictionary learning that is also listed in References.

References

<references />

12. Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, Andrew Zisserman. Supervised Dictionary Learning.