Supervised Dictionary Learning

This paper proposes a novel discriminative formulation for sparse representation of images using learned dictionaries.

Introduction

Sparse models originated in two different communities under two different names: in neuroscience as sparse coding (SC), mainly through the seminal work of Olshausen and Field <ref name="Olshausen1996">B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, vol. 381, pp. 607-609, 1996.</ref>, and in signal processing as independent component analysis (ICA) (see <ref name="ICABook">A. Hyvärinen, J. Karhunen, and E. Oja. Independent component analysis. New York: John Wiley and Sons, 2001.</ref> for a comprehensive overview of ICA). Although SC and ICA arose from two different problems (the former as a model of simple cells in the visual cortex and the latter as a method for separating the independent sources of mixed signals), they eventually merged into similar techniques with somewhat different descriptions. Unlike principal component analysis (PCA) decompositions, these models are in general overcomplete, i.e. the number of basis elements is in general greater than the dimension of the data. Lewicki et al. describe in detail how to learn overcomplete representations. Recent research has shown that sparsity helps to capture higher-order correlations in data. For example, references [3, 4] of the original paper by Mairal et al. on supervised dictionary learning discuss how sparse decompositions, in conjunction with predefined dictionaries, were applied to face and signal recognition.
On the other hand, representation of a signal using a learned dictionary instead of predefined operators (such as wavelets in signal and image processing or local binary patterns (LBP) in texture classification) has led to state-of-the-art results in various applications such as denoising <ref name="Elad2006">M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. IP, vol. 54, no. 12, 2006.</ref> and texture classification <ref name="VZ2009">M. Varma and A. Zisserman. A statistical approach to material classification using image patch exemplars. IEEE Trans. PAMI, vol. 31, no. 11, pp. 2032-2047, 2009.</ref>.
It is well known that sparsity captures higher-order statistics of the data. For example, PCA can only capture up to second-order statistics of the data and hence is appropriate for Gaussian models, whereas ICA can capture higher-order statistics. Whitening the data, which removes second-order correlations, is a preprocessing step in ICA; ICA is hence appropriate for supergaussian models (such as data with Laplacian distributions) <ref name="ICABook"/>.
The previous work in the literature on sparse representation is done on either predefined (fixed) operators or learned dictionaries for reconstructive, discriminative, or generative models in various applications such as signal and face recognition <ref>K. Huang and S. Aviyente. Sparse representation for signal classification. In NIPS, 2006.</ref><ref>J.Wright, A.Y. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Trans. PAMI, vol.31, no. 2, pp. 210-227, 2009.</ref><ref>R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML, 2007.</ref><ref>J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Learning discriminative dictionaries for local image analysis. In CVPR, 2008.</ref><ref>M. Ranzato and M. Szummer. Semi-supervised learning of compact document representations with deep networks. In ICML, 2008.</ref>.
In this paper, the authors extend these approaches by proposing a framework for simultaneously learning a single shared dictionary and sparse models (for all classes) in a mixed generative and discriminative formulation. Such joint generative/discriminative frameworks have been reported in probabilistic approaches and in neural networks, but not in sparse dictionary learning.

Sparse Representation and Dictionary Learning

Sparse representation can be exploited in two different ways: first, by representing a signal as a linear combination of predefined bases such as wavelets <ref name="MallatSparseBook">S. Mallat. A wavelet tour of signal processing, the sparse way. Burlington: Academic Press, 3rd ed., 2009.</ref>; second, by using a dictionary of primitive elements learned from the signal and decomposing the signal into these primitive elements <ref name="Julesz1981">B. Julesz. Textons, the elements of texture-perception, and their interactions. Nature, vol. 290, pp. 91–97, 1981.</ref><ref name="Olshausen1996"/><ref name="VZ2009"/>. The latter approach has two steps: learning the dictionary, and computing the (sparse) coefficients that represent the signal using the elements of the dictionary.
In classical sparse coding tasks, the goal is to reconstruct a signal $\mathbf{x}\in \mathbb{R}^{n}$ using a fixed dictionary $D=[\mathbf{d}_{1},...,\mathbf{d}_{k}]\in \mathbb{R}^{n\times k}$, where $\,k \gt n$ so that the dictionary is overcomplete. Using $\,l_1$ regularization (covered in Y. Albert Park's lecture notes), the signal $\mathbf{x}$ can be reconstructed with sparse coefficients $\mathbf{\alpha}$ by solving

$\mathcal{R^{\star}}(\mathbf{x},D)=\underset{\mathbf{\alpha}\in \mathbb{R}^{k}}{\min}\left \| \mathbf{x}-D\mathbf{\alpha} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha} \right \|_{1}, \;\;\;(1)$

The $\,l_{1}$ penalty yields a sparse solution for coefficients $\,\mathbf{\alpha}$. Other sparsity penalties such as the $\,l_0$ penalty can be used as well. However, since the $\,l_1$ penalty uses a proper norm, its formulation of sparse coding is a convex problem. As a result, the $\,l_{1}$ penalty makes the optimization tractable for most algorithms. It should be noted that, on the other hand, sparse coding done using the $\,l_0$ penalty is an NP-hard problem and, as a result, it must typically be approximated by using a greedy algorithm. Furthermore, the $\,l_{1}$ penalty has proven in practice to be very stable, in that the resulting decomposition would only be affected slightly should the input signal $\,x$ be perturbed. Due to its many benefits, the $\,l_{1}$ penalty is typically used as default for reconstructing signals in sparse coding tasks.
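As an illustration of why the convex $\,l_1$ formulation is tractable, the following is a minimal sketch that solves (1) with the iterative shrinkage-thresholding algorithm (ISTA). ISTA is only one of several possible solvers and is not claimed to be the one used in the paper; the function names, the random dictionary, and the iteration count are assumptions made for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage towards zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code_ista(x, D, lam1, n_iter=500):
    """Approximately minimise ||x - D a||_2^2 + lam1 * ||a||_1 over a (problem (1))."""
    # Lipschitz constant of the gradient of the quadratic term: 2 * sigma_max(D)^2.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - x)               # gradient of the reconstruction error
        a = soft_threshold(a - grad / L, lam1 / L)   # gradient step, then shrinkage
    return a

# Hypothetical toy data: an overcomplete dictionary (k > n) with unit-norm columns.
rng = np.random.default_rng(0)
n, k = 8, 16
D = rng.standard_normal((n, k))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(k)
a_true[[2, 11]] = [1.5, -2.0]
x = D @ a_true
a_hat = sparse_code_ista(x, D, lam1=0.1)
```

The shrinkage step is what sets small coefficients exactly to zero, which is how the $\,l_1$ penalty produces sparse codes.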
In (1), the dictionary is fixed and the goal is to find the sparse coefficients $\mathbf{\alpha}$ such that $\mathbf{x}$ can be reconstructed using the bases in the dictionary. The dictionary itself can instead be learned from $\,m$ training signals $(\mathbf{x}_{i})_{i=1}^{m}$ in $\mathbb{R}^{n}$. Hence, (1) can be modified as follows:

$\underset{D,\mathbf{\alpha}}{\min}\sum_{i=1}^{m}\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha}_{i} \right \|_{1}. \;\;\;(2)$

Since the reconstruction errors $\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}$ in (2) are invariant to scaling dictionary $\mathbf{\mathit{D}}$ by a scalar and coefficients $\mathbf{\alpha}_{i}$ by its inverse, the $\ell_{2}$ norm of the columns of $\mathbf{\mathit{D}}$ should be constrained <ref name="Elad2006"/>. This reconstructive framework is called REC in this paper.
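The need for the norm constraint can be checked numerically. The sketch below shows that scaling $D$ by a constant and the coefficients by its inverse leaves the reconstruction error unchanged while shrinking the $\ell_1$ penalty, so the unconstrained objective in (2) could be lowered simply by inflating $D$. The helper `project_columns` is a hypothetical name for one common way to enforce the constraint (rescaling any column whose norm exceeds 1).

```python
import numpy as np

def project_columns(D):
    """Rescale any column whose l2 norm exceeds 1 back onto the unit ball."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 7))
alpha = rng.standard_normal(7)
x = rng.standard_normal(5)

c = 10.0
err = np.sum((x - D @ alpha) ** 2)
err_scaled = np.sum((x - (c * D) @ (alpha / c)) ** 2)   # same reconstruction error
l1 = np.abs(alpha).sum()
l1_scaled = np.abs(alpha / c).sum()                     # penalty shrinks by a factor c

D_proj = project_columns(D)                             # all column norms are now <= 1
```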

Supervised Dictionary Learning

In this paper, a binary classification task is considered first, and the proposed approach is then extended to multiclass problems. In the two-class problem, using signals $(\mathbf{x}_{i})_{i=1}^{m}$ and their corresponding binary labels $(y_{i}\in\left \{ -1,+1 \right \})_{i=1}^{m}$, the goal is to learn both a dictionary $\mathbf{\mathit{D}}$ adapted to the classification task and a function $\,f$ that takes positive values for signals in class $\,+1$ and negative values for signals in class $\,-1$. Both linear and bilinear models are considered in this paper. In the linear (L) model,

$f(\mathbf{x},\mathbf{\alpha},\mathbf{\theta})=\mathbf{w}^{T}\mathbf{\alpha}+b, \;\;\;(3)$

where $\mathbf{\theta}=\left \{\mathbf{w}\in\mathbb{R}^{k},b\in\mathbb{R} \right \}$; whereas in the bilinear (BL) model,

$f(\mathbf{x},\mathbf{\alpha},\mathbf{\theta})=\mathbf{x}^{T}\mathbf{W}\mathbf{\alpha}+b, \;\;\;(4)$

where $\mathbf{\theta}=\left \{\mathbf{W}\in\mathbb{R}^{n\times k},b\in\mathbb{R} \right \}$.

It should be noted that the bilinear model contains more parameters than the linear model, and it is thus a richer model.
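To make the difference concrete, here is a small sketch of the two decision functions (3) and (4) together with their parameter counts. The function names and the sizes `n` and `k` are chosen only for illustration.

```python
import numpy as np

def f_linear(alpha, w, b):
    """Linear model (3): f = w^T alpha + b."""
    return float(w @ alpha + b)

def f_bilinear(x, alpha, W, b):
    """Bilinear model (4): f = x^T W alpha + b."""
    return float(x @ W @ alpha + b)

n, k = 4, 6                      # hypothetical signal dimension and dictionary size
n_params_linear = k + 1          # w in R^k plus the bias b
n_params_bilinear = n * k + 1    # W in R^{n x k} plus the bias b
```

With these sizes the linear model has 7 parameters and the bilinear model has 25, which is the sense in which the bilinear model is richer.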

Supervised dictionary learning (SDL) can be performed using three different approaches: reconstructive, generative, and discriminative.

A classical approach to obtaining $\,\alpha$ for either the linear model or the bilinear model is to first adapt the dictionary $\,D$ to the data by solving

$\underset{D,\mathbf{\alpha}}{\min}\sum_{i=1}^{m}\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha}_{i} \right \|_{1},$

which is problem (2). Since the reconstruction errors $\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}$ are invariant to simultaneously scaling $\,D$ by a scalar and $\,\mathbf{\alpha}_i$ by its inverse, it is necessary to constrain the $\,l_2$ norm of the columns of $\,D$. This is a classical constraint in sparse coding.

Reconstructive Approach

In the reconstructive (REC) approach <ref name="Elad2006"/>, the dictionary $\mathbf{\mathit{D}}$ and the coefficients $\mathbf{\alpha}_{i}$ are learned using (2). The parameters $\mathbf{\theta}$ are learned afterwards by solving

$\underset{\mathbf{\theta}}{\min}\sum_{i=1}^{m}\mathcal{C}(y_{i}f(\mathbf{x}_{i},\mathbf{\alpha}_{i},\mathbf{\theta}))+\lambda_{2}\left \| \mathbf{\theta} \right \|_{2}^{2}, \;\;\;(5)$

where $\mathcal{C}$ is the logistic loss function, i.e., $\mathcal{C}(x)=\log(1+e^{-x})$, and $\,\lambda_{2}$ is a regularization parameter that prevents overfitting. Note that the resulting sparse codes $\,\mathbf{\alpha}_i$, one for each signal $\,\mathbf{x}_i$, can be used a posteriori in a regular classifier such as logistic regression.
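As a sketch of the second stage of the REC approach, the code below evaluates the objective (5) for the linear model once the sparse codes are fixed. It assumes that the codes are stored one per row and that $\left \| \mathbf{\theta} \right \|_{2}^{2}$ includes the bias $b$, since $\mathbf{\theta}=\{\mathbf{w},b\}$; the function names are illustrative.

```python
import numpy as np

def logistic_loss(z):
    """C(z) = log(1 + exp(-z)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -z)

def rec_classifier_objective(w, b, alphas, y, lam2):
    """Objective (5) for the linear model f = w^T alpha + b.

    alphas: (m, k) array with one fixed sparse code per row.
    y:      (m,) array of labels in {-1, +1}.
    Assumption: the regulariser ||theta||^2 covers both w and b.
    """
    margins = y * (alphas @ w + b)
    return logistic_loss(margins).sum() + lam2 * (w @ w + b ** 2)
```

Because the codes are fixed, this is an ordinary regularized logistic regression in $\mathbf{\theta}$ and can be minimized with any standard convex solver.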

Generative Approach

Supervised dictionary learning with the generative approach (SDL-G) jointly learns $\mathbf{\mathit{D}}$, $\mathbf{\theta}$, and $\mathbf{\alpha}$ by solving

$\underset{D,\mathbf{\theta},\mathbf{\alpha}}{\min}(\sum_{i=1}^{m}\mathcal{C}(y_{i}f(\mathbf{x}_{i},\mathbf{\alpha}_{i},\mathbf{\theta}))+\lambda_{0}\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha}_{i} \right \|_{1})+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}, \;\;\;(6)$

where $\,\lambda_{0}$ controls the importance of the reconstruction term. The classification procedure involves supervised sparse coding

$\underset{y\in\left \{ -1;+1 \right \}}{\min}\mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,y) , \;\;\;(7)$

with

$\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})=\underset{\mathbf{\alpha}}{\min} \mathcal{S}(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}), \;\;\;(8)$

being the loss for a pair $\,(\mathbf{x}_i, y_i)$, where $\mathcal{S}(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})=\mathcal{C}(y_{i}f(\mathbf{x}_{i},\mathbf{\alpha},\mathbf{\theta}))+\lambda_{0}\left \| \mathbf{x}_{i}-D\mathbf{\alpha} \right \|_{2}^{2}+\lambda_{1}\left \| \mathbf{\alpha} \right \|_{1}$.

The learning procedure in (6) minimizes the sum of the costs for the pairs $(\mathbf{x}_{i},y_{i})_{i=1}^{m}$ and corresponds to a generative model.
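The classification rule (7) can be sketched as follows for the linear model: the supervised cost $\mathcal{S}$ is minimized over $\mathbf{\alpha}$ once for each candidate label, and the label with the smaller optimal cost is returned. The proximal-gradient solver used here is an assumption of this example, not necessarily the solver used in the paper, and all function names are hypothetical.

```python
import numpy as np

def logistic_loss(z):
    return np.logaddexp(0.0, -z)

def S(alpha, x, D, w, b, y, lam0, lam1):
    """Supervised cost: logistic loss + lam0 * reconstruction error + lam1 * l1 penalty."""
    f = w @ alpha + b
    return (logistic_loss(y * f)
            + lam0 * np.sum((x - D @ alpha) ** 2)
            + lam1 * np.sum(np.abs(alpha)))

def S_star(x, D, w, b, y, lam0, lam1, n_iter=300):
    """Approximate min over alpha of S via proximal gradient; returns (cost, alpha)."""
    # Upper bound on the Lipschitz constant of the gradient of the smooth part.
    L = 0.25 * (w @ w) + 2.0 * lam0 * np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = y * (w @ alpha + b)
        # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)) = -0.5 * (1 - tanh(z / 2))
        dloss = -0.5 * (1.0 - np.tanh(0.5 * z))
        grad = dloss * y * w + 2.0 * lam0 * D.T @ (D @ alpha - x)
        v = alpha - grad / L
        alpha = np.sign(v) * np.maximum(np.abs(v) - lam1 / L, 0.0)   # soft thresholding
    return S(alpha, x, D, w, b, y, lam0, lam1), alpha

def classify(x, D, w, b, lam0, lam1):
    """Rule (7): pick the label whose optimal supervised cost is smallest."""
    costs = {y: S_star(x, D, w, b, y, lam0, lam1)[0] for y in (-1, +1)}
    return min(costs, key=costs.get)
```

Note that the sparse code depends on the hypothesized label, so each test signal requires one sparse coding problem per class.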

Discriminative Approach

Although in (7), the different costs $\mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,y)$ of a given signal are compared for each class $\,y= -1, +1$, a more discriminative approach is to make the value of $\mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,-y_{i})$ greater than $\mathcal{S}^{\star }(\mathbf{x},D,\mathbf{\theta} ,y_{i})$, which is the purpose of the logistic loss function $\mathcal{C}$. This leads to

$\underset{D,\mathbf{\theta}}{\min}(\sum_{i=1}^{m}\mathcal{C}(\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,-y_{i})-\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})))+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}. \;\;\;(9)$

However, a mixture of the generative formulation in (6) and its discriminative version in (9) is easier to solve. Hence, this paper proposes a mixed generative/discriminative model for sparse signal representation and classification, in which the dictionary $\mathbf{\mathit{D}}$ and the model $\mathbf{\theta}$ are learned by solving

$\underset{D,\mathbf{\theta}}{\min}(\sum_{i=1}^{m}\mu \mathcal{C}(\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,-y_{i})-\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}))+(1-\mu)\mathcal{S}^{\star }(\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}))+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}, \;\;\;(10)$

where $\,\mu$ controls the trade-off between the reconstruction and discrimination terms. Hereafter, this mixed model is referred to as the supervised dictionary learning-discriminative (SDL-D) model. As before, a constraint is imposed on $\mathbf{\mathit{D}}$ such that $\forall j, \left \| \mathbf{d}_{j} \right \|_{2}\leq 1$, i.e. the norm of each column of $\,D$ is constrained to be at most 1.
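The role of $\,\mu$ can be seen from the contribution of a single training pair to (10). The sketch below assumes that the two optimal costs $\mathcal{S}^{\star}$ (for the wrong label $-y_i$ and the correct label $y_i$) have already been computed; the function and argument names are illustrative.

```python
import math

def logistic_loss(z):
    """Stable evaluation of log(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def mixed_term(s_wrong, s_right, mu):
    """Per-sample term of (10).

    s_wrong: S*(x_i, D, theta, -y_i), the optimal cost under the wrong label.
    s_right: S*(x_i, D, theta,  y_i), the optimal cost under the correct label.
    """
    return mu * logistic_loss(s_wrong - s_right) + (1.0 - mu) * s_right
```

With `mu = 0` only the generative cost of the correct label remains, as in (6); with `mu = 1` only the discriminative term of (9) remains, which is small when the wrong label is much more costly than the correct one.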

Multiclass Extension

The extension of all these formulations to multiclass problems is straightforward: the logistic loss is replaced by softmax discriminative cost functions $\mathcal{C}_{i}(x_{1},...,x_{p})=\log(\sum_{j=1}^{p}e^{x_{j}-x_{i}})$, which are multiclass versions of the logistic function, and one model $\mathbf{\theta}_{i}$ is learned per class.
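A short sketch of the softmax cost follows, using a 0-based class index and the usual max-shift for numerical stability; the function name is an assumption of this example. For $p=2$ the cost reduces to the logistic loss, since $\mathcal{C}_{1}(x_{1},x_{2})=\log(1+e^{-(x_{1}-x_{2})})$.

```python
import math

def softmax_cost(xs, i):
    """C_i(x_1, ..., x_p) = log(sum_j exp(x_j - x_i)), with a 0-based index i."""
    shifted = [xj - xs[i] for xj in xs]
    m = max(shifted)   # subtract the maximum before exponentiating to avoid overflow
    return m + math.log(sum(math.exp(s - m) for s in shifted))
```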

Optimization Procedure

The algorithm for supervised dictionary learning is as follows.

Input: n (signal dimension); $(\mathbf{x}_{i},y_{i})_{i=1}^{m}$ (training signals); k (size of the dictionary); $\lambda_{0}, \lambda_{1}, \lambda_{2}$ (parameters); $0\leq\mu_{1}\leq\mu_{2}\leq...\leq\mu_{m}\leq1$ (increasing sequence).
Output: $D\in \mathbb{R}^{n\times k}$ (dictionary); $\mathbf{\theta}$ (model).
Initialization: Set $\mathbf{\mathit{D}}$ to a random Gaussian matrix with normalized columns. Set $\mathbf{\theta}$ to zero.
Loop: For $\mu=\mu_{1},...,\mu_{m},$
Loop: Repeat until convergence (or a fixed number of iterations),
Supervised sparse coding: Solve, for all $i=1,..., m$

$\left\{\begin{matrix} \mathbf{\alpha}_{i,-}^{\star} =\arg\min_{\mathbf{\alpha}}\mathcal{S}(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,-1)\\ \mathbf{\alpha}_{i,+}^{\star} =\arg\min_{\mathbf{\alpha}}\mathcal{S}(\mathbf{\alpha},\mathbf{x}_{i},D,\mathbf{\theta} ,+1) \end{matrix}\right.. \;\;\;(11)$

Dictionary and model update: Solve

$\underset{D,\mathbf{\theta}}{\min}(\sum_{i=1}^{m}\mu \mathcal{C}(\mathcal{S}(\mathbf{\alpha}_{i,-}^{\star},\mathbf{x}_{i},D,\mathbf{\theta} ,-y_{i})-\mathcal{S}(\mathbf{\alpha}_{i,+}^{\star},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i}))+(1-\mu)\mathcal{S}(\mathbf{\alpha}_{i,y_{i}}^{\star},\mathbf{x}_{i},D,\mathbf{\theta} ,y_{i})+\lambda_{2}\left \|\mathbf{\theta} \right \|_{2}^{2}) \;\;\mathbf{s.t.}\; \forall j,\left \| \mathbf{d}_{j} \right \|_{2}\leq 1. \;\;\;(12)$
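To give a feel for the constrained update in (12), the sketch below performs one projected-gradient step on the dictionary in the special case $\mu=0$, where the only term that depends on $D$ is the reconstruction error $\lambda_{0}\sum_{i}\left \| \mathbf{x}_{i}-D\mathbf{\alpha}_{i} \right \|_{2}^{2}$. Projected gradient descent is an assumption of this example rather than a statement of how the paper solves (12). The initial dictionary follows the initialization step of the algorithm (random Gaussian with normalized columns), while the data and codes are random placeholders.

```python
import numpy as np

def project_columns(D):
    """Project each column of D onto the unit l2 ball (the constraint in (12))."""
    return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)

def dictionary_step(D, X, A, lam0):
    """One projected-gradient step on lam0 * ||X - D A||_F^2 s.t. ||d_j||_2 <= 1.

    X: (n, m) array with one signal per column.
    A: (k, m) array with the corresponding fixed sparse codes.
    """
    L = 2.0 * lam0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    grad = 2.0 * lam0 * (D @ A - X) @ A.T
    return project_columns(D - grad / L)

rng = np.random.default_rng(0)
n, k, m = 6, 10, 40
D0 = rng.standard_normal((n, k))
D0 /= np.linalg.norm(D0, axis=0)                 # initialization from the algorithm
A = rng.standard_normal((k, m)) * (rng.random((k, m)) < 0.2)   # placeholder sparse codes
X = rng.standard_normal((n, m))                  # placeholder training signals
D1 = dictionary_step(D0, X, A, lam0=1.0)
```

For $\mu>0$ the gradient would also include the discriminative term, and $\mathbf{\theta}$ would be updated in the same step.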

<references />