Overview

This paper<ref>Kavukcuoglu, K, Sermanet, P, Boureau, Y, Gregor, K, Mathieu, M, and Cun, Y. . Learning convolutional feature hierarchies for visual recognition. In Advances in neural information processing systems, 1090-1098, 2010.</ref> describes methods for learning features extracted through convolutional feature banks. In particular, it gives methods for using sparse coding convolutionally. In sparse coding, the sparse feature vector z is constructed to reconstruct the input x with a dictionary D. The procedure produces a code z* by minimizing the energy function:

$L(x,z,D) = \frac{1}{2}||x-Dz||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D)$

D is obtained by minimizing the above with respect to D: $\underset{z,D}{\operatorname{arg\ min}} \ L(x,z,D)$, averaged over the training set. The drawbacks to this method are the the representation is redundant and that the inference for a whole image is computationally expensive. The reason is that the system is trained on single image patches in most applications of sparse coding to image analysis, which produces a dictionary of filters that are essentially shifted versions of each other over the patch and reconstruct in isolation.

This first problem can be addressed by applying sparse coding to the entire image and treating the dictionary as a convolutional filter bank.

$L(x,z,D) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + |z|_1$

Where Dk is an s×s filter kernel, x is a w×h image, zk is a feature map of dimension (w+s-1)×(h+s-1), and * denotes the discrete convolution operator.

The second problem can be addressed by using a trainable feed-forward encoder to approximate the sparse code:

$L(z,z,D,W) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + \sum_{k=1}^K||z_k - f(W^k*x)||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ max}} \ L(x,z,D,W)$

Where Wk is an encoding convolutional kernel of size s×s, and f is a point-wise non-linear function. Both the form of f and the method to find z* are discussed below.

The contribution of this paper is to address these two issues simultaneously, thus allowing convolutional approaches to sparse coding.

Method

The authors extend the coordinate descent sparse coding algorithm detailed in <ref>Li, Y and Osher, S. Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm. CAM Report, pages 09–17.</ref> to use convolutional methods.

Two considerations for learning convolution dictionaries are:

1. Boundary effects due to convolution must be handled.
2. Derivatives should be calculated efficiently.

function ConvCoD$\, (x,D,\alpha)$

Set: $\, S = D^T*D$
Initalize: $\, z = 0;\ \beta = D^T * mask(x)$
Require: $\, h_\alpha:$: smooth thresholding function
repeat
$\, \bar{x} = h_\alpha(\beta)$
$\, (k,p,q) = \underset{i,m,n}{\operatorname{arg\ max}} |z_{imn}-\bar{z_{imn}}|$ (k: dictionary index, (p,q) location index)
$\, bi = \beta_{kpq}$
\, \beta = \beta + (z_kpg - \bar{z_{kpg}}) \times align(S(:,k,:,:),(p,q)) **
$\, z_{kpg} = \bar{z_{kpg}},\ \beta_{kpg} = bi$
until change in $z$ is below a threshold
end function

** MATLAB notation is used for slicing the tensor.

In the above, $\beta = D^T * mask(x)$ is use to handle boundary effects, where mask operates term by term and either puts zeros or scales down the boundaries.

The learning procedure is then stochastic gradient descent over the dictionary D, where the columns of D are normalized after each iteration.

$\forall x^i \in X$ training set: $z* = \underset{z}{\operatorname{arg\ max}}\ L(x^i,z,d), D \leftarrow D - \eta \frac {\partial(L,x^i,z^*,D}{\partial D}$

Two encoder architectures are tested. The first is steepest descent sparse coding with tanh encoding function using $g^k \times tanh(x*W^k)$, which does not include a shrinkage operator. Thus the ability to produce sparse representations is very limited.

The second is convolutional CoD sparse coding with a smooth shrinkage operator as defined below.

$\tilde{z}=sh_{\beta^k,b^k}(x*W^k)$ where k = 1..K.
$sh_{\beta^k,b^k}(s) = sign(s) \times 1/\beta^k \log(\exp(\beta^k \times |s|) - 1) - b^k$

where $\beta$ controls the smoothness of the kink of shrinkage operator and b controls the location of the kink. The second system in more efficient in training, but the performance for both systems are almost identical.

The convolutional encoder can also be used in multi-stage object recognition architectures. For each stage, the encoder is followed by absolute value rectification, contrast normalization and average subsampling.

Experiments

Two systems are used:

1. Steepest descent sparse coding with tanh encoder: $SD^{tanh}$
2. Coordinate descent sparse coding with shrink encoder: $CD^{shrink}$

Object Recognition using Caltech-101 Dataset

In the Caltech-101 dataset, each image contains a single object. Each image is processed by converting to grayscale and resizing, followed by contrast normalization. All results use 30 training samples per class and 5 different choices of the training set.

Architecture: 64 features are extracted by the first layer, followed by a second layer that produces 256 features. Second layer features are connected to first layer features by a sparse connection table.

First Layer: Both systems are trained using 64 dictionary elements, whether each dictionary item is a 9×9 convolution kernel. Both systems are trained for 10 sparsity values from 0.1-3.0.

Second Layer: In the second layer, each of 256 feature maps in connected to 16 randomly selected input features from the first layer.

One Stage System: In these results, the input is passed to the first layer, followed by absolute value rectification, contrast normalization, and average pooling. The output of the first layer is fed to a logistic classifier followed by the PMK-SVN classifier used in <ref>Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06, 2:2169–2178, 2006.</ref>.

Two Stage System: These results use both layers, followed by absolute value rectification, contrast normalization, and average pooling. Finally, a multinomial logistic regression classifier is used.

In the above, U represents one stage, UU represents two stages, and '+' represents supervise training is performed afterwards.

Pedestrian Detection

The architecture is trained and evaluated on the INRIA Pedestrian dataset <ref>Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR’05, volume 2, pages 886–893, June 2005.</ref> which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, the dataset is augmented with minor translations and scaling, giving a total of 11370 examples for training and 1000 images for classification. The negative examples are augmented with larger scale variations to avoid false positives, giving a total of 9001 samples for training and 1000 for validation.

The architecture for the pedestrian detection task is similar to that describe in the previous section. It was trained both with and without unsupervised initialization, followed by supervised training. After one pass of training, the negative set was augmented with the 10 most offending samples on each full negative image.

Discussion

• The paper presented an efficient method for convolutional training of feature extractors.
• The resulting features look intuitively better than those obtained through non-convolutional methods, but classification results are only slightly better (where they're better at all) than existing methods.
• It's not clear what effects in the pedestrian experiment are due to the method of preprocessing and variations on the dataset (scaling and translation) and which are due to the architecture itself. Comparisons are with other systems that processed input differently.