learning Convolutional Feature Hierarchies for Visual Recognition: Difference between revisions

From statwiki
Jump to navigation Jump to search
(Created page with "=Overview= This paper<ref>Kavukcuoglu, K, Sermanet, P, Boureau, Y, Gregor, K, Mathieu, M, and Cun, Y. . Learning convolutional feature hierarchies for visual recognition. In Adv...")
 
Line 101: Line 101:


=Discussion=
=Discussion=
*
*The paper presented an efficient method for convolutional training of feature extractors.
*
*The resulting features look intuitively better than those obtained through non-convolutional methods, but classification results are only slightly better (where they're better at all) than existing methods.
*
*It's not clear what effects in the pedestrian experiment are due to the method of preprocessing and variations on the dataset (scaling and translation) and which are due to the architecture itself. Comparisons are with other systems that processed input differently.


=References=
=References=
<references />
<references />

Revision as of 19:43, 18 November 2015

Overview

This paper<ref>Kavukcuoglu, K, Sermanet, P, Boureau, Y, Gregor, K, Mathieu, M, and Cun, Y. . Learning convolutional feature hierarchies for visual recognition. In Advances in neural information processing systems, 1090-1098, 2010.</ref> describes methods for learning features extracted through convolutional feature banks. In particular, it gives methods for using sparse coding convolutionally. In sparse coding, the sparse feature vector z is constructed to reconstruct the input x with a dictionary D. The procedure produces a code z* by minimizing the energy function:

[math]\displaystyle{ L(x,z,D) = \frac{1}{2}||x-Dz||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ min}} \ L(x,z,D) }[/math]

D is obtained by minimizing the above with respect to D: [math]\displaystyle{ \underset{z,D}{\operatorname{arg\ min}} \ L(x,z,D) }[/math], averaged over the training set. The drawbacks to this method are the the representation is redundant and that the inference for a whole image is computationally expensive.

This first problem can be addressed by applying sparse coding to the entire image and treating the dictionary as a convolutional filter bank.

[math]\displaystyle{ L(x,z,D) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + |z|_1 }[/math]

Where Dk is an s×s filter kernel, x is a w×h image, zk is a feature map of dimension (w+s-1)×(h+s-1), and * denotes the discrete convolution operator.

The second problem can be addressed by using a trainable feed-forward encoder to approximate the sparse code:

[math]\displaystyle{ L(z,z,D,W) = \frac{1}{2}||x - \sum_{k=1}^K D_k * z_k ||_2^2 + \sum_{k=1}^K||z_k - f(W^k*x)||_2^2 + |z|_1, \ \ \ z^* = \underset{z}{\operatorname{arg\ max}} \ L(x,z,D,W) }[/math]

Where Wk is an encoding convolutional kernel of size s×s, and f is a point-wise non-linear function. Both the form of f and the method to find z* are discussed below.

Method

The authors extend the coordinate descent sparse coding algorithm detailed in <ref>Li, Y and Osher, S. Coordinate Descent Optimization for l1 Minimization with Application to Compressed Sensing; a Greedy Algorithm. CAM Report, pages 09–17.</ref> to use convolutional methods.

Two considerations for learning convolution dictionaries are:

  1. Boundary effects due to convolution must be handled.
  2. Derivatives should be calculated efficiently.

function ConvCoD[math]\displaystyle{ \, (x,D,\alpha) }[/math]

Set: [math]\displaystyle{ \, S = D^T*D }[/math]
Initalize: [math]\displaystyle{ \, z = 0;\ \beta = D^T * mask(x) }[/math]
Require: [math]\displaystyle{ \, h_\alpha: }[/math]: smooth thresholding function
repeat
[math]\displaystyle{ \, \bar{x} = h_\alpha(\beta) }[/math]
[math]\displaystyle{ \, (k,p,q) = \underset{i,m,n}{\operatorname{arg\ max}} |z_{imn}-\bar{z_{imn}}| }[/math] (k: dictionary index, (p,q) location index)
[math]\displaystyle{ \, bi = \beta_{kpq} }[/math]
[math]\displaystyle{ \, \beta = \beta + (z_kpg - \bar{z_{kpg}}) \times align(S(:,k,:,:),(p,q)) }[/math] **
[math]\displaystyle{ \, z_{kpg} = \bar{z_{kpg}},\ \beta_{kpg} = bi }[/math]
until change in [math]\displaystyle{ z }[/math] is below a threshold
end function

** MATLAB notation is used for slicing the tensor.

In the above, [math]\displaystyle{ \beta = D^T * mask(x) }[/math] is use to handle boundary effects, where mask operates term by term and either puts zeros or scales down the boundaries.

The learning procedure is then stochastic gradient descent over the dictionary D, where the columns of D are normalized after each iteration.

[math]\displaystyle{ \forall x^i \in X }[/math] training set: [math]\displaystyle{ z* = \underset{z}{\operatorname{arg\ max}}\ L(x^i,z,d), D \leftarrow D - \eta \frac {\partial(L,x^i,z^*,D}{\partial D} }[/math]

Two encoder architectures are tested. The first is steepest descent sparse coding with tanh encoding function using [math]\displaystyle{ g^k \times tanh(x*W^k) }[/math].

The second is convolutional CoD sparse coding with a smooth shrinkage operator as defined below.

[math]\displaystyle{ \tilde{z}=sh_{\beta^k,b^k}(x*W^k) }[/math] where k = 1..K.
[math]\displaystyle{ sh_{\beta^k,b^k}(s) = sign(s) \times 1/\beta^k \log(\exp(\beta^k \times |s|) - 1) - b^k }[/math]

Experiments

Two systems are used:

  1. Steepest descent sparse coding with tanh encoder: [math]\displaystyle{ SD^{tanh} }[/math]
  2. Coordinate descent sparse coding with shrink encoder: [math]\displaystyle{ CD^{shrink} }[/math]

Object Recognition using Caltech-101 Dataset

In the Caltech-101 dataset, each image contains a single object. Each image is processed by converting to grayscale and resizing, followed by contrast normalization. All results use 30 training samples per class and 5 different choices of the training set.

Architecture: 64 features are extracted by the first layer, followed by a second layer that produces 256 features. Second layer features are connected to first layer features by a sparse connection table.

First Layer: Both systems are trained using 64 dictionary elements, whether each dictionary item is a 9×9 convolution kernel. Both systems are trained for 10 sparsity values from 0.1-3.0.

Second Layer: In the second layer, each of 256 feature maps in connected to 16 randomly selected input features from the first layer.

One Stage System: In these results, the input is passed to the first layer, followed by absolute value rectification, contrast normalization, and average pooling. The output of the first layer is fed to a logistic classifier followed by the PMK-SVN classifier used in <ref>Lazebnik, S, Schmid, C, and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR’06, 2:2169–2178, 2006.</ref>.

Two Stage System: These results use both layers, followed by absolute value rectification, contrast normalization, and average pooling. Finally, a multinomial logistic regression classifier is used.

In the above, U represents one stage, UU represents two stages, and '+' represents supervise training is performed afterwards.

Pedestrian Detection

The architecture is trained and evaluated on the INRIA Pedestrian dataset <ref>Dalal, N and Triggs, B. Histograms of oriented gradients for human detection. In Schmid, C, Soatto, S, and Tomasi, C, editors, CVPR’05, volume 2, pages 886–893, June 2005.</ref> which contains 2416 positive examples (after mirroring) and 1218 negative full images. For training, the dataset is augmented with minor translations and scaling, giving a total of 11370 examples for training and 1000 images for classification. The negative examples are augmented with larger scale variations to avoid false positives, giving a total of 9001 samples for training and 1000 for validation.

The architecture for the pedestrian detection task is similar to that describe in the previous section. It was trained both with and without unsupervised initialization, followed by supervised training. After one pass of training, the negative set was augmented with the 10 most offending samples on each full negative image.

Discussion

  • The paper presented an efficient method for convolutional training of feature extractors.
  • The resulting features look intuitively better than those obtained through non-convolutional methods, but classification results are only slightly better (where they're better at all) than existing methods.
  • It's not clear what effects in the pedestrian experiment are due to the method of preprocessing and variations on the dataset (scaling and translation) and which are due to the architecture itself. Comparisons are with other systems that processed input differently.

References

<references />