stat441w18/Saliency-based Sequential Image Attention with Multiset Prediction

Revision as of 11:42, 15 March 2018 by Y884wang

Presented by

1. Alice Wang

2. Robert Huang

3. Yufeng Wang

4. Renato Ferreira

5. Being Fan

6. Xiaoni Lang

7. Xukun Liu

8. Handi Gao

Introduction

Current techniques achieve high performance in image classification. However, they often exhibit unexpected and unintuitive behaviour: minor perturbations to the input can cause a complete misclassification. In addition, a classifier may label an image correctly while completely missing the object in question (for example, classifying an image containing a polar bear correctly because of the snowy setting).

To remedy this, we can either isolate the object and its surroundings and re-evaluate whether the classifier still performs adequately, or we can apply a saliency detection method to determine the focus of the classifier, and to understand how the classifier makes its decisions.

A commonly used method for saliency detection takes an image, then recursively removes sections of the image and evaluates the impact on the accuracy of the classification. The smallest region that causes the biggest impact on the classification score makes up our saliency map. However, this iterative method is computationally intensive and thus time-consuming.
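
To give a feel for why this kind of approach is expensive, here is a minimal Python sketch of a simplified, single-pass occlusion variant: each patch is replaced by the mean colour and the drop in class score is recorded. The names `occlusion_saliency` and `toy_score`, the patch size, and the toy "classifier" are all assumptions made for illustration; a real setup would call a trained network once per patch, which is where the cost comes from.

```python
import numpy as np

def occlusion_saliency(image, score_fn, patch=4):
    """Score drop when each patch is replaced by the mean colour.

    image: float array of shape (H, W).
    score_fn: hypothetical callable returning the classifier's score
    for the class of interest.
    """
    base = score_fn(image)
    h, w = image.shape
    saliency = np.zeros((h, w))
    fill = image.mean()
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            # A large drop means the patch mattered to the classifier.
            saliency[i:i + patch, j:j + patch] = base - score_fn(occluded)
    return saliency

# Toy stand-in for a classifier: the score is the mean brightness of the
# top-left 4x4 corner, so only that corner should come out as salient.
def toy_score(img):
    return float(img[:4, :4].mean())

img = np.zeros((8, 8))
img[:4, :4] = 1.0
sal = occlusion_saliency(img, toy_score)
```

Even this simplified version needs one classifier evaluation per patch, and an iterative search over regions needs many more.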

This paper proposes a new saliency detection method that uses a trained model to predict the saliency map from a single feed-forward pass. The resulting saliency detection is not only orders of magnitude faster; benchmarks against standard saliency detection methods also show that it produces higher quality saliency masks and achieves better localization results.

Related Works

Numerous methods for saliency detection have been proposed since the rise of CNNs in 2012. One such method uses gradient calculations to find the region with the greatest gradient magnitude, under the assumption that such a region is a valid salient region. Another suggests the use of guided back-propagation (Springenberg et al., 2014), which takes into account the error signal for gradient adjustment, and excitation back-propagation (Zhang et al., 2016), which uses non-negatively weighted connections to find meaningful saliency regions. However, these gradient-based methods, while fast, produce sub-par saliency maps that are difficult to interpret, adjust, or improve upon.

An alternative approach (Zhou et al., 2014) takes an image, then iteratively removes patches (setting their colour to the mean) such that the class score is preserved. This gives an excellent explanatory map that is easily human-interpretable. However, the iterative method is time-consuming and unsuited to real-time saliency applications.

Another technique (Cao et al., 2015) that has been proposed is to selectively ignore network activations, such that the resulting subset of the activations provides us with the highest class score. This method, again, uses an iterative process that is unsuited for real-time application.

Fong and Vedaldi (2017) propose a similar solution, but instead of preserving the class score, they aim to maximally reduce the class score by removing minimal sections of the image. In doing so, they end up with a saliency map that is easily human-interpretable and model-agnostic (on account of performing the optimization in the image space and not on the model itself).


Image Saliency and Introduced Evidence

No single metric can measure the quality of a produced saliency map.

A saliency map is defined as a summarised explanation of where the classifier “looks” to make its prediction.

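There are two saliency definitions:

1. Smallest sufficient region (SSR)

2. Smallest destroying region (SDR)

The SSR is the smallest region that alone allows a confident classification. It tends to be small and hard for a human to recognise, yet it still contains enough information for the classifier to predict the class with about 90% confidence. The SDR is the smallest region that, when removed, prevents a confident classification. It tends to be larger than the SSR.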

There are several ways to remove evidence from an image: blurring the region, setting it to a constant colour, adding noise, or cropping it out completely. However, each of them introduces new evidence as a side effect, which can itself lead to misclassification.

Fighting the Introduced Evidence

Consider applying a mask M to an image X to obtain the edited image E.

[math]\displaystyle{ E = X \odot M }[/math]

This operation sets the regions where M is 0 to the zero colour.

However, the mask M can itself generate adversarial artifacts: very small changes that are imperceptible to people but can completely change the output of the classifier. This should be avoided.

There are a few ways to make the introduction of artifacts harder. One is to change how the mask is applied, so as to reduce the amount of unwanted evidence:

[math]\displaystyle{ E = X \odot M + A \odot (1 - M) }[/math]

where A is an alternative image. A can be chosen as a blurred version of X, which makes it harder to generate high-frequency, high-evidence artifacts. However, blurring does not eliminate all of the existing evidence.

Another choice of A is a random colour with high-frequency noise added. This makes E less predictable in regions where M is low.

Adversarial artifacts can still occur, however. It is therefore necessary to encourage smoothness of M via a total variation (TV) penalty. We can also generate a smaller mask and resize it to the required size, since resizing acts as a smoothing mechanism.
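
The masking formula and the smoothness penalty discussed above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: images are single-channel arrays, a crude box blur (`box_blur`, a hypothetical helper) stands in for whatever blur is actually used, and the total variation is written in one common absolute-difference form, which may differ from the exact definition in the paper.

```python
import numpy as np

def box_blur(x, k=3):
    """Crude k-by-k box blur with edge padding (stand-in for a real blur)."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for di in range(k):
        for dj in range(k):
            out += xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out / (k * k)

def apply_mask(x, m, a):
    """E = X * M + A * (1 - M), computed elementwise."""
    return x * m + a * (1.0 - m)

def total_variation(m):
    """Sum of absolute differences between neighbouring mask values."""
    return np.abs(np.diff(m, axis=0)).sum() + np.abs(np.diff(m, axis=1)).sum()

rng = np.random.default_rng(0)
x = rng.random((8, 8))

# A smooth mask (one solid block) and a high-frequency mask (checkerboard).
block_mask = np.zeros((8, 8))
block_mask[2:6, 2:6] = 1.0
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)

e = apply_mask(x, block_mask, box_blur(x))
tv_block = total_variation(block_mask)
tv_checker = total_variation(checker)
```

The checkerboard mask has a much larger total variation than the solid block, so a TV penalty pushes the optimisation away from such high-frequency masks.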

A New Saliency Metric

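A new saliency metric is introduced in place of traditional measures. The tightest rectangular crop containing the salient region is fed to the classifier, and the metric is defined as the difference between the log of the area of the crop and the log of the probability that the classifier assigns to the correct class, i.e.

[math]\displaystyle{ s(a,p) = \log(\tilde{a}) - \log(p) }[/math]

where a is the area of the crop as a fraction of the image, [math]\displaystyle{ \tilde{a} = \max(a, 0.05) }[/math], and p is the probability of the correct class. A lower value indicates a better saliency map: a small crop that the classifier still recognises confidently.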
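
As a rough illustration of the metric, the following Python sketch computes it from a mask and a class probability. Several things here are assumptions rather than details from the text: the helper names, the 0.5 threshold used to find the crop, and the reading of the tilde as a lower clip of the area at 0.05 so that very small crops are not rewarded without bound.

```python
import numpy as np

def tightest_crop(mask, threshold=0.5):
    """Bounding box (top, bottom, left, right) of mask values above threshold.

    Assumes at least one mask value exceeds the threshold.
    """
    rows, cols = np.where(mask > threshold)
    return int(rows.min()), int(rows.max()) + 1, int(cols.min()), int(cols.max()) + 1

def saliency_metric(area_fraction, prob, min_area=0.05):
    """s(a, p) = log(max(a, min_area)) - log(p); lower is better."""
    return np.log(max(area_fraction, min_area)) - np.log(prob)

mask = np.zeros((10, 10))
mask[2:7, 3:8] = 1.0
top, bottom, left, right = tightest_crop(mask)
area = (bottom - top) * (right - left) / mask.size

# Hypothetical probabilities: suppose the classifier is 90% confident
# both on the tight crop and on the full image.
s_tight = saliency_metric(area, 0.9)
s_full = saliency_metric(1.0, 0.9)
```

With equal confidence, the tighter crop gets the lower (better) score, which is the behaviour the metric is meant to reward.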

Masking Model

The masking model depends on three main elements: a black-box classification model, the architecture of the masking model itself, and the masking loss function.

Any classification model can be used as the black-box element; examples are GoogleNet and ResNet. The purpose of this model is to judge how good the masking model is at selecting the salient region of images.

The masking model has the following architecture. First, a convolutional downsampling stage goes from the pixel representation of an image to progressively more coarse-grained feature representations. Then, the coarsest feature representation is given to the feature filter, which applies a non-linearity to select which coarse-grained positions to keep or drop from the image. Note that the feature filter has access to the true label so that it can select relevant portions of the image. Finally, the filtered features are upsampled to a higher resolution, using the features captured in the downsampling stage to fine-tune the saliency map (e.g., to fit it to edges and other details of the image). This gives the final output mask of the masking model.

The loss function is a trade-off that tries to achieve four goals at the same time: low total variation of the mask (so as to prevent brittle, high-frequency masks); low average mask value (so as to make the mask small); high output class probability for the masked image (the kept part should be highly relevant); and low “inverse-masked” output probability (the dropped part should not be very relevant).

The model is trained to minimize this loss, using the masking model to learn the output mask, and the black-box model to provide classification probabilities. This ensures that a small and simple mask which selects salient regions for classification is learned.
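
The four-way trade-off described above can be sketched as a single scalar function. This is a hedged toy version, not the loss from the paper: the function name `masking_loss`, the weights, and the exact form of each term are placeholders, and the two probabilities are passed in directly instead of being computed by a black-box classifier.

```python
import numpy as np

def masking_loss(mask, p_masked, p_inverse, w_tv=1.0, w_area=1.0, w_inv=1.0):
    """Toy combination of the four terms; all weights are placeholders.

    mask: array with values in [0, 1].
    p_masked: class probability the black box gives the masked image.
    p_inverse: class probability it gives the inverse-masked image.
    """
    tv = np.abs(np.diff(mask, axis=0)).sum() + np.abs(np.diff(mask, axis=1)).sum()
    area = mask.mean()
    # Low TV, low area, high p_masked and low p_inverse all reduce the loss.
    return w_tv * tv + w_area * area - np.log(p_masked) + w_inv * p_inverse

block = np.zeros((8, 8))
block[2:6, 2:6] = 1.0
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)

good = masking_loss(block, p_masked=0.9, p_inverse=0.1)
bad = masking_loss(block, p_masked=0.2, p_inverse=0.8)
jagged = masking_loss(checker, p_masked=0.9, p_inverse=0.1)
```

In this toy setting, the same mask scores worse when the classifier is less confident on the masked image and more confident on the removed part, and a high-frequency mask scores worse than a smooth one with the same probabilities.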

To help prevent overfitting, two optimizations are important: first, sometimes a “fake”, incorrect label is provided to the masking model. This encourages the model to only select a salient region when it sees the true label, since when it sees the wrong label, the only way to reduce the loss function is to drop the entire image. For example, it should not select a dog from the image if the label is “cat”. Second, the mask is applied using a random choice of either blurred image, or random colour with Gaussian noise. This ensures that the image given to the classifier is “unpredictable” outside the masked region, reducing the possibility of adversarial issues and improving the quality of the saliency mask.
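
The two randomisation steps above could look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the helper names, the fake-label probability `p_fake`, the 50/50 choice between the two alternative images, the noise level, and the very cheap wrap-around blur.

```python
import numpy as np

def sample_label(true_label, num_classes, rng, p_fake=0.3):
    """Return (label, is_true). With probability p_fake pick a wrong label."""
    if rng.random() < p_fake:
        wrong = [c for c in range(num_classes) if c != true_label]
        return int(rng.choice(wrong)), False
    return true_label, True

def sample_alternative(x, rng, noise_std=0.1):
    """Alternative image A: a blurred copy of x, or a random colour plus noise."""
    if rng.random() < 0.5:
        # Cheap blur: average each pixel with its 4 neighbours (edges wrap).
        return (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
                + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0
    # One random grey level for the whole image, plus Gaussian noise.
    return rng.random() + noise_std * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.random((8, 8))
draws = [sample_label(3, 10, rng) for _ in range(1000)]
fake_fraction = sum(1 for _, is_true in draws if not is_true) / len(draws)
a = sample_alternative(x, rng)
```

When `is_true` is false, the training target would be an empty mask, which matches the idea that the model should select nothing for a label that is not present in the image.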

Experiments

The experiments compare the masks generated by the masking model, trained using GoogleNet, with those of existing methods. Three evaluation measures are used to assess them.

Weakly Supervised Object Localisation Error

Object localisation accuracy is a standard way to evaluate produced saliency maps. The table below reports the localisation errors on the ImageNet validation set for different weakly supervised methods, including the one proposed in this paper. Note that, for comparison purposes, all saliency maps are produced for the GoogleNet classifier.

It can be seen that this model outperforms the other approaches. It also performs significantly better than the baseline (a centrally placed box) and than iteratively optimised saliency masks.

Saliency Metric

The authors also evaluate the saliency metric introduced in this paper; the results are in the following table.

The masking model achieves a considerably better saliency metric than the other saliency approaches; in particular, it significantly outperforms the max box and center box baselines. Its score is also very close to that of the ground truth boxes, which supports the claim that the localisation boxes generated by the masking model are about as interpretable as the ground truth boxes.

Detecting Saliency of CIFAR-10

Finally, the authors test their model on CIFAR-10, a completely different dataset, to confirm that it also works on low-resolution images. To accommodate this change, they slightly modify the architecture and use a FitNet trained to 92% validation accuracy (so it is not a pre-trained model) as the black-box classifier for training the masking model. Here are the saliency maps generated by the masking model.

In the produced maps we can still recognize the original objects, which confirms that the masking model works even at low resolution and can be applied to a non-pre-trained model (FitNet).