stat441w18/Saliency-based Sequential Image Attention with Multiset Prediction
Presented by
1. Alice Wang
2. Robert Huang
3. Yufeng Wang
4. Renato Ferreira
5. Being Fan
6. Xiaoni Lang
7. Xukun Liu
8. Handi Gao
Introduction
We are able to achieve high performance in image classification using current techniques; however, these techniques often exhibit unexpected and unintuitive behaviour, allowing minor perturbations to cause a complete misclassification. In addition, a classifier may accurately label an image while completely missing the object in question (for example, classifying an image containing a polar bear correctly because of the snowy setting).
To remedy this, we can either isolate the object and its surroundings and re-evaluate whether the classifier still performs adequately, or we can apply a saliency detection method to determine the focus of the classifier, and to understand how the classifier makes its decisions.
A commonly used method for saliency detection takes an image, then recursively removes sections of the image and evaluates the impact on the accuracy of the classification. The smallest region that causes the biggest impact on the classification score makes up our saliency map. However, this iterative method is computationally intensive and thus time-consuming.
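The iterative approach can be sketched as follows. The 4x4 "image", the patch size, and the toy classifier below are hypothetical stand-ins for a real network; the point is only that saliency is measured as the score drop when a region is occluded:

```python
import numpy as np

def occlusion_saliency(image, classify, patch=2, fill=0.0):
    """Occlude one patch at a time; the drop in the class score when a
    patch is hidden is taken as that patch's saliency."""
    base = classify(image)
    sal = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            sal[i:i + patch, j:j + patch] = base - classify(occluded)
    return sal

# Toy "classifier": the score is the mean intensity of the top-left quadrant.
img = np.zeros((4, 4))
img[:2, :2] = 1.0
sal = occlusion_saliency(img, lambda x: x[:2, :2].mean())
```

Only the top-left patch changes the score, so only it receives non-zero saliency. A real pipeline repeats this scan with a CNN's class probability, which is what makes the method slow.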
This paper proposes a new saliency detection method that uses a trained model to predict the saliency map from a single feed-forward pass. The resulting saliency detection is not only orders of magnitude faster, but benchmarks against standard saliency detection methods also show that it produces higher-quality saliency masks and achieves better localization results.
Related Works
Numerous methods for saliency detection have been proposed as CNNs came to dominate image classification. One such method uses gradient calculations to find the region with the greatest gradient magnitude, under the assumption that such a region is a valid salient region. Another suggests the use of guided back-propagation (Springenberg et al., 2014), which takes into account the error signal for gradient adjustment, and excitation back-propagation (Zhang et al., 2016), which uses non-negatively weighted connections to find meaningful saliency regions. However, these gradient-based methods, while fast, provide a sub-par saliency map that is difficult to interpret, adjust, or improve upon.
An alternative approach (Zhou et al., 2014) takes an image, then iteratively removes patches (setting their colour to the mean) such that the class score is preserved. This gives us an excellent, easily human-interpretable explanatory map. However, the iterative method is time-consuming and unsuited for real-time saliency applications.
Another technique (Cao et al., 2015) that has been proposed is to selectively ignore network activations, such that the resulting subset of the activations provides us with the highest class score. This method, again, uses an iterative process that is unsuited for real-time application.
Fong and Vedaldi (2017) propose a similar solution, but instead of preserving the class score, they aim to maximally reduce it by removing minimal sections of the image. In doing so, they end up with a saliency map that is easily human-interpretable and model-agnostic (on account of performing the optimization in image space rather than on the model itself).
Image Saliency and Introduced Evidence
No single metric can measure the quality of a produced saliency map.
A saliency map is defined as a summarised explanation of where the classifier “looks” to make its prediction.
There are two saliency definitions:
1. Smallest sufficient region (SSR): the smallest region that alone lets the classifier make a confident, correct prediction.
2. Smallest destroying region (SDR): the smallest region whose removal prevents a confident, correct prediction.
An SSR is small and can be hard for a human to recognize, but it contains the important information (the classifier remains roughly 90% accurate on it). An SDR is typically larger.
There are several ways of removing evidence from an image: deleting the region, setting it to a constant colour, adding noise, or cropping it out completely. However, all of these bring side effects: the edit itself introduces new evidence, which can cause misclassification.
Fighting the Introduced Evidence
Consider the case of applying a mask M to an image X to obtain the edited image E:
E = X \odot M
This operation sets the colour of the masked-out regions (where M = 0) to black.
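A minimal sketch of this operation, with a hypothetical 4x4 single-channel image:

```python
import numpy as np

X = np.ones((4, 4))   # hypothetical image, all pixels "on"
M = np.zeros((4, 4))
M[1:3, 1:3] = 1.0     # mask keeps only the centre 2x2 region

E = X * M             # element-wise (Hadamard) product X \odot M
```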
The mask M can generate adversarial artifacts: very small perturbations that are imperceptible to people but can ruin the classifier's prediction. This phenomenon should be avoided.
There are a few ways to make the introduction of artifacts harder. One is to apply the mask so that the removed region is replaced rather than zeroed out:
E = X \odot M + A \odot (1 - M)
where A is an alternative image. A can be chosen as a blurred version of X, which makes it harder to generate high-frequency, high-evidence artifacts. However, blurring does not eliminate all of the existing evidence.
Another choice of A is a random colour combined with high-frequency noise. This makes E more unpredictable in regions where M is low.
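Both choices of A can be sketched as follows. The box blur here is a crude stand-in for the Gaussian blur one would actually use, and all shapes and names are hypothetical:

```python
import numpy as np

def box_blur(x, k=3):
    """Crude box blur: average over a k x k neighbourhood (edge-padded)."""
    p = k // 2
    xp = np.pad(x, p, mode='edge')
    out = np.empty_like(x, dtype=float)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def apply_mask(X, M, alternative='blur', rng=None):
    """E = X * M + A * (1 - M), with A either a blurred X or random noise."""
    if alternative == 'blur':
        A = box_blur(X)
    else:
        rng = rng or np.random.default_rng(0)
        A = rng.random(X.shape)  # random high-frequency fill
    return X * M + A * (1.0 - M)

X = np.arange(16.0).reshape(4, 4)
M = np.zeros((4, 4))
M[1:3, 1:3] = 1.0
E = apply_mask(X, M)
```

Wherever M = 1 the original pixels survive exactly; wherever M = 0 the classifier sees only the uninformative alternative image.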
However, adversarial artifacts can still occur. Thus it is necessary to encourage smoothness of M via a total variation (TV) penalty. We can also generate a smaller mask and resize it to the required size, since resizing acts as a smoothing mechanism.
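The TV penalty can be sketched as a sum of squared differences between neighbouring mask values (the exact form and weighting in the paper may differ):

```python
import numpy as np

def tv_penalty(M):
    """Total variation of a mask: sum of squared differences between
    vertically and horizontally adjacent values."""
    return (np.diff(M, axis=0) ** 2).sum() + (np.diff(M, axis=1) ** 2).sum()

smooth = np.ones((4, 4))                    # constant mask
noisy = np.indices((4, 4)).sum(axis=0) % 2  # checkerboard mask
```

A constant mask incurs zero penalty while a checkerboard is heavily penalised, which is exactly the behaviour that discourages high-frequency adversarial masks.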
Masking Model
The masking model depends on three main elements: a black-box classification model, the architecture of the masking model itself, and the masking loss function.
Any classification model can be used as the black-box element; examples are GoogLeNet and ResNet. The purpose of this model is to judge how good the masking model is at selecting the salient region of images.
The masking model has the following architecture: first, a convolutional downsampling stage goes from the pixel representation of an image to progressively more coarse-grained feature representations. Then, the coarsest feature representation is given to the feature filter, which applies a non-linearity to select which coarse-grained positions to keep or drop from the image. Note that the feature filter has access to the true label so that it can select the relevant portions of the image. Finally, the filtered features are upsampled to a higher resolution, using the features captured in the downsampling stage to fine-tune the saliency map (e.g., fit to edges and other details). This gives the final output mask of the masking model.
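The resolution flow of this pipeline can be illustrated at the array level. The 2x2 average pooling and nearest-neighbour upsampling below are simple stand-ins for the paper's convolutional downsampling and learned upsampling stages:

```python
import numpy as np

def downsample(x):
    """2x2 average pooling: halves each spatial dimension."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling: doubles each spatial dimension."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.arange(16.0).reshape(4, 4)  # hypothetical feature map
coarse = downsample(x)             # coarse-grained representation
mask = upsample(coarse)            # back to the input resolution
```

In the actual model, filtering happens at the coarse resolution and skip connections from the downsampling stage refine the upsampled mask.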
Third, the loss function is a trade-off that tries to achieve, at the same time: low total variation of the mask (so as to prevent brittle, high-frequency masks); low average mask value (so as to make the mask small); high output class probability (the masked image should be highly relevant); and low “inverse-masked” output probability (the dropped part should not be very relevant).
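The four terms can be sketched in a single function. The weights l_tv and l_area and the exact functional form are hypothetical; the paper tunes its own hyper-parameters and uses a slightly different parameterisation:

```python
import numpy as np

def masking_loss(M, p_masked, p_inverse, l_tv=1.0, l_area=1.0):
    """Trade-off described above:
    + TV(M)          -> smooth, non-brittle mask
    + mean(M)        -> small mask
    - log(p_masked)  -> masked image should keep the class evidence
    + p_inverse      -> inverse-masked image should lose it"""
    tv = (np.diff(M, axis=0) ** 2).sum() + (np.diff(M, axis=1) ** 2).sum()
    return l_tv * tv + l_area * M.mean() - np.log(p_masked + 1e-8) + p_inverse

M_smooth = np.zeros((4, 4))
M_smooth[1:3, 1:3] = 1.0                      # small, smooth mask
M_noisy = np.indices((4, 4)).sum(axis=0) % 2  # large, high-frequency mask
good = masking_loss(M_smooth, p_masked=0.9, p_inverse=0.05)
bad = masking_loss(M_noisy, p_masked=0.1, p_inverse=0.9)
```

A small, smooth mask that keeps the class evidence scores a lower loss than a noisy mask that destroys it, which is the behaviour the training procedure rewards.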
The model is trained to minimize this loss, using the masking model to learn the output mask, and the black-box model to provide classification probabilities. This ensures that a small and simple mask which selects salient regions for classification is learned.
To help prevent overfitting, two optimizations are important: first, sometimes a “fake”, incorrect label is provided to the masking model. This encourages the model to only select a salient region when it sees the true label, since when it sees the wrong label, the only way to reduce the loss function is to drop the entire image. For example, it should not select a dog from the image if the label is “cat”. Second, the mask is applied using a random choice of either blurred image, or random colour with Gaussian noise. This ensures that the image given to the classifier is “unpredictable” outside the masked region, reducing the possibility of adversarial issues and improving the quality of the saliency mask.