stat441w18/Saliency-based Sequential Image Attention with Multiset Prediction
Presented by
1. Alice Wang
2. Robert Huang
3. Yufeng Wang
4. Renato Ferreira
5. Being Fan
6. Xiaoni Lang
7. Xukun Liu
8. Handi Gao
Introduction
Current techniques achieve high performance in image classification; however, they often exhibit unexpected and unintuitive behaviour, where minor perturbations to an image can cause a complete misclassification. In addition, a classifier may label an image correctly while completely missing the object in question (for example, classifying an image containing a polar bear correctly because of the snowy setting).
To remedy this, we can either isolate the object and its surroundings and re-evaluate whether the classifier still performs adequately, or we can apply a saliency detection method to determine the focus of the classifier, and to understand how the classifier makes its decisions.
A commonly used method for saliency detection takes an image, iteratively removes sections of it, and evaluates the impact of each removal on the classification score. The smallest region whose removal has the biggest impact on the classification score makes up the saliency map. However, this iterative method is computationally intensive and thus time-consuming.
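As a rough illustration of why this kind of procedure is expensive, the sketch below implements a simplified sliding-window variant in plain Python: it masks one fixed-size patch at a time and records the resulting drop in score. The names `occlusion_saliency` and `toy_score` are hypothetical, the toy scoring function stands in for a real classifier, and the fixed patch grid is an assumption rather than the exact region search described above.

```python
def occlusion_saliency(image, score_fn, patch=2, fill=0.0):
    """Return a map holding, for each pixel, the score drop of its patch."""
    height, width = len(image), len(image[0])
    base = score_fn(image)
    saliency = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            cells = [(r, c)
                     for r in range(top, min(top + patch, height))
                     for c in range(left, min(left + patch, width))]
            occluded = [row[:] for row in image]  # copy; the input stays intact
            for r, c in cells:
                occluded[r][c] = fill
            drop = base - score_fn(occluded)  # one classifier call per patch
            for r, c in cells:
                saliency[r][c] = drop
    return saliency


def toy_score(image):
    """Hypothetical classifier: mean brightness of the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
saliency_map = occlusion_saliency(image, toy_score)
```

Every candidate patch costs one additional call to the classifier, so the running time grows with the number of regions tested.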
This paper proposes a new saliency detection method that uses a trained model to predict the saliency map from a single feed-forward pass. The resulting saliency detection is not only orders of magnitude faster; benchmarks against standard saliency detection methods also show that it produces higher-quality saliency masks and achieves better localization results.
Related Works
Numerous methods for saliency detection have been proposed since CNNs rose to prominence in 2012. One such method uses gradient calculations to find the region with the greatest gradient magnitude, under the assumption that such a region is a valid salient region. Others suggest the use of guided back-propagation (Springenberg et al., 2014), which takes into account the error signal for gradient adjustment, and excitation back-propagation (Zeiler et al., 2013), which uses non-negatively weighted connections to find meaningful saliency regions. However, these gradient-based methods, while fast, produce a sub-par saliency map that is difficult to interpret, adjust, or improve upon.
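The gradient-magnitude idea can be sketched in a few lines. The example below is a minimal illustration only: it approximates the gradient with central finite differences instead of back-propagation, and `gradient_saliency`, `linear_score`, and `WEIGHTS` are hypothetical names for a toy linear scorer, not part of any cited method.

```python
def gradient_saliency(image, score_fn, eps=1e-4):
    """Approximate |d score / d pixel| for every pixel of a 2-D image."""
    saliency = []
    for r, row in enumerate(image):
        out_row = []
        for c, value in enumerate(row):
            plus = [x[:] for x in image]
            minus = [x[:] for x in image]
            plus[r][c] = value + eps
            minus[r][c] = value - eps
            # Central difference stands in for the back-propagated gradient.
            out_row.append(abs(score_fn(plus) - score_fn(minus)) / (2 * eps))
        saliency.append(out_row)
    return saliency


WEIGHTS = [[0.0, 0.5], [-2.0, 0.1]]  # hypothetical linear "classifier"


def linear_score(image):
    """Toy class score: weighted sum of the pixel values."""
    return sum(w * p
               for w_row, p_row in zip(WEIGHTS, image)
               for w, p in zip(w_row, p_row))


grad_map = gradient_saliency([[0.3, 0.3], [0.3, 0.3]], linear_score)
```

For this linear scorer the map simply recovers the absolute weights, so the pixel with weight -2.0 is marked as most salient even though it lowers the score, which hints at why raw gradient maps can be hard to interpret.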
An alternative approach (Zhou et al., 2014) takes an image, then iteratively removes patches (setting their colour to the mean) such that the class score is preserved. This yields an excellent explanatory map that is easily human-interpretable. However, the iterative method is time-consuming and unsuited for a real-time saliency application.
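A greedy version of this score-preserving removal might look like the sketch below. It is an assumption-laden simplification: the fixed patch grid, the `keep_ratio` threshold, and the names `minimal_preserving_mask` and `corner_score` are all hypothetical and are not taken from Zhou et al.

```python
def minimal_preserving_mask(image, score_fn, patch=2, keep_ratio=0.9):
    """Greedily replace patches with the image mean while the score holds."""
    height, width = len(image), len(image[0])
    mean = sum(map(sum, image)) / (height * width)
    threshold = keep_ratio * score_fn(image)  # assumed tolerance on the score
    current = [row[:] for row in image]
    kept = [[True] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            cells = [(r, c)
                     for r in range(top, min(top + patch, height))
                     for c in range(left, min(left + patch, width))]
            trial = [row[:] for row in current]
            for r, c in cells:
                trial[r][c] = mean
            if score_fn(trial) >= threshold:  # removal preserved the score
                current = trial
                for r, c in cells:
                    kept[r][c] = False
    return kept


def corner_score(image):
    """Hypothetical classifier: mean brightness of the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [[1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
kept = minimal_preserving_mask(image, corner_score)
```

The pixels still marked `True` in `kept` are the ones the toy classifier needs, which is the kind of explanatory map described above.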
Another technique (Cao et al., 2015) that has been proposed is to selectively ignore network activations, such that the resulting subset of the activations provides us with the highest class score. This method, again, uses an iterative process that is unsuited for real-time application.
Fong and Vedaldi (2017) propose a similar solution, but instead of preserving the class score, they aim to maximally reduce the class score by removing minimal sections of the image. In doing so, they obtain a saliency map that is easily human-interpretable and model-agnostic (on account of performing the optimization in the image space and not on the model itself).
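The trade-off between lowering the class score and removing as little as possible can be illustrated with a toy mask optimization. The sketch below assumes a linear scorer and a plain sparsity penalty, minimizing score(x * m) + sparsity * sum(1 - m) over a mask m in [0, 1] by projected gradient descent. The function name `learn_deletion_mask`, the weights, and the hyperparameters are hypothetical, and the setup omits the perturbation and regularization choices of the original work.

```python
def learn_deletion_mask(pixels, weights, sparsity=0.1, lr=0.5, steps=50):
    """Learn a per-pixel mask (1 = keep, 0 = remove) for a linear scorer."""
    mask = [1.0] * len(pixels)
    for _ in range(steps):
        for i, (x, w) in enumerate(zip(pixels, weights)):
            # d/dm_i of [ sum_j w_j x_j m_j + sparsity * sum_j (1 - m_j) ]
            grad = w * x - sparsity
            # Gradient step, then project back onto the interval [0, 1].
            mask[i] = min(1.0, max(0.0, mask[i] - lr * grad))
    return mask


pixels = [1.0, 1.0, 1.0, 1.0]     # flattened toy image
weights = [0.9, 0.05, 0.0, 0.6]   # hypothetical classifier weights
mask = learn_deletion_mask(pixels, weights)
saliency = [1.0 - m for m in mask]
```

Only the pixels whose contribution to the score exceeds the sparsity penalty are removed, so the resulting saliency values highlight the first and last pixels of the toy image.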