Difference between revisions of "This Looks Like That: Deep Learning for Interpretable Image Recognition"
Revision as of 20:18, 5 December 2020
The motivation behind this paper is to introduce a new deep network architecture that performs classification by reasoning in a way humans can understand. The idea is to build a form of interpretability into the way images are processed. The proposed method dissects an input image into parts and compares them to prototypical parts of training images from a given class, hence the expression "this looks like that". This makes the network transparent: the user can follow the actual decision-making process. Such transparency matters in problems where one must understand what led to a particular model output. Many fields already rely on this kind of case-based reasoning, especially medicine, where an X-ray scan is diagnosed by comparing it with prototypical scans.
Interpretability in deep neural networks is a long-sought goal that has attracted growing attention in recent years. The opacity of neural networks, which leaves the user unaware of how a model arrives at its predictions, has inspired many studies aimed at greater transparency. Post hoc interpretability methods already exist that analyze a trained CNN. However, this type of analysis is produced after training and does not explain the reasoning process the network actually follows when it classifies an image. There are also attention-based models that identify the parts of the input they are looking at, but they do not associate those parts with prototypes.
The figure below represents the ProtoPNet architecture. The first layers of ProtoPNet are standard convolutional layers f, whose parameters are denoted w_conv. The convolutional layers used in this study come from VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161, all pretrained on ImageNet, and are followed by two additional 1 × 1 convolutional layers. Next come a prototype layer g_p and a fully connected layer h with weight matrix w_h and no bias. The output of h is passed through a softmax to produce the prediction, whereas the preceding layers use ReLU activations.

The network takes an image x and propagates it through the convolutional layers, whose output f(x) has shape H × W × D, to extract features. It also learns prototypes P, each of shape 1 × 1 × D. The number of prototypes m_k is fixed in advance for each class k (10 per class in this study). Each prototype represents a pattern in a patch of the convolutional output, which corresponds to a prototypical image patch in the original pixel space. Given z = f(x), the prototype unit g_pj in the prototype layer g_p computes the squared L2 distance between the j-th prototype p_j and every patch of z that has the same shape as p_j, then converts these distances into similarity scores. The resulting map of scores indicates where, and how strongly, the prototypical part appears in the image, and it preserves the spatial layout of z. The map can be upsampled to the size of the input image to obtain a heat map highlighting the regions most similar to the prototype. Each unit then applies global max pooling to its map, yielding a single score for how strongly the prototypical part is present in some patch of the input. These scores are multiplied by the weight matrix w_h in h to produce the output logits, as shown in the figure.
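The computation performed by a single prototype unit can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the names `squared_l2` and `prototype_unit` are hypothetical, the feature map is a toy nested list, and the log-ratio used to turn a distance into a similarity (with its small constant `eps`) is an assumption taken from the original paper, since this summary only states that distances are converted into similarity scores.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototype_unit(z, prototype, eps=1e-4):
    """Similarity map and max-pooled score for one 1 x 1 x D prototype.

    z is an H x W grid (nested lists) of D-dimensional feature vectors.
    The log-ratio similarity and eps are assumptions, not from this summary.
    """
    sim_map = []
    for row in z:
        sim_row = []
        for patch in row:
            d = squared_l2(patch, prototype)
            # Small distance gives large similarity; eps keeps the ratio finite.
            sim_row.append(math.log((d + 1.0) / (d + eps)))
        sim_map.append(sim_row)
    score = max(max(row) for row in sim_map)  # global max pooling over the map
    return sim_map, score

# Toy 2 x 2 feature map with D = 2 (made-up values).
z = [[[0.0, 0.0], [1.0, 0.0]],
     [[0.0, 1.0], [0.5, 0.5]]]
sim_map, score = prototype_unit(z, prototype=[1.0, 0.0])
```

Here `sim_map` keeps the H × W layout of z, so it is the quantity that would be upsampled into a heat map, while `score` is the single value passed on to the fully connected layer.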
Training is divided into three stages: stochastic gradient descent (SGD) on all layers before the last one, projection of the prototypes, and convex optimization of the last layer. In the first stage, the model learns to identify the patches that matter most for classification and to separate the prototypes of an image's true class from those of other classes. SGD optimizes the parameters of the convolutional layers and the prototypes of the prototype layer while the weights of the fully connected layer are held fixed, so that the network learns to decrease the predicted probability of a class when a part of the image resembles a prototype from a different class. The second stage makes each prototype visualizable: every prototype of a class k is replaced by the most similar patch found in the training images of class k.
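The projection stage can be illustrated with a short sketch. It assumes that the convolutional outputs of the training images of one class have already been computed and are available as nested lists; the function name `project_prototype` and the toy values are hypothetical and only serve to show the nearest-patch search.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def project_prototype(prototype, class_feature_maps):
    """Return a copy of the patch nearest to `prototype`.

    class_feature_maps holds one H x W grid of D-dimensional vectors per
    training image of the prototype's class (an assumed data layout).
    """
    patches = (patch for z in class_feature_maps for row in z for patch in row)
    nearest = min(patches, key=lambda patch: squared_l2(patch, prototype))
    return list(nearest)

# Two toy 1 x 2 feature maps from the same class (made-up values).
class_feature_maps = [
    [[[0.0, 0.0], [1.0, 0.0]]],
    [[[0.8, 0.3], [0.2, 0.9]]],
]
projected = project_prototype([0.9, 0.1], class_feature_maps)
```

After this step the prototype coincides with an actual training patch, which is what allows it to be displayed as a region of a real image.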
The study uses two datasets: CUB-200-2011, which contains images of 200 bird species, and Stanford Cars, which covers 196 car models. Data augmentation was applied to enlarge both training sets. The following are two examples of how images from each dataset are classified and how the decision is reached.
Examples of reasoning process
As shown in the figure below, given a test image, the model compares it with all learned prototypes from all classes. To find evidence that the image belongs to a class k, it uses the prototypes of class k: for each prototype p_i, the comparison yields a similarity score and locates the part of the image most strongly activated by p_i. The scores are then weighted and summed to give the total evidence for class k, which is used to classify the test image.
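The weighted sum described above can be sketched as follows. The function name `classify`, the similarity scores, and the weights are all illustrative assumptions: two classes with two prototypes each, a positive weight linking a prototype to its own class, and a negative weight linking it to the other class.

```python
import math

def classify(scores, w_h):
    """Combine prototype similarity scores into class logits and probabilities.

    scores[j] is the max-pooled similarity of prototype j.
    w_h[k][j] is the weight linking prototype j to class k (no bias term).
    """
    logits = [sum(w * s for w, s in zip(row, scores)) for row in w_h]
    # Numerically stable softmax over the logits.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    predicted = max(range(len(logits)), key=logits.__getitem__)
    return logits, probs, predicted

# Prototypes 0 and 1 belong to class 0, prototypes 2 and 3 to class 1.
scores = [4.0, 3.0, 0.5, 0.2]          # made-up similarity scores
w_h = [[1.0, 1.0, -0.5, -0.5],         # made-up weights for class 0
       [-0.5, -0.5, 1.0, 1.0]]         # made-up weights for class 1
logits, probs, predicted = classify(scores, w_h)
```

In this toy case the strong similarities to the two class 0 prototypes dominate the sum, so class 0 receives the larger logit and is returned as the prediction.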