# This Looks Like That: Deep Learning for Interpretable Image Recognition

Nouha Chatti

## Introduction

The motivation behind this paper is to introduce a deep network architecture that handles classification tasks by reasoning in a way humans can understand. The idea is to build a form of interpretability into the way images are processed. The proposed method dissects the input image into parts and compares them with prototypical parts of training images from a given class, hence the expression "this looks like that". This adds transparency to deep neural networks and lets the user follow the actual decision-making process. It is relevant to many critical problems where one must understand what led to a particular model output. Several fields already rely on this kind of case-based reasoning, notably medicine, where diagnosis from X-ray scans is based on comparing them with prototypical scans.

## Previous Work

Interpretability in deep neural networks has been a long-sought goal and has attracted growing attention recently. The opacity of neural networks, which leaves the user unaware of exactly how a model makes its predictions, has inspired many studies whose ultimate goal is transparency. Post hoc interpretability methods that analyze a trained CNN already exist. However, this type of analysis does not explain the reasoning the network actually follows when it classifies an image; the explanations are produced after the fact. There are also attention-based models that identify the parts of the input they are looking at, but without associating those parts with prototypes.

## Network Architecture

The figure below shows the ProtoPNet architecture. The first layers of ProtoPNet are standard convolutional layers $f$, whose parameters are denoted $w_{\mathrm{conv}}$. In this study they come from VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161, all pretrained on ImageNet, and are followed by two additional 1 × 1 convolutional layers. Next come a prototype layer $g_p$ and a fully connected layer $h$ with weight matrix $w_h$ and no bias. The output logits of $h$ are normalized with a softmax to give the prediction, whereas the convolutional layers use ReLU activations (except the last one, which uses a sigmoid).

The network takes an image $x$ and propagates it through the convolutional layers, which extract features $z = f(x)$ of shape H × W × D. It learns a set of prototypes $P$, each of shape 1 × 1 × D. The number of prototypes $m_k$ is fixed in advance for each class $k$ (10 per class in this study). Each prototype represents a pattern in a patch of the convolutional output, which corresponds to some prototypical image patch in the original pixel space.

Given $z = f(x)$, the prototype unit $g_{p_j}$ in the prototype layer $g_p$ computes the squared L2 distances between the $j$-th prototype $p_j$ and all patches of $z$ that have the same shape as $p_j$, and converts them into similarity scores. These scores form an activation map that indicates where the prototypical part is present in the image while preserving the spatial layout of $z$. The map can be upsampled to the size of the input image to obtain a heat map highlighting the parts most similar to the prototype. Each unit then applies global max pooling to its activation map, yielding a single score for how strongly the prototypical part is present anywhere in the image. These scores are multiplied by the weight matrix $w_h$ in $h$ to produce the output logits, as shown in the figure.

Figure 1 : Prototypical Part Network Architecture
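
The computation performed by the prototype layer can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `prototype_similarities`, the array shapes, and the value of `eps` are assumptions made for the example. The distance-to-similarity mapping $\log((d + 1)/(d + \epsilon))$ follows the form given in the paper, so a small distance gives a large score.

```python
import numpy as np

def prototype_similarities(z, prototypes, eps=1e-4):
    """Similarity of every 1x1xD patch of z to every prototype.

    z          : (H, W, D) convolutional output f(x)
    prototypes : (M, D), one row per 1x1xD prototype
    eps        : small constant (assumed value) that keeps the log finite
    Returns (sim_maps of shape (M, H, W), scores of shape (M,)).
    """
    # Squared L2 distance between each patch and each prototype.
    diff = z[None, :, :, :] - prototypes[:, None, None, :]  # (M, H, W, D)
    dist = np.sum(diff ** 2, axis=-1)                        # (M, H, W)
    # Monotonically decreasing map from distance to similarity.
    sim_maps = np.log((dist + 1.0) / (dist + eps))
    # Global max pooling: one score per prototype.
    scores = sim_maps.reshape(sim_maps.shape[0], -1).max(axis=1)
    return sim_maps, scores

# Toy example with random features (illustrative shapes only).
rng = np.random.default_rng(0)
z = rng.normal(size=(7, 7, 128))
prototypes = rng.normal(size=(20, 128))
prototypes[3] = z[2, 5]  # plant an exact match at spatial position (2, 5)
sim_maps, scores = prototype_similarities(z, prototypes)
best_location = np.unravel_index(sim_maps[3].argmax(), sim_maps[3].shape)
```

Here `sim_maps[j]` plays the role of the activation map that can be upsampled into a heat map, and `scores[j]` is the max-pooled value passed to the last layer.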

## Training Algorithm

The training of the network is divided into three stages: stochastic gradient descent (SGD) on all layers except the last one, projection of the prototypes, and convex optimization of the last layer.

In the first stage, the model identifies the patches that matter most for the classification task and learns to separate the prototypes of an image's true class from those of other classes. SGD optimizes the parameters of the convolutional layers and the prototypes of the prototype layer, while the weights of the fully connected layer are kept fixed so that the network learns to lower the predicted probability of a class when a part of an image from that class resembles a prototype of a different class.

The second stage makes the prototypes visualizable by associating each one with its most similar training image patch. Every prototype $p_j$ of class $k$ is updated as follows:

$$p_j \leftarrow \underset{z \in Z_j}{\operatorname{arg\,min}} \lVert z - p_j \rVert_2 \quad \text{where} \quad Z_j = \{ z : z \in \operatorname{patches}(f(x_i)) \;\; \forall i \ \text{s.t.} \ y_i = k \}$$

To visualize a prototype $p$, the patch of the training image $x$ that is selected is the one that $p$ activates most strongly in the activation map of $x$ produced by $p$.

In the last stage, convex optimization is applied to the last layer while the parameters of all previous layers are kept fixed, in order to improve accuracy by making the last layer sparse. In other words, it discourages negative reasoning of the form: an image belongs to a given class because it does not contain prototypical parts of another class.
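
The projection stage can be illustrated with a small NumPy sketch. It assumes that the convolutional features of all training images fit in memory as one array and that every prototype is 1 × 1 × D; the function `project_prototypes` and its argument names are hypothetical and not taken from the released code.

```python
import numpy as np

def project_prototypes(prototypes, proto_class, features, labels):
    """Replace each prototype by its nearest same-class training patch.

    prototypes  : (M, D) current prototypes
    proto_class : (M,) class index of each prototype
    features    : (N, H, W, D) conv outputs f(x_i) of the training images
    labels      : (N,) class index y_i of each training image
    """
    projected = prototypes.copy()
    depth = prototypes.shape[1]
    for j, k in enumerate(proto_class):
        # Z_j: all 1x1xD patches from training images of class k.
        patches = features[labels == k].reshape(-1, depth)
        if patches.shape[0] == 0:
            continue  # no image of this class: leave the prototype unchanged
        dist = np.sum((patches - prototypes[j]) ** 2, axis=1)
        projected[j] = patches[np.argmin(dist)]
    return projected

# Toy data: 6 images, 2 classes, 2 prototypes per class (illustrative only).
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 3, 3, 4))
labels = np.array([0, 0, 0, 1, 1, 1])
prototypes = rng.normal(size=(4, 4))
proto_class = np.array([0, 0, 1, 1])
projected = project_prototypes(prototypes, proto_class, features, labels)
```

After this step every row of `projected` is an actual patch of some training image of the right class, which is what allows a prototype to be shown as an image region.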

## Datasets

The datasets used in this study are CUB-200-2011, which contains images of 200 bird species, and the Stanford Cars dataset, which covers 196 car models. Data augmentation was applied to enlarge both training sets. Below are two examples of the classification process on images from the two datasets, showing how the decision is made.

### Examples of reasoning process

As shown in the figures below, given a test image, the model compares it with all learned prototypes (from all classes). To find evidence that the image belongs to a class $k$, it uses the prototypes of class $k$: for each prototype $p_i$ it computes a similarity score and locates the part of the image that $p_i$ activates most strongly. These scores are weighted and summed to give the class score used to classify the test image.

Figure 2 : Classifying an image of a specific car model
Figure 3 : Predicting the species of a bird
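
The final step of this reasoning process, weighting and summing the similarity scores, amounts to a matrix-vector product followed by a softmax. The sketch below uses made-up numbers: two classes with two prototypes each, and a weight matrix that connects each prototype only to its own class. These values are illustrative assumptions, not results from the paper.

```python
import numpy as np

def classify(scores, w_h):
    """scores: (M,) max-pooled similarity scores; w_h: (K, M) weights, no bias."""
    logits = w_h @ scores
    # Numerically stable softmax over the class logits.
    shifted = np.exp(logits - logits.max())
    probs = shifted / shifted.sum()
    return logits, probs

# Hypothetical scores: prototypes 0-1 belong to class 0, prototypes 2-3 to class 1.
scores = np.array([4.0, 2.5, 0.5, 0.2])
w_h = np.array([[1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0]])
logits, probs = classify(scores, w_h)

# Per-prototype contribution to each class logit (rows sum to the logits).
contributions = w_h * scores
predicted = int(np.argmax(logits))
```

Each row of `contributions` lists how much every prototype adds to that class's logit, which is the kind of breakdown shown in the reasoning figures.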

## Results

The results of ProtoPNet on the bird and car datasets are compared with the baseline models and with attention-based deep models trained on the same datasets. ProtoPNet's accuracy is comparable to that of the non-interpretable baselines, as shown below.

Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on the bird species dataset
Figure 5 : Accuracy comparison of ProtoPNet with baseline models on the car dataset

Another experiment, combining several ProtoPNet models, shows improved accuracy while preserving the transparency of the decision-making process.

## Conclusion

The aim of ProtoPNet is to bring interpretability to neural networks. It makes the process of classifying images clearer: the network dissects an image to find prototypical parts, and the class prediction is based on comparing parts of the image with the learned prototypes of each class. One limitation of the network is that the number of prototypes depends on the domain and has to be set beforehand.

## Source code

The code for this paper is available at [1]