Patch Based Convolutional Neural Network for Whole Slide Tissue Image Classification

Presented by

Cassandra Wong, Anastasiia Livochka, Maryam Yalsavar, David Evans

Introduction

Figure 1: Whole Slide Tissue Image of a grade IV tumor. Features indicating subtypes are visually evident. In this case, patches framed in red are discriminative for the diagnosis as typical visual features of grade IV tumor are present. Patches framed in blue are non-discriminative for the final diagnosis as they only contain visual features from lower grade tumors. By nature of the task discriminative patches are spread throughout the image and appear at multiple locations.

Although convolutional neural networks (CNNs) are well known for their success in image classification, applying them directly to whole slide tissue images for cancer classification is computationally infeasible because of the extremely high resolution of these images. The paper therefore argues that a patch-level CNN can outperform an image-level one, and it addresses two main challenges of patch-level classification: aggregating the patch-level classification results, and handling non-discriminative patches. To address these challenges, the authors train a decision fusion model and propose an Expectation-Maximization (EM) based method for locating the discriminative patches, respectively. Finally, the authors validate their claims by applying the model to the classification of glioma and non-small-cell lung carcinoma cases.

Previous Work

The proposed two-level model, a patch-level CNN followed by a trained decision fusion model, was motivated by the following prior results and their limitations:

• The majority of Whole Slide Tissue Image classification methods focus on classifying or extracting features from patches [1, 2, 3]. These methods excel when abundant patch labels are provided [1, 2], allowing patch-level supervised classifiers to learn the variety of cancer subtypes. However, labeling patches requires specialized annotators, which is a prohibitive task at a large scale.
• Multiple Instance Learning (MIL) based classification [4, 5] uses unlabeled patches to predict the label of a bag. For a binary classification problem, the main assumption (Standard Multi-Instance assumption, SMI) states that a bag is positive if and only if it contains at least one positive instance. Some authors combine MIL with neural networks [6, 7] and model SMI by max-pooling. This approach is inefficient because, with max-pooling, only the single instance with the maximum score is trained in each training iteration on the entire bag.
• Other works apply average pooling (voting). However, it has been shown that many decision fusion models can outperform simple voting [8, 9]. The choice of decision fusion function depends heavily on the domain.
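The inefficiency of max-pooling noted above can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, and the "gradient" is simply the derivative of the pooled bag score with respect to each instance score. Under max-pooling only the top-scoring instance receives a training signal, whereas average pooling spreads it over all instances.

```python
import numpy as np

def bag_score_max(instance_scores):
    """SMI via max-pooling: the bag score is the highest instance score."""
    scores = np.asarray(instance_scores, dtype=float)
    return float(scores.max())

def max_pool_gradient(instance_scores):
    """Derivative of the max-pooled bag score w.r.t. each instance score.
    Only the arg-max instance gets a non-zero entry."""
    scores = np.asarray(instance_scores, dtype=float)
    grad = np.zeros_like(scores)
    grad[int(scores.argmax())] = 1.0
    return grad

def mean_pool_gradient(instance_scores):
    """Derivative of the average-pooled bag score: every instance contributes equally."""
    scores = np.asarray(instance_scores, dtype=float)
    return np.full_like(scores, 1.0 / scores.size)

scores = [0.1, 0.8, 0.3, 0.2]          # toy instance scores for one bag
print(bag_score_max(scores))
print(max_pool_gradient(scores))
print(mean_pool_gradient(scores))
```

In this toy bag, three of the four instances receive no update under max-pooling, which is the behaviour the MIL critique refers to.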

EM-based Method with CNN

Figure 2. Top: A CNN is trained on patches and EM-based method iteratively eliminates non-discriminative patches. Bottom: An image-level decision fusion model is trained on histograms of patch-level predictions to predict the image-level label

The high-resolution image is modelled as a bag, and the patches extracted from it are the instances that form that bag. Ground truth labels are provided for the bag only, so the label of each instance (discriminative or not) is modelled as a hidden binary variable. The hidden binary variables are estimated by the Expectation-Maximization algorithm. A summary of the proposed approach can be found in Figure 2. Note that this approach works with any discriminative model.

In this paper, $X = \{X_1, \dots, X_N\}$ denotes a dataset containing $N$ bags. A bag $X_i = \{X_{i,1}, X_{i,2}, \dots, X_{i,N_i}\}$ consists of $N_i$ patches (instances), and $X_{i,j} = \langle x_{i,j}, y_i \rangle$ denotes the j-th instance and its label in the i-th bag. We assume the bags are i.i.d. (independent and identically distributed), and that $X$ and the associated hidden labels $H$ are generated by the following model:

$$P(X, H) = \prod_{i = 1}^{N} P(X_{i,1}, \dots, X_{i,N_i} \mid H_i) P(H_i) \quad \quad \quad \quad (1)$$

Here $H_i = \{H_{i,1}, \dots, H_{i,N_i}\}$ denotes the set of hidden variables for the instances in the bag $X_i$, and $H_{i,j}$ indicates whether the patch $X_{i,j}$ is discriminative for $y_i$ (a patch is discriminative if the estimated label of the instance coincides with the label of the whole bag). The authors assume that $X_{i,j}$ is independent of the hidden labels of all other instances in the i-th bag, so $(1)$ can be simplified to:

$$P(X, H) = \prod_{i = 1}^{N} \prod_{j = 1}^{N_i} P(X_{i,j} \mid H_{i,j}) P(H_{i,j}) \quad \quad (2)$$

The authors propose to estimate the hidden labels $H$ of the individual patches by maximizing the data likelihood $P(X)$ using Expectation-Maximization. One iteration of EM alternates between the E step (Expectation), where the hidden variables $H_{i,j}$ are estimated, and the M step (Maximization), where the parameters of the model $(2)$ are updated such that the data likelihood $P(X)$ is maximized. Let $D$ denote the set of discriminative instances. The procedure starts by assuming that all instances are in $D$ (all $H_{i,j} = 1$).
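The alternation between the two steps can be sketched on toy data. This is only an illustrative skeleton under loud assumptions: a nearest-centroid classifier stands in for the CNN, the E step keeps instances whose probability of the bag label is above a fixed per-bag percentile, and all names (`fit_centroids`, `predict_proba`, `em_select`, `drop_percentile`) are hypothetical rather than taken from the paper.

```python
import numpy as np

def fit_centroids(x, y, n_classes):
    # M-step stand-in: "train" the model on the instances currently in D.
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(x, centroids):
    # Softmax over negative squared distances to each class centroid.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 - (-d2).max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def em_select(x, bag_label, bag_id, n_classes, n_iter=3, drop_percentile=20):
    """x: (n, d) instance features; bag_label: (n,) bag label copied to each
    instance; bag_id: (n,) index of the bag each instance belongs to."""
    in_d = np.ones(len(x), dtype=bool)          # start with every H_ij = 1
    for _ in range(n_iter):
        # M step: refit the model on the discriminative set D only.
        centroids = fit_centroids(x[in_d], bag_label[in_d], n_classes)
        # E step: re-estimate which instances look discriminative.
        p_true = predict_proba(x, centroids)[np.arange(len(x)), bag_label]
        for b in np.unique(bag_id):
            members = bag_id == b
            cutoff = np.percentile(p_true[members], drop_percentile)
            in_d[members] = p_true[members] >= cutoff
    return in_d

# Two toy bags: four clearly class-specific instances each, plus one
# ambiguous instance near the origin.
x = np.array([[-2.0, 0.0], [-2.2, 0.1], [-1.8, -0.1], [-2.1, 0.0], [0.3, 0.0],
              [2.0, 0.0], [2.2, 0.1], [1.8, -0.1], [2.1, 0.0], [-0.3, 0.0]])
bag_id = np.array([0] * 5 + [1] * 5)
bag_label = bag_id.copy()                       # bag 0 is class 0, bag 1 is class 1
in_d = em_select(x, bag_label, bag_id, n_classes=2)
print(in_d)
```

On this toy input the two ambiguous instances near the origin are dropped from $D$ while the class-specific ones are kept. The thresholding rule actually used by the authors is described in the next section.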

Discriminative Patch Selection

Discriminative patches are those for which $P(H_{i,j} \mid X)$ is greater than a threshold $T_{i,j}$. This section explains how the authors estimate $P(H_{i,j} \mid X)$ and choose the threshold. $P(H_{i,j} \mid X)$ is correlated with $P(y_i \mid x_{i,j}; \theta)$: patches with a smaller $P(y_i \mid x_{i,j}; \theta)$ have a smaller probability of being considered discriminative. Relying on this quantity alone can discard patches that lie close to the decision boundary even though they carry valuable information. The authors therefore design the estimate of $P(H_{i,j} \mid X)$ to be more robust in this respect. First, $P(y_i \mid x_{i,j}; \theta)$ is computed by averaging the predictions of two CNNs that are trained in parallel at two different scales. Then $P(y_i \mid x_{i,j}; \theta)$ is denoised with a Gaussian kernel to obtain $P(H_{i,j} \mid X)$. According to the paper's experimental results, this way of estimating $P(H_{i,j} \mid X)$ is more robust.

To compute the threshold, two sets are introduced: $S_i$, the set of $P(H_{i,j} \mid X)$ values for all $x_{i,j}$ of the i-th image, and $E_c$, the set of $P(H_{i,j} \mid X)$ values for all $x_{i,j}$ of the c-th class. $T_{i,j}$ is then calculated from the image-level threshold $H_i$ and the class-level threshold $R_c$, where $c$ is the class of the i-th image:

$$T_{i,j} = \min(H_i, R_c)$$

where $H_i$ is the $P_1$-th percentile of $S_i$ and $R_c$ is the $P_2$-th percentile of $E_c$. Note that $H_i$ here denotes a threshold, not the set of hidden variables of the i-th bag.
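The threshold rule can be written out as a short sketch. It assumes the smoothed values of $P(H_{i,j} \mid X)$ are already available as a flat array (the two-scale averaging and Gaussian denoising are not shown), and the function name, argument names, and percentile values are hypothetical choices for illustration only.

```python
import numpy as np

def select_discriminative(p_h, image_id, class_id, p1, p2):
    """p_h: (n,) estimates of P(H_ij | X) for all patches.
    image_id, class_id: (n,) image index and image class of each patch.
    Returns a boolean mask (True = discriminative) and the thresholds T_ij."""
    p_h = np.asarray(p_h, dtype=float)
    image_id = np.asarray(image_id)
    class_id = np.asarray(class_id)
    # Image-level thresholds: P1-th percentile of S_i.
    h = {i: np.percentile(p_h[image_id == i], p1) for i in np.unique(image_id)}
    # Class-level thresholds: P2-th percentile of E_c.
    r = {c: np.percentile(p_h[class_id == c], p2) for c in np.unique(class_id)}
    # T_ij = min(image-level threshold, class-level threshold).
    t = np.array([min(h[i], r[c]) for i, c in zip(image_id, class_id)])
    return p_h > t, t

# Two toy images of the same class, four patches each.
p_h = [0.9, 0.8, 0.25, 0.1, 0.5, 0.4, 0.3, 0.15]
image_id = [0, 0, 0, 0, 1, 1, 1, 1]
class_id = [0] * 8
mask, t = select_discriminative(p_h, image_id, class_id, p1=50, p2=25)
print(mask)
print(t)
```

Because the minimum of the two thresholds is taken, a patch is kept if it clears either the image-level or the class-level threshold. In the toy input the class-level threshold is the lower one for both images, so only the lowest-scoring patch of each image is rejected.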

Image-level decision fusion model

For each image, the patch-level predictions of the CNNs introduced above are combined into a class histogram, obtained by summing the class probabilities that each CNN predicts for the image's patches. These histograms are then fed to a linear multi-class logistic regression model or an SVM with a Radial Basis Function (RBF) kernel, which predicts the image-level label. The authors give three reasons for fusing the patch-level predictions. First, an image should not be labeled on the basis of a single patch-level prediction. Second, the full set of patches of an image can identify the correct label even when some of the patches are non-discriminative. Third, the patch-level model can be biased, and a fusion model can alleviate this bias.
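A minimal sketch of the fusion step is given below. It makes several assumptions that are not in the paper: the histograms are normalised so that images with different patch counts are comparable, the logistic regression is a plain NumPy gradient-descent implementation rather than a library model, the SVM variant is not shown, and the toy patch probabilities are invented for illustration.

```python
import numpy as np

def class_histogram(patch_probs, normalize=True):
    """Sum per-patch class probabilities (n_patches, n_classes) into one
    histogram per image. Normalisation is an assumption of this sketch."""
    hist = np.asarray(patch_probs, dtype=float).sum(axis=0)
    return hist / hist.sum() if normalize else hist

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(h, y, n_classes, lr=0.5, n_steps=500):
    """Multi-class logistic regression trained by batch gradient descent."""
    n, d = h.shape
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(n_steps):
        err = softmax(h @ w + b) - onehot       # gradient of cross-entropy w.r.t. logits
        w -= lr * (h.T @ err) / n
        b -= lr * err.mean(axis=0)
    return w, b

def predict(h, w, b):
    return softmax(h @ w + b).argmax(axis=1)

# Toy patch-level predictions for four images (two per class).
images = [
    [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]],       # class 0, one misleading patch
    [[0.7, 0.3], [0.6, 0.4]],                   # class 0
    [[0.2, 0.8], [0.3, 0.7], [0.6, 0.4]],       # class 1, one misleading patch
    [[0.1, 0.9], [0.5, 0.5]],                   # class 1
]
labels = np.array([0, 0, 1, 1])
hists = np.stack([class_histogram(p) for p in images])
w, b = fit_softmax_regression(hists, labels, n_classes=2)
print(predict(hists, w, b))
```

The first and third toy images each contain a patch that votes for the wrong class, yet the histogram as a whole still points to the correct label, which is the second of the three reasons listed above.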

Experiments

The authors evaluate their model on two WSI classification problems: classifying glioma cases and Non-Small-Cell Lung Carcinoma (NSCLC) cases into their respective subtypes. A typical WSI in the dataset has a resolution of 100K by 50K pixels. Both glioma and NSCLC are common cancers and major causes of cancer-related death across age groups, and recognizing their subtypes is essential for providing targeted therapies.
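To give a sense of scale for the quoted resolution, the following back-of-the-envelope sketch computes the raw size of one slide and the number of patches it would yield. The 8-bit RGB encoding, the non-overlapping tiling, and the 500-pixel patch size are assumptions made here for illustration; the summary above does not state them.

```python
def wsi_footprint(width_px, height_px, patch_px, channels=3, bytes_per_channel=1):
    """Return (pixel count, uncompressed size in bytes, number of
    non-overlapping square patches) for one whole slide image."""
    n_pixels = width_px * height_px
    raw_bytes = n_pixels * channels * bytes_per_channel
    n_patches = (width_px // patch_px) * (height_px // patch_px)
    return n_pixels, raw_bytes, n_patches

# 100K x 50K pixels as quoted above; patch_px=500 is a hypothetical choice.
n_pixels, raw_bytes, n_patches = wsi_footprint(100_000, 50_000, patch_px=500)
print(n_pixels)            # 5000000000 pixels
print(raw_bytes / 1e9)     # 15.0 (GB, uncompressed 8-bit RGB)
print(n_patches)           # 20000 patches
```

Under these assumptions a single slide holds 5 billion pixels, far more than an image-level CNN can take as input, and each bag contains on the order of tens of thousands of instances.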

References

[1] A. Cruz-Roa, A. Basavanhally, F. Gonzalez, H. Gilmore, M. Feldman, S. Ganesan, N. Shih, J. Tomaszewski, and A. Madabhushi. Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. In Medical Imaging, 2014.
[2] H. S. Mousavi, V. Monga, G. Rao, and A. U. Rao. Automated discrimination of lower and higher grade gliomas based on histopathological image analysis. JPI, 2015.
[3] Y. Xu, Z. Jia, Y. Ai, F. Zhang, M. Lai, E. I. Chang, et al. Deep convolutional activation features for large scale brain tumor histopathology image classification and segmentation. In ICASSP, 2015.
[4] E. Cosatto, P.-F. Laquerre, C. Malon, H.-P. Graf, A. Saito, T. Kiyuna, A. Marugame, and K. Kamijo. Automated gastric cancer diagnosis on H&E-stained sections; training a classifier on a large scale with multiple instance machine learning. In Medical Imaging, 2013.
[5] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, E. I. Chang, et al. Deep learning of feature representation with multiple instance learning for medical image analysis. In ICASSP, 2014.
[6] J. Ramon and L. De Raedt. Multi instance neural networks. 2000.
[7] Z.-H. Zhou and M.-L. Zhang. Neural networks for multi-instance learning. In ICIIT, 2002.
[8] S. Poria, E. Cambria, and A. Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis.
[9] A. Seff, L. Lu, K. M. Cherry, H. R. Roth, J. Liu, S. Wang, J. Hoffman, E. B. Turkbey, and R. M. Summers. 2D view aggregation for lymph node detection using a shallow hierarchy of linear classifiers. In MICCAI, 2014.