# Summary for survey of neural network-based cancer prediction models from microarray data

## Presented by

Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou

## Introduction

Microarray technology is widely used to analyze genetic diseases because it helps researchers detect genetic information rapidly. In cancer studies, researchers use this technology to compare normal and cancerous tissues and gain a better understanding of the pathology of cancer. However, the high dimensionality of gene expression data can hurt both the accuracy and the computation time of a cancer prediction model. To cope with this problem, we need feature selection or feature creation methods. Feature selection methods reduce the dimensionality of the dataset by selecting only a subset of the key discriminating features as input to the model. In contrast, feature creation methods build an entirely new set of lower-dimensional features meant to represent the original higher-dimensional features. Neural networks are among the most powerful methods in machine learning. In this paper, the authors review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering of gene expressions.

## Background

**Neural Network**

Neural networks are often used to solve complex non-linear problems. A neural network is a computational model consisting of a large number of neurons connected to each other by weighted links. Each neuron is associated with an activation function, for example a sigmoid or rectified linear function. To train the network, the inputs are fed forward and the activation value is calculated at every neuron. The difference between the output of the network and the desired output is called the error.
Backpropagation is one of the most commonly used algorithms for training neural networks. It optimizes the objective function by propagating the generated error back through the network to adjust the weights.
The models reviewed in the following sections use this algorithm, but with different network architectures and different numbers of neurons, to learn gene expression features.
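
The feed-forward and backpropagation steps described above can be sketched on a toy problem. This is a minimal illustration, not code from the surveyed paper: the network size, the learning rate, the squared-error objective, and the four made-up samples (whose last input is a constant 1.0 acting as a bias) are all assumptions chosen to keep the example small.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Feed one sample forward; return hidden activations and the scalar output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)))
    return hidden, output

def train_epoch(samples, targets, W1, W2, lr=0.5):
    """One pass of per-sample gradient descent; returns the mean squared error."""
    total = 0.0
    for x, y in zip(samples, targets):
        hidden, output = forward(x, W1, W2)
        error = output - y                      # network output minus desired output
        total += error ** 2
        # Propagate the error back: chain rule through the output sigmoid,
        # then through each hidden neuron (computed before any weight changes).
        delta_out = error * output * (1.0 - output)
        delta_hidden = [delta_out * W2[j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(len(hidden))]
        for j in range(len(hidden)):
            W2[j] -= lr * delta_out * hidden[j]
            for i in range(len(x)):
                W1[j][i] -= lr * delta_hidden[j] * x[i]
    return total / len(samples)

random.seed(0)
# Hypothetical data: the label is 1 when the first feature is high.
samples = [[0.9, 0.1, 1.0], [0.8, 0.7, 1.0], [0.2, 0.9, 1.0], [0.1, 0.3, 1.0]]
targets = [1.0, 1.0, 0.0, 0.0]
W1 = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
W2 = [random.uniform(-0.5, 0.5) for _ in range(2)]
history = [train_epoch(samples, targets, W1, W2) for _ in range(500)]
print(history[0], history[-1])
```

The error recorded in `history` should shrink over the epochs as the weights are adjusted.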

**Cancer prediction models**

Cancer prediction models often combine more than one method to achieve high prediction accuracy and a more accurate prognosis, and they also aim to reduce costs for patients.

High dimensionality and spatial structure are the two main factors that can affect the accuracy of cancer prediction models, because they add irrelevant, noisy features to the selected models. There are three common ways to measure the accuracy of a model.

The first is the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate as the decision threshold varies. To test its validity, it should be considered together with its confidence interval. Usually, a model is considered good when the area under its ROC curve (AUC) is greater than 0.7. The second is the concordance index (CI), which estimates the probability that predicted and observed survival are concordant. A value of 0.5 corresponds to random prediction, and the closer the value is to 1, the better the model is. The third is the Brier score, which measures the average squared difference between the observed and the estimated survival probability over a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.
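
As a rough illustration of the three measures, the sketch below computes each one on small made-up inputs. The function names and the data are hypothetical. The concordance index follows the usual pairwise definition, in which a pair is comparable only when the patient with the shorter time had an observed event. The Brier score here ignores censoring, which is a simplifying assumption.

```python
def auc(labels, scores):
    """Area under the ROC curve: fraction of (positive, negative) pairs
    in which the positive sample gets the higher score (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def concordance_index(times, events, risks):
    """Share of comparable patient pairs whose predicted risk ordering
    agrees with the observed survival ordering."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            # Comparable only if patient i has the shorter time and an observed event.
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

def brier_score(outcomes, probabilities):
    """Mean squared difference between observed outcomes (0 or 1)
    and predicted probabilities."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
print(brier_score(labels, scores))
print(concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1],
                        risks=[0.9, 0.7, 0.4, 0.1]))
```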

## Neural network-based cancer prediction models

The authors performed an extensive search for work on neural network-based cancer prediction using Google Scholar and other electronic databases, namely PubMed and Scopus, with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”. The chosen papers cover cancer classification, discovery, survivability prediction, and statistical analysis models. Figure 1 shows the number of citations of the chosen papers, grouped into filtering, predictive, and clustering methods.

**Datasets and preprocessing**

Most studies investigating automatic cancer prediction and clustering used datasets such as TCGA, UCI, NCBI Gene Expression Omnibus, and the Kentridge biomedical databases. Several techniques are used to preprocess the datasets, including removing genes that have zero expression across all samples, normalization, filtering with p value > [math]10^{-05}[/math] to remove unwanted technical variation, and [math]\log_2[/math] transformations. Statistical methods and neural networks were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principal Component Analysis (PCA) can also be used as an initial preprocessing step to extract features from the datasets. PCA linearly transforms the features into a lower-dimensional space, but it does not capture the complex relationships between them. However, simply removing the genes that were not measured in the other datasets could not overcome the class imbalance problem. To address this, one study used the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic minority class samples, which may lead to a sparse matrix problem. Some studies also applied clustering to label the data by grouping the samples into high-risk and low-risk groups, among others.
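
Two of the preprocessing steps above, dropping genes with zero expression in every sample and applying a [math]\log_2[/math] transformation, can be sketched as follows. The matrix is made up, the pseudocount of 1 added before the logarithm is an assumption (it avoids taking the logarithm of zero), and per-gene mean-centering stands in for whatever normalization a given study actually used. PCA and the p-value filter are left out.

```python
import math

def preprocess(expression):
    """expression: list of samples, each a list of non-negative expression values.
    Returns the indices of the genes kept and the transformed matrix."""
    n_genes = len(expression[0])
    # Keep a gene only if at least one sample expresses it.
    keep = [g for g in range(n_genes)
            if any(sample[g] != 0 for sample in expression)]
    # log2 transform with a pseudocount of 1 (an assumption, see above).
    logged = [[math.log2(sample[g] + 1.0) for g in keep] for sample in expression]
    # Center each remaining gene at zero mean across samples.
    means = [sum(column) / len(column) for column in zip(*logged)]
    centered = [[value - m for value, m in zip(row, means)] for row in logged]
    return keep, centered

# Hypothetical 3 samples x 4 genes; genes 0 and 2 are never expressed.
expression = [[0, 3, 0, 7],
              [0, 1, 0, 15],
              [0, 0, 0, 31]]
keep, centered = preprocess(expression)
print(keep)
print(centered)
```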

The following table presents, for each reference considered, the dataset used, the normalization technique applied, the cancer type, and the dimensionality of the dataset.

**Neural network architecture**

Most recent studies use filtering, predicting, and clustering methods for cancer prediction. For filtering, the resulting features are used with statistical methods or with machine learning classification and clustering tools such as decision trees, K Nearest Neighbor, and Self Organizing Maps (SOM), as figure 2 indicates.

All the neurons in a neural network work together as feature detectors to learn features from the input. The categorization into filtering, predicting, and clustering methods is based on the overall role that the neural network plays in the cancer prediction method. Filtering methods are trained to remove noise from the input and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant for prediction, so their objective functions measure how accurately the network predicts the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.

**Building neural networks-based approaches for gene expression prediction**

According to the survey, filtering methods generate representative codes with dimensionality M smaller than or equal to N, where N is the dimensionality of the input. Other machine learning algorithms such as naïve Bayes or k-means can be used together with the filtering.
Predictive neural networks are supervised and aim for the best classification accuracy, whereas clustering methods are unsupervised and group similar samples or genes together.
The goal of training a prediction network is to enhance its classification capability, and the goal of training a clustering network is to find the optimal grouping for a new test set with unknown labels.

**Neural network filters for cancer prediction**

Autoencoders are increasingly used as a preprocessing step before classification, clustering, and statistical analysis to extract generic genomic features. An autoencoder is composed of an encoder and a decoder. The encoder learns the mapping from the high-dimensional unlabeled input I(x) to the low-dimensional representations in the middle layer(s), and the decoder learns the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can use the Root Mean Squared Error (RMSE) or the Logloss function as the objective function.

$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$

$$ Logloss = -\sum{(I(x)\log(O(x)) + (1 - I(x))\log(1 - O(x)))} $$
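
To make the two reconstruction objectives concrete, the sketch below evaluates them for one made-up input vector I(x) and its reconstruction O(x). Two assumptions are worth noting: n is taken to be the number of input features, and the Logloss is written with a leading minus sign so that it is a quantity to minimize (a perfect reconstruction gives a value near zero). The clipping constant `eps` is only there to avoid taking the logarithm of zero.

```python
import math

def rmse(inputs, outputs):
    """Root mean squared error between the input I(x) and the reconstruction O(x)."""
    n = len(inputs)
    return math.sqrt(sum((i - o) ** 2 for i, o in zip(inputs, outputs)) / n)

def logloss(inputs, outputs, eps=1e-12):
    """Cross-entropy reconstruction loss for inputs scaled to [0, 1]."""
    total = 0.0
    for i, o in zip(inputs, outputs):
        o = min(max(o, eps), 1.0 - eps)   # keep log() away from 0
        total += i * math.log(o) + (1.0 - i) * math.log(1.0 - o)
    return -total

original = [0.0, 1.0, 1.0, 0.0]          # hypothetical scaled expression values
reconstruction = [0.1, 0.9, 0.8, 0.2]    # hypothetical decoder output
print(rmse(original, reconstruction))
print(logloss(original, reconstruction))
```

Both values shrink toward zero as the reconstruction approaches the input, which is what training the autoencoder aims for.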

There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders, and variational autoencoders. The architectures vary in many parameters, such as depth and loss function. The autoencoders in the surveyed studies differ in the number of hidden layers, the activation functions (e.g. sigmoid function, exponential linear unit function), and the optimization algorithms (e.g. stochastic gradient descent, Adam optimizer).

Overfitting is a major problem that most autoencoders need to deal with to achieve high efficiency of the extracted features. Regularization, dropout, and sparsity are common solutions.

The features produced by neural network filtering methods were then used by different statistical methods and classifiers. The conventional methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE, and so on. The classifiers include SVM and AdaBoost, among others.

By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noise from the input, and gain better generalization performance by re-training the classifier with the newest output layer.

**Neural network prediction methods for cancer**

Prediction methods based on neural networks build a network that maps the input features to an output layer whose number of neurons is one or two for binary classification, or more for multi-class classification. Multi-class classification can also be handled by several independent binary neural networks, in which case the technique called “one-hot encoding” is applied.

The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while the other positions remain 0. The network is trained iteratively to map the input to the codeword, minimizing the objective function in each iteration.
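
The codeword construction can be illustrated in a few lines. The helper names `one_hot` and `decode` are hypothetical, classes are indexed from 0 here for convenience, and decoding by taking the largest output is a common convention rather than something the survey prescribes.

```python
def one_hot(class_index, k):
    """Codeword of length k with a 1 at the position of the given class."""
    return [1 if j == class_index else 0 for j in range(k)]

def decode(outputs):
    """Predicted class: the position of the largest network output."""
    return max(range(len(outputs)), key=lambda j: outputs[j])

print(one_hot(2, 4))              # codeword for class 2 out of 4 classes
print(decode([0.1, 0.7, 0.2]))    # class whose output neuron fires most strongly
```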

Such cancer classifiers were applied to identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and the GA-ANN model have all been used for the cancer prediction tasks mentioned above. The paper indicates that neural networks with an MLP architecture perform better as classifiers than SVM, logistic regression, naïve Bayes, classification trees, and KNN.

**Neural network clustering methods in cancer prediction**

Neural network clustering is a form of unsupervised learning. The input data are divided into different groups according to their feature similarity.
The single-layered neural network SOM, which is unsupervised and does not use backpropagation, is one of the traditional model-based techniques applied to gene expression data. Its accuracy can be measured with the Rand Index (RI), which can be refined into the Adjusted Rand Index (ARI), or with Normalized Mutual Information (NMI).

$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$
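
In the Rand Index, the counts are taken over pairs of samples: a pair is a true positive when both the reference labels and the clustering put the two samples together, and a true negative when both keep them apart. The sketch below, with made-up labels, counts the pairs directly.

```python
from itertools import combinations

def rand_index(true_labels, cluster_labels):
    """Fraction of sample pairs on which the clustering agrees with the reference."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_true and same_cluster:
            tp += 1
        elif not same_true and not same_cluster:
            tn += 1
        elif same_cluster:      # grouped together although the reference separates them
            fp += 1
        else:                   # separated although the reference groups them
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)

reference = [0, 0, 1, 1]
clustering = [0, 0, 1, 2]       # splits the second reference group in two
print(rand_index(reference, clustering))
```

Because only pair relationships are compared, the score does not depend on how the clusters happen to be numbered.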

In general, gene expression clustering considers the relevance of the sample-to-cluster assignment, of the gene-to-cluster assignment, or of both. Two methods address the high dimensionality problem: clustering ensembles, which run a single clustering algorithm several times, each with a different initialization or number of parameters; and projective clustering, which considers only a subset of the original features.
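
One common way to combine the runs of a clustering ensemble is a co-association matrix, which records how often each pair of samples ends up in the same cluster. The sketch below is a generic illustration under that assumption. It is not the specific ensemble frameworks discussed in the survey, and the three partitions are made up rather than produced by an actual clustering algorithm.

```python
def co_association(partitions):
    """partitions: one list of cluster labels per run, all over the same samples.
    Entry [i][j] is the fraction of runs that put samples i and j together."""
    n_samples = len(partitions[0])
    n_runs = len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / n_runs
             for j in range(n_samples)]
            for i in range(n_samples)]

# Hypothetical labels from three runs with different initializations.
partitions = [[0, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 0, 0, 1]]
matrix = co_association(partitions)
for row in matrix:
    print(row)
```

A consensus clustering can then be obtained by grouping samples whose co-association is high, for example by running a standard clustering algorithm on this matrix.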

SOM was applied to discriminate future tumor behavior using molecular alterations, a result that was not easy to obtain with classic statistical models. The paper then introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they do not take gene-to-cluster assignment into consideration.

The paper also presents the double SOM-based Clustering Ensemble Approach (SOM2CE) and the double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, the Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, and it performs better than SOM and RDCFCE when there are irrelevant genes.

## Summary

Cancer is a disease with a very high fatality rate that affects people worldwide, and analyzing gene expression is essential for discovering gene abnormalities and, as a consequence, increasing survivability. The analysis in the paper reveals that neural networks are mainly used for filtering gene expressions, predicting their class, or clustering them.

Neural network filtering methods are used to reduce the dimensionality of the gene expressions and remove their noise. The authors recommend deep architectures over shallow ones as best practice because they combine many nonlinearities.

Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network has only one or two output neurons that diagnose a given sample as cancerous or non-cancerous, while in multi-class problems the number of output neurons equals the number of classes. The authors suggested that deep architectures with convolution layers, the most recently used models, are effective at predicting cancer subtypes because they capture the spatial correlations between gene expressions. Clustering is another analysis tool, used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining ensemble clustering and projective clustering is more accurate than single-point clustering algorithms such as SOM, since those algorithms cannot distinguish noisy genes.

## Discussion

Several technical problems should be considered and addressed when building new models.

1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, a model is likely to fit the training data well but be inaccurate on test samples because it lacks generalization capability. Ways to avoid overfitting include: (1) adding weight penalties using regularization; (2) averaging the predictions of many models trained on different datasets; (3) dropout; (4) augmenting the dataset to produce more "observations".

2. Model configuration and training: To reduce computational and memory costs while keeping prediction accuracy high, it’s crucial to set the network parameters properly. Possible ways include: (1) proper initialization; (2) pruning unimportant connections by removing zero-valued neurons; (3) using an ensemble learning framework that trains different models with different parameter settings or with different parts of the dataset for each base model; (4) using SMOTE to deal with class imbalance in high-dimensional data.

3. Model evaluation: Braga-Neto and Dougherty investigated several model evaluation methods: cross-validation, resubstitution, and bootstrap. Cross-validation was found to be unreliable for small datasets because it displayed excessive variance. The bootstrap method gave more accurate estimates.

4. Study reproducibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms, data and methodology.
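
Two of the overfitting remedies listed in item 1, weight penalties and dropout, are simple enough to sketch directly. This is a generic illustration and not taken from any of the surveyed models: the penalty is the usual L2 (sum of squared weights) form, the dropout uses the common "inverted" scaling so that the expected activation is unchanged, and the weights, the rate, and the value of `lam` are made up.

```python
import random

def l2_penalty(weights, lam):
    """Regularization term added to the training loss: lam * sum of squared weights."""
    return lam * sum(w * w for layer in weights for w in layer)

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` during training and
    rescale the survivors by 1 / (1 - rate)."""
    scale = 1.0 / (1.0 - rate)
    return [a * scale if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
penalty = l2_penalty([[1.0, -2.0], [0.5]], lam=0.1)
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)
print(penalty)
print(dropped)
```

At prediction time dropout is switched off and the activations are used as they are.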

## Conclusion

This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that neural network methods can serve as filters, predictors, and clustering methods, and it also shows that the role of the neural network determines its general architecture. As suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered, such as overfitting and class imbalance, and suggested choosing different network parameters or combining two or more of the presented approaches. One of the biggest challenges for cancer prediction modelers is deciding on the network architecture (i.e. the number of hidden layers and neurons), as there are currently no guidelines to follow to obtain high prediction accuracy.

## Critiques

While the results indicate that the function of the neural network determines its general architecture, the number of hidden layers, the number of neurons, the hyperparameters, and the learning algorithm are still chosen by trial and error. Improvements in this area should therefore be explored to obtain better results and to support more convincing conclusions.

## Reference

Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.