http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=D26ma&feedformat=atomstatwiki - User contributions [US]2024-03-28T21:30:26ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Efficient_kNN_Classification_with_Different_Numbers_of_Nearest_Neighbors&diff=48316Efficient kNN Classification with Different Numbers of Nearest Neighbors2020-11-30T05:01:11Z<p>D26ma: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Unlike popular parametric approaches, which first utilizes training observations to learn a model and subsequently predict test samples with this model, the non-parametric k-nearest neighbors (kNNs) method classifies observations with a majority rules approach. kNN classifies test samples by assigning them the label of their k closest training observation (neighbours). This method has become very popular due to its significant performance and easy implementation. <br />
<br />
There are two main approaches to conduct kNN classification. The first is to use a fixed k value to classify all test samples. The second is to use a different k value for each test sample. The former, while easy to implement, has proven to be impractical in machine learning applications. Therefore, interest lies in developing an efficient way to apply a different optimal k value for each test sample. The authors of this paper presented the kTree and k*Tree methods to solve this research question.<br />
<br />
== Previous Work == <br />
<br />
Previous work on finding an optimal fixed k value for all test samples is well-studied. Zhang et al. [1] incorporated a certainty factor measure to solve for an optimal fixed k. This resulted in the conclusion that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al.[2] explored involved selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] had the solution of selecting a robust k using the consensus of multiple rounds of kNNs. These fixed k methods are valuable however are impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as the ones taken by Zhu et al. as well as Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross validation while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches such as those taken by Zheng et al. and Góra and Wojna also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k-values for each test sample. While all these methods are valid, their processes of either learning varied k values or scanning all training samples are time-consuming.<br />
<br />
== Motivation == <br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}</math> represents the training set can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, an <math>l_1</math> regularization term multiplied by a tuning parameter, <math>\rho_1</math>, is added to ensure that sparse results are generated as the objective is to minimize the number of training samples that will eventually be depended on by the test samples. <br />
<br />
The least square loss function is then further modified to account for samples that have similar values for certain features yielding similar results. After some transformations, this second regularization term that has tuning parameter <math>\rho_2</math> is:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features.<br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}|| + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to optimize it to find the optimal solution <math>\mathbf{W^*}</math>.<br />
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training sample so if a value is 0, it can be concluded that the jth training sample has no effect on the ith training sample which means that it should not be used in the prediction of the ith training sample. Consequently, all non-zero values in the <math>w_{.j}</math> vector would be useful in predicting the ith training sample which gives the result that the number of these non-zero elements for each sample is equal to the optimal ''k'' value for each sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 is non-zero.<br />
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
As opposed to a normal decision tree where the target data is the labels themselves, the target data when using the kTree method are the optimal ''k'' values for each sample that were solved for in the previous step so this decision tree has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbours across '''all''' of the training data.<br />
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 800x800px]]<br />
<br />
Clearly, this is a very similar approach to the kTree as the k*Tree method attempts to sacrifice very little in predictive power in return for a substantial decrease in complexity when actually implementing the traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all steps previous are the exact same, the k*Tree method not only stores the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets provided by the UCI Repository of Machine Learning Data, and contained a mix of data types varying in size, in dimensionality, in the number of classes, and in imbalanced nature of the data. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to k=1,5,10,20 and square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 800x800px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. However, the k*Tree method was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15 000 times faster than either method. In addition, the kTree had the highest accuracy and it's running cost was lower than any other methods except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score [15] approach was used to rank and select the most information features in the datasets. <br />
<br />
[[File:Table_II_kNN.png | center | 800x800px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches for kNN classification algorithms that can determine optimal k-values for each test sample. The proposed kTree and k*Tree methods achieve efficient classification by designing a training step that reduces the run time of the test stage. Based on the experimental results for varying sample sizes and differing feature numbers, it was observed that the proposed methods outperformed existing ones in terms of running cost while still achieving similar or better classification accuracies. Future areas of investigation could focus on the improvement of kTree and k*Tree for data with large numbers of features. <br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalance data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their ktTee and k*Tree approach with different kNN methods, the paper could contrast their results with more of the approaches discussed in the Related Work section of their paper. For example, it would be interesting to see how the kTree and k*Tree results compared to Góra and Wojna varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be really helpful if Ktrees method can be explained at the very beginning. The transition from KNN to Ktrees are not very smooth.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan&diff=48307User:Yktan2020-11-30T04:53:59Z<p>D26ma: /* Critiques/ Insights */</p>
<hr />
<div><br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators. Data obtained in large quantities through searching for images in search engines and data downloaded from social media sites (in a manner abiding by privacy and copyright laws) are especially noisy, since the labels are generally inferred from tags to save on human-annotation cost. <br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. The main limitation of these methods is that they do not perform well under high noise ratio and cause overfitting.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoid confirmation bias.<br />
2) During the SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabelling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix, which denotes the probability of clean labels flipping to noisy labels, to correct the loss function<br />
<li> Leveraging the predictions from DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods all have downsides: it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; and for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch, incorporates the two classes of regularization. These classes are consistency regularization which enforces the model to produce consistent predictions on augmented input data, and entropy minimization which encourages the model to give high-confidence predictions on unlabeled data, as well as MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (an SSL technique) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.<br />
<br />
== Model Architecture and Algorithm ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modeling. The sample is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model as a per-sample loss distribution. The unlabeled set is made up of data points with discarded labels deemed noisy. Then, to avoid confirmation bias, which is typical when a model is self-training, two models are being trained simultaneously to filter error for each other. This is done by dividing the data using one model and then training the other model. This algorithm, known as Co-divide, keeps the two networks from converging when training, which avoids the bias from occurring. Being diverged also offers the two networks distinct abilities to filter different types of error, making the model more robust to noise. Figure 1 describes the algorithm in graphical form.<br />
<br />
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
<br />
For each epoch, the network divides the dataset into a labeled set consisting of clean data, and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of the labelled samples, co-refinement is performed by using the ground truth label <math> y_b </math>, the predicted label <math> p_b </math>, and the posterior is used as the weight, <math> w_b </math>. <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate with reduced temperature, <math> \hat{y}_b </math>. <br />
<br />
<center><math> \hat{y}_b=Sharpen(\bar{y}_b,T)={\bar{y}^{c{\frac{1}{T}}}_b}/{\sum_{c=1}^C\bar{y}^{c{\frac{1}{T}}}_b} </math>, for <math>c = 1, 2,..,C</math></center><br />
<br />
Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
<br />
\lambda &\sim Beta(\alpha, \alpha) \\<br />
\lambda ' &= max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\<br />
<br />
\end{alignat}<br />
</math><br />
</center> <br />
<br />
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center> <br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
<br />
Lastly, the stochastic gradient descent formula is updated with the calculated loss, <math> L </math>, and the estimated parameters, <math> \boldsymbol{ \theta } </math>.<br />
<br />
The full algorithm is shown below. [[File:dividemix.jpg|600px| | center]]<br />
<div align="center">Algorithm1: DivideMix. Line 4-8: co-divide; Line 17-18: label co-refinement; Line 20: co-guessing.</div><br />
<br />
The when the model is warmed up, it is trained on all data using standard cross-entropy to initially converge the model, but with a regulatory negative entropy term <math>\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))</math>, where <math>\text{p}^\text{c}_\text{model}</math> is the softmax output probability for class c. This term penalizes confident predictions during the warm up to prevent overfitting to noise during the warm up, which can happen when there is asymmetric noise.<br />
<br />
== Results ==<br />
'''Applications'''<br />
<br />
The method was validated using four benchmark datasets: CIFAR-10, CIFAR100 (Krizhevsky & Hinton, 2009) which contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02 and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.<br />
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models sample loss with BMM and apply MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From table 1, the author noticed that none of these methods can consistently outperform others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The effect of removing different components to provide insights into what makes DivideMix successful. We analyze the results in Table 5 as follows.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors combined self-divide with the original MixMatch as a naive baseline for using SLL in LNL.<br />
They also find that both label refinement and input augmentation are beneficial for DivideMix. ''Label refinement'' is important for high noise ratio due because samples that are noisier would be incorrectly divided into the labeled set. ''Augmentation'' upgrades model performance by creating more reliable predictions and by achieving consistent regularization. In addition, the performance drop was seen in the ''DivideMix w/o co-training'' highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.<br />
<br />
== Conclusion ==<br />
<br />
This paper provides a new and effective algorithm for learning with noisy labels by using highly noisy data unlabelled data in a Semi-Supervised Learning framework. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-labeling effectively, therefore it is a robust approach to deal with noise in datasets. Also, the DivideMix method has been tested using various datasets with the results consistently being one of the best when compared to the state-of-the-art methods through extensive experiments.<br />
<br />
Future work of DivideMix is to create an adaptation for other applications such as Natural Language Processing, and incorporating the ideas of SSL and LNL into DivideMix architecture.<br />
<br />
== Critiques/ Insights ==<br />
<br />
1. While combining both models makes the result better, the author did not show the relative time increase using this new combined methodology, which is very crucial considering training a large amount of data, especially for images. In addition, it seems that the author did not perform much on hyperparameters tuning for the combined model.<br />
<br />
2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.<br />
<br />
3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.<br />
<br />
4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.<br />
<br />
5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points. <br />
<br />
6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.<br />
<br />
7. The current benchmark accuracy for CIFAR-10 is 99.7, CIFAR-100 is 96.08 using EffNet-L2 in 2020. In 2019, CIFAR-10 is 99.37, CIFAR-100 is 93.51 using BiT-L.(based on paperswithcode.com) As there exists better methods, it would be nice to know why the authors chose these state-of-the-art methods to compare the test accuracy.<br />
<br />
8. Another interesting observation is that DivideMix seems to maintain a similar accuracy while some methods give unstable results. That shows the reliability of the proposed algorithm.<br />
<br />
9. It would be interesting to see if the drop in accuracy from increasing the noise ratio to 90% is a result of a low porportion or low number of clean labels. That is, would increasing the size of the training set but keeping the noise ratio at 90% result in increased accuracy?<br />
<br />
10. For Ablation Study part, the paper also introduced a study on the Robustness of Testing Marking Methods Noise, including AUC for classification of clean/noisy samples of CIFAR-10 training data. And it shows that the method can effectively separate clean and noisy samples as training proceeds.<br />
<br />
11. It is interesting how unlike common methods, the method in this paper discards the labels that are highly likely to be<br />
noisy. It also utilizes the noisy samples as unlabeled data to regularize training in a SSL manner. This model can better distinguish and utilize noisy samples.<br />
<br />
12. In the result section, the author gives us a comprehensive understanding of this algorithm by introducing the applications and the comparison of it with respect to similar methods. It would be attractive if in the application part, the author could indicate how the application relative to our daily life.<br />
<br />
13. High quality data is very important for training Machine learning systems. Preparing the data to train ML systems requires data annotations which are prone to errors and are time-consuming. It is interesting to note how paper 14 and this paper aims to approach this problem from different perspectives. Paper 14 introduces CSL algorithm that learns from confused or Noisy data to find the tasks associated with them. And this paper proposes an algorithm that shows good performance when learning from noisy data. Hence both the papers seem to tackle similar problem and implementing the approaches described in both the papers when handling noisy data can be twice helpful.<br />
<br />
14. Noise exists in all big data, and big data is what we are dealing with in real life nowadays. Having an effective noise eliminating method such as Dividemix is important to us.<br />
<br />
== References ==<br />
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=48302Superhuman AI for Multiplayer Poker2020-11-30T04:48:11Z<p>D26ma: /* Discussion and Critiques */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
A super-intelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most of the superhuman AI that was built can only beat human players in two-player zero-sum games. Zero-sum games are games where the gain for one player means a loss for the other player, so the sum of the gains and losses always equals zero. They almost dominated most of the board games in these twenty years. The most popular AI in the board games are the chess AI deep blue and the go chess AI Alpha-go. The most common strategy that the AI uses to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is a pair of strategies such that either single-player switching to any ''other'' choice of strategy (while the other player's strategy remains unchanged) will result in a lower payout for the switching player. Intuitively this is similar to a locally optimal strategy for the players but is (i) not guaranteed to exist and (ii) may not be the truly optimal strategy. An example of this is the Prisoner's dilemma, where two individuals each have the option to testify against the other or to remain silent. Although the optimal choice is to remain silent, the individuals have an incentive to act in their own self-interest which results in a less than optimal outcome.<br />
<br />
More specifically, in the game of poker, we only have AI models that can beat human players in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field, because there is no polynomial-time algorithm that can find a Nash equilibrium in two-player non-zero-sum games, and having one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI which we call Pluribus is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world. The algorithm that is used is not guaranteed to converge to a Nash algorithm outside of two-player zero-sum games. However, it uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that despite not having strong theoretical guarantees on performance, they are capable of applying a wider class of superhuman strategies.<br />
<br />
Knowing the basics of Texas hold'em poker may help to understand how the AI can play. Texas hold'em is played using a standard deck of 52 cards. For each hand, players are each dealt two cards, which only they can see. Once players receive their two cards, but before any other cards have been turned, players (in clockwise order) have the option to call, raise, fold, or check. Call means to match the highest bet so far, raise means to increase the bet, fold means the player wishes to leave the current hand, and check means to continue without betting any further. A player can only check if no one has bet yet this round or if they have already matched the highest bet. Once this round of betting is over three cards, called the Flop, are turned over, and the betting happens in a clockwise manner again. This is followed by revealing another card, the Turn, and then proceeding with another round of betting. Now, the final card, the River, is turned. There is one more round of betting before the players reveal their cards and determine the winner. The player with the best five-card hand (out of the 7 cards) wins the hand and the money in the pot. The following is a list of the Texas hold'em hands from best to worst:<br />
* Royal Flush (A,K,Q,J,10 all of the same suit)<br />
* Straight Flush(5 in a row from the same suit)<br />
* 4 of a kind<br />
* Full house (3 of a kind + a pair)<br />
* Flush (5 cards of the same suit)<br />
* Straight (5 cards in a row)<br />
* 3 of a kind<br />
* 2 different pairs<br />
* A pair<br />
* Highest card<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
Many AI has reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. Nash equilibrium has been proven to exist in all finite games and numerous infinite games. However, the challenge is to find the equilibrium. It is the best possible strategy and is unbeatable in two-player zero-sum games since it guarantees to not lose in expectation regardless of what the opponent is doing.<br />
<br />
To have a deeper understanding of Nash Equilibria we must first define some basic game theory concepts. The first one being a strategic game, in-game theory a strategic game consists of a set of players, for each player a set of actions and for each player preferences (or payoffs) over the set of action profiles (set of combination of actions). With these three elements, we can model a wide variety of situations. Now a Nash Equilibrium is an action profile, with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. We also have mixed strategies and mixed strategy Nash equilibria. A mixed strategy is when instead of a player choosing an action they apply a probability distribution to their set of actions and pick randomly. Note that with mixed strategies we must look at the expected payoff of the player given the other players' strategies. Therefore a mixed strategy Nash Equilibria involves at least one player playing with a mixed strategy where no player can increase their expected payoff by changing their action, given that all other players' actions remain the same. Then we can define a pure Nash Equilibria to where no one is playing a mixed strategy. We also must be aware that a single game can have multiple pure Nash equilibria and mixed Nash equilibria. Also, Nash Equilibria are purely theoretical and depend on players acting optimally and being rational, this is not always the case with humans and we can act very irrationally. Therefore empirically we will see that games can have very unexpected outcomes and you may be able to get a better payoff if you move away from a strictly theoretical strategy and take advantage of your opponent's irrational behavior. <br />
<br />
The insufficiency with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. At the Nash equilibrium, there is no incentive for any player to change their initial strategy, so it is a stable state of the system. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example, our player cannot win in expectation. Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.<br />
<br />
Trying to approximate a Nash equilibrium is hard in theory, and in games with more than two players, it can only find a handful of possible strategies per player. Currently, existing techniques to find ways to exploit an opponent require way too many samples and are not competitive enough outside of small games. Finding a Nash equilibrium in three or more players is a great challenge. Even we can efficiently compute a Nash equilibrium in games with more than two players, it is still highly questionable if playing the Nash equilibrium strategy is a good choice. Additionally, if each player tries to find their own version of a Nash equilibrium, we could have infinitely many strategies and each player’s version of the equilibrium might not even be a Nash equilibrium.<br />
<br />
Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colors which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses forms of abstraction to make computations scalable. To simplify the complexity due to too many decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical. This process is called abstraction. Pluribus uses two kinds of abstraction: '''Action abstraction and information abstraction'''. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bets it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information. For instance, the player’s cards and revealed board cards. This is only used to reason about situations on future betting rounds, never the current betting round.<br />
<br />
Pluribus uses a built-in strategy - '''“Blueprint strategy”''', which it gradually improves by searching in real-time in situations it finds itself in during the course of the game. In the first betting round, pluribus uses the initial blueprint strategy when the number of decision points is small. The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in imperfect information games AI which is trained by repeatedly playing against copies of itself, without any data of human or prior AI play used as input. For ease of computation of CFR in this context, poker is represented as a game tree. A game tree is a tree structure where each node represents either a player’s decision, a chance event, or a terminal outcome and edges represent actions taken. <br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Kuhn Poker (Simpler form of Poker) </div><br />
<br />
At the start of each iteration, MCCFR stimulates a hand of poker randomly (Cards held by a player at a given time) and designates one player as the traverser of the game tree. Once that is completed, the AI reviews the decision made by the traverser at a decision point in the game and investigates whether the decision was profitable. The AI compares its decision with other actions available to the traverser at that point and also with the future hypothetical decisions that would have been made following the other available actions. To evaluate a decision, the Counterfactual Regret factor is used. This is the difference between what the traverser would have expected to receive for choosing an action and actually received on the iteration. Thus regret is a numeric value, where a positive regret indicates you regret your decision, a negative regret indicates you are happy with your decision and zero regret indicates that you are indifferent.<br />
<br />
The value of counterfactual regret for a decision is adjusted over the iterations as more scenarios or decision points are encountered. This means at the end of each iteration, the traverser’s strategy is updated so actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy overall iterations converge and the average strategy is the approximated Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations i.e it assigns a weight of T to regret contributions at iteration T. This causes the influence of the first iteration to decay at a rate of <math>\frac{1}{\sum_{t=1}^Tt} = \frac{2}{T(T+1)}</math>, compared to a rate of <math>\frac{1}{T}</math> in the original CFR algorithm. This leads to the strategy of improving more quickly in practice.<br />
<br />
An additional feature of Pluribus is that in the sub games, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose between k different strategies specialized to each player when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player never bluffs while holding the best possible hand, then the opponents would learn that fact and always fold in that scenario. To fold in that scenario is a more balanced strategy than to bet.<br />
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus functions, it was tested against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally and were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days with the 5H+1AI format. Players were anonymized with aliases that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and types of games played by each player over the 10,000 hands. <br />
<br />
The second format includes one human player and 5 copies of Pluribus (1H+5AI). There were 2 more professional players who split another 10,000 hands of poker by playing 5000 hands each and followed the same aliasing process as the first format.<br />
The performance was measured using milli big blinds per game, mbb/game, (i.e. the initial amount of money the second player has to put in the pot) which is the standard measure in the AI field. Additionally, AIVAT was used as the variance reduction technique to control for luck in the games. Significance tests were run at a 95% significance level with one-tailed t-tests to check the profitability of Pluribus's performance.<br />
<br />
Applying AIVAT the following were the results:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
[[File:top.PNG| 950px | x450px |left]]<br />
<br />
<br />
<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div> <br />
<br />
Optimal play in Pluribus looks different from well-known poker conventions: A standard convention of “limping” in poker (calling the 'big blind' rather than folding or raising) is confirmed to be not optimal by Pluribus since it initially experimented with it but eliminated this from its strategy over its games of self-play. On the other hand, another convention of “donk betting” (starting a round by betting when someone else ended the previous round with a call) that is dismissed by players was adopted by Pluribus much more often than played by humans and is proven to be profitable.<br />
<br />
== Discussion and Critiques ==<br />
<br />
Pluribus' Blueprint strategy and Abstraction methods effectively reduce the computational power required. Hence it was computed in 8 days and required less than 512 GB of memory, and costs about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games. This is a great way the researchers have condensed down the problem to fit the current computational powers. <br />
<br />
Pluribus definitely shows that we can capture observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees, this can be a baseline for future AI inventions and help in the research of AI. It would be interesting to use Pluribus's way of using a non-theoretical approach in more real-life problems such as autonomous driving or stock market trading.<br />
<br />
Extending this idea beyond two-player zero-sum games will have many applications in real life. For example, the emergence of strategies used by the superhuman AI that were seen as "non-optimal" to humans, poses interesting questions about applying such an AI to fields such as business, economics and financial markets.<br />
<br />
The summary for Superhuman AI for Multiplayer Poker is very well written, with a detailed explanation of the concept, steps, and result and with a combination of visual images. However, it seems that the experiment of the study is not well designed. For example, sample selection is not strict and well defined, this could cause selection bias introduced into the result and thus making it not generalizable.<br />
<br />
Superhuman AI, while sounding superior, is actually not uncommon. There have been many endeavours on mastering poker such as the Recursive Belief-based Learning (ReBeL) by Facebook Research. They pursued a method of reinforcement learning on a partially observable Markov decision process which was inspired by the recent successes of AlphaZero. For Pluribus to demonstrate how effective it is compared to the state-of-the-art, it should run some experiments against ReBeL.<br />
<br />
This is a very interesting topic, and this summary is clear enough for readers to understand. I think this application not only can apply in poker, maybe thinking of more applications in other areas? There are many famous AI that really changing our life. For example, AlphaGo and AlphaStar, which are developed by Google DeepMind, defeated professional gamers. Discussing more this will be interesting.<br />
<br />
One of the biggest issues when applying AI to games against humans (when not all information is known, ie, opponents' cards) is the assumption is generally made that the human players are rational players which follow a certain set of "rules" based on the information that they know. This could be an issue with the fact that Pluribus has trained itself by playing itself instead of humans. While the results clearly show that Pluribus has found some kind of 'optimal' method to play, it would be interesting to see if it could actually maximize its profits by learning the trends of its human opponents over time (learning on the fly with information gained each hand while it's playing). In addition to that, the paper may discuss how human action could be changed in the game when they play with Superhuman AI. We can see that playing card games require various strategy and different people can have a different set of actions in the same game and in the same situation.<br />
<br />
One interesting software called Piosolver leverages a similar tree-based algorithm presented in the paper to recommend the move that is deemed game theory optimal (GTO). In the poker world, GTO is a play-style that is based on mathematics and is considered a "defensive" strategy. Following the rock, paper, scissors analogy from the paper, a GTO play-style is synonymous with choosing randomly between the three options, whereas an exploitative strategy involves reading a human player's tendencies and adjusting the strategy accordingly. Piosolver is used by many professional poker players to enhance their game and gain intuition on what the best move is in certain situations.<br />
<br />
Another way to train the proposed model can be a poker game with two or more AI players. That method was used by AlphaGo to train a better model. <br />
<br />
Games with various AI players would be an interesting topic, and through comparing different AI, their shortcomes could be observed and improved. More discussions on this would be of interests.<br />
<br />
Similar to Pluribis, another [https://science.sciencemag.org/content/356/6337/508 paper] discussed a different AI program, called DeepStack, which also has defeated professional poker players at a 2-player Texas hold'em variant. However, instead of finding the Nash Equilibria, DeepStack uses recursive reasoning, decomposition, and a form of intuition that is automatically learned from self-play.<br />
<br />
AI can calculate the exact odds of a specific card or set of cards dropped on the flop. The difficulty is that in poker, the player cannot see the opponent's hand. This kind of hidden information also brings other complications (such as fraud), which is more challenging for AI at the game. At the same time, using AI to train each other is also an interesting topic.<br />
<br />
In inspiration from the critique about fraud, could this AI be used to detect cheating, such as illicitly obtaining information from an opponents hand, in games between human players? For instance by comparing the decision of the humans versus that of the AI's given the same information. Or are the strategies between human players and this AI too different to make conclusions about suspicious humans plays?<br />
<br />
The summary is overall well-defined, but authors should explain more on the algorithm parts such that specify how this model perform in each step to achieve the optimal solution. Also, there is an issue such that we must have an assumption before applying this model. The assumption is human need to follow the "rules". What if this model is against with human which does not follow the "rules" then how this model will learn from it? This is a question to contemplate.<br />
<br />
This is an interesting topic discussing AI players in the game. It would be attractive if the summary can provide a brief comparison about the advantages and shortages of the AI players nowadays.<br />
<br />
It is interesting to see how AI has some strategy playing against people in Texas poker. However, this game also relies on how people behave in their body language and the way they talk. This paper focuses more on theoretical side but not 100% true in real life.<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained only by self-play, it is an unbiased and a different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker is a widely recognized<br />
a milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (July 12, 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (November 17, 2020). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero<br />
<br />
Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks&diff=48298Graph Structure of Neural Networks2020-11-30T04:42:41Z<p>D26ma: /* Critique */</p>
<hr />
<div>= Presented By =<br />
<br />
Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang<br />
<br />
= Introduction =<br />
<br />
A deep neural network is composed of neurons organized into layers and the connections between them. The architecture of a neural network can be captured by its "computational graph", where neurons are represented as nodes that perform mathematical computations, and directed edges link neurons in different layers. This graphical representation demonstrates how the network transmits and transforms information through its input neurons through the hidden layers and ultimately to the output neurons. <br />
<br />
However, few researchers have focused on the relationship between neural networks and their predictive performance, which is the main focus of this paper.<br />
<br />
In Neural Network research, it is often important to build a relation between a neural network’s accuracy and its underlying graph structure. It would contribute in building more efficient and more accurate neural network architectures. A natural choice is to use a computational graph representation. Doing so, however, has many limitations, including: <br />
<br />
(1) Lack of generality: Computational graphs have to follow allowed graph properties, which limits the use of the rich tools developed for general graphs. <br />
<br />
(2) Disconnection with biology/neuroscience: Biological neural networks have a much complicated and less standardized structure. For example, there might be information exchanges in the brain networks. It is difficult to represent these models with directed acyclic graphs. This disconnection between biology/neuroscience makes knowledge less transferable and interdisciplinary research more difficult.<br />
<br />
Thus, the authors developed a new way of representing a neural network as a graph, called a relational graph. The key insight in the new representation is to focus on message exchange, rather than just on directed data flow. For example, for a fixed-width fully-connected layer, an input channel and output channel pair can be represented as a single node, while an edge in the relational graph can represent the message exchange between the two nodes. Under this formulation, using the appropriate message exchange definition, it can be shown that the relational graph can represent many types of neural network layers.<br />
<br />
WS-flex is a graph generator that allows systematically exploring the design space of neural networks. Neural networks are characterized by the clustering coefficient and average path length of their relational graphs under the insights of neuroscience. Based on the insights from<br />
neuroscience, the authors characterize neural networks by the clustering coefficient and average path length of their relational graphs.<br />
<br />
= Neural Network as Relational Graph =<br />
<br />
The author proposes the concept of relational graphs to study the graphical structure of neural network. Each relational graph is based on an undirected graph <math>G =(V; E)</math>, where <math>V =\{v_1,...,v_n\}</math> is the set of all the nodes, and <math>E \subseteq \{(v_i,v_j)|v_i,v_j\in V\}</math> is the set of all edges that connect nodes. Note that for the graph used here, all nodes have self edges, that is <math>(v_i,v_i)\in E</math>. Graphically, a self edge connects an edge to itself, which forms a loop.<br />
<br />
To build a relational graph that captures the message exchange between neurons in the network, we associate various mathematical quantities to the graph <math>G</math>. First, a feature quantity <math>x_v</math> is associated with each node. The quantity <math>x_v</math> might be a scalar, vector or tensor depending on what type of neural network it is (see the Table at the end of the section). Then a message function <math>f_{uv}(·)</math> is associated with every edge in the graph. A message function specifically takes a node’s feature as the input and then outputs a message. An aggregation function <math>{\rm AGG}_v(·)</math> then takes a set of messages (the outputs of the message function) and outputs the updated node feature. <br />
<br />
A relation graph is a graph <math>G</math> associated with several message exchange rounds, which transform the feature quantity <math>x_v</math> with the message function <math>f_{uv}(·)</math> and the aggregation function <math>{\rm AGG}_v(·)</math>. At each round of message exchange, each node sends messages to its neighbors and aggregates incoming messages from its neighbors. Each message is transformed at each edge through the message function, then they are aggregated at each node via the aggregation function. Suppose we have already conducted <math>r-1</math> rounds of message exchange, then the <math>r^{th}</math> round of message exchange for a node <math>v</math> can be described as<br />
<br />
<div style="text-align:center;"><math>\mathbf{x}_v^{(r+1)}= {\rm AGG}^{(r)}(\{f_v^{(r)}(\textbf{x}_u^{(r)}), \forall u\in N(v)\})</math></div> <br />
<br />
where <math>\mathbf{x}^{(r+1)}</math> is the feature of the <math>v</math> node in the relational graph after the <math>r^{th}</math> round of update. <math>u,v</math> are nodes in Graph <math>G</math>. <math>N(u)=\{u|(u,v)\in E\}</math> is the set of all the neighbor nodes of <math>u</math> in graph <math>G</math>.<br />
<br />
To further illustrate the above, we use the basic Multilayer Perceptron (MLP) as an example. An MLP consists of layers of neurons, where each neuron performs a weighted sum over scalar inputs and outputs, followed by some non-linearity. Suppose the <math>r^{th}</math> layer of an MLP takes <math>x^{(r)}</math> as input and <math>x^{(r+1)}</math> as output, then a neuron computes <br />
<br />
<div style="text-align:center;"><math>x_i^{(r+1)}= \sigma(\Sigma_jw_{ij}^{(r)}x_j^{(r)})</math>.</div> <br />
<br />
where <math>w_{ij}^{(r)}</math> is the trainable weight and <math>\sigma</math> is the non-linearity function. Let's first consider the special case where the input and output of all the layers <math>x^{(r)}</math>, <math>1 \leq r \leq R </math> have the same feature dimensions <math>d</math>. In this scenario, we can have <math>d</math> nodes in the Graph <math>G</math> with each node representing a neuron in MLP. Each layer of neural network will correspond with a round of message exchange, so there will be <math>R</math> rounds of message exchange in total. The aggregation function here will be the summation with non-linearity transform <math>\sigma(\Sigma)</math>, while the message function is simply the scalar multipication with weight. A fully-connected, fixed-width MLP layer can then be expressed with a complete relational graph, where each node <math>x_v</math> connects to all the other nodes in <math>G</math>, that is neighborhood set <math>N(v) = V</math> for each node <math>v</math>. The figure below shows the correspondence between the complete relation graph with a 5-layer 4-dimension fully-connected MLP.<br />
<br />
<div style="text-align:center;">[[File:fully_connnected_MLP.png]]</div><br />
<br />
In fact, a fixed-width fully-connected MLP is only a special case under a much more general model family, where the message function, aggregation function, and most importantly, the relation graph structure can vary. The different relational graph will represent the different topological structure and information exchange pattern of the network, which is the property that the paper wants to examine. The plot below shows two examples of non-fully connected fixed-width MLP and their corresponding relational graphs. <br />
<br />
<div style="text-align:center;">[[File:otherMLP.png]]</div><br />
<br />
We can generalize the above definitions for fixed-width MLP to Variable-width MLP, Convolutional Neural Network (CNN), and other modern network architecture like Resnet by allowing the node feature quantity <math>\textbf{x}_j^{(r)}</math> to be a vector or tensor respectively. In this case, each node in the relational graph will represent multiple neurons in the network, and the number of neurons contained in each node at each round of message exchange does not need to be the same, which gives us a flexible representation of different neural network architecture. The message function will then change from the simple scalar multiplication to either matrix/tensor multiplication or convolution. And the weight matrix will change to the convolutional filter. The representation of these more complicated networks are described in details in the paper, and the correspondence between different networks and their relational graph properties is summarized in the table below. <br />
<br />
<div style="text-align:center;">[[File:relational_specification.png]]</div><br />
<br />
Overall, relational graphs provide a general representation for neural networks. With proper definitions of node features and message exchange, relational graphs can represent diverse neural architectures, thereby allowing us to study the performance of different graph structures.<br />
<br />
= Exploring and Generating Relational Graphs=<br />
<br />
We will deal with the design and how to explore the space of relational graphs in this section. There are three parts we need to consider:<br />
<br />
(1) '''Graph measures''' that characterize graph structural properties:<br />
<br />
We will use one global graph measure, average path length, and one local graph measure, clustering coefficient in this paper.<br />
To explain clearly, average path length measures the average shortest path distance between any pair of nodes; the clustering coefficient measures the proportion of edges between the nodes within a given node’s neighborhood, divided by the number of edges that could possibly exist between them, averaged over all the nodes.<br />
<br />
(2) '''Graph generators''' that can generate the diverse graph:<br />
<br />
With selected graph measures, we use a graph generator to generate diverse graphs to cover a large span of graph measures. To figure out the limitation of the graph generator and find out the best, we investigate some generators including ER, WS, BA, Harary, Ring, Complete graph and results are shown below:<br />
<br />
<div style="text-align:center;">[[File:3.2 graph generator.png]]</div><br />
<br />
Thus, from the picture, we could obtain the WS-flex graph generator that can generate graphs with a wide coverage of graph measures; notably, WS-flex graphs almost encompass all the graphs generated by classic random generators mentioned above.<br />
<br />
(3) '''Computational Budget''' that we need to control so that the differences in performance of different neural networks are due to their diverse relational graph structures.<br />
<br />
It is important to ensure that all networks have approximately the same complexities so that the differences in performance are due to their relational graph structures when comparing neutral work by their diverse graph.<br />
<br />
We use FLOPS (# of multiply-adds) as the metric. We first compute the FLOPS of our baseline network instantiations (i.e., complete relational graph) and use them as the reference complexity in each experiment. From the description in section 2, a relational graph structure can be instantiated as a neural network with variable width. Therefore, we can adjust the width of a neural network to match the reference complexity without changing the relational graph structures.<br />
<br />
= Experimental Setup =<br />
The author studied the performance of 3942 sampled relational graphs (generated by WS-flex from the last section) of 64 nodes with two experiments: <br />
<br />
(1) CIFAR-10 dataset: 10 classes, 50K training images, and 10K validation images<br />
<br />
Relational Graph: all 3942 sampled relational graphs of 64 nodes<br />
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
The model was trained for 200 epochs with batch size 128, using cosine learning rate schedule with an initial learning rate of 0.1<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Results and Discussions =<br />
<br />
The paper summarizes the result of the experiment among multiple different relational graphs through sampling and analyzing and list six important observations during the experiments, These are:<br />
<br />
* There always exists a graph structure that has higher predictive accuracy under Top-1 error compared to the complete graph<br />
<br />
* There is a sweet spot such that the graph structure near the sweet spot usually outperforms the base graph<br />
<br />
* The predictive accuracy under top-1 error can be represented by a smooth function of Average Path Length <math> (L) </math> and Clustering Coefficient <math> (C) </math><br />
<br />
* The Experiments are consistent across multiple datasets and multiple graph structures with similar Average Path Length and Clustering Coefficient.<br />
<br />
* The best graph structure can be identified easily.<br />
<br />
* There are similarities between the best artificial neurons and biological neurons.<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png |1000px]]<br />
<br />
$$\text{Figure - Results from Experiments}$$<br />
<br />
<br />
'''Neural networks performance depends on its structure'''<br />
<br />
During the experiment, Top-1 errors for all sampled relational graph among multiple tasks and graph structures are recorded. The parameters of the models are average path length and clustering coefficient. Heat maps were created to illustrate the difference in predictive performance among possible average path length and clustering coefficient. In '''Figure - Results from Experiments (a)(c)(f)''', The darker area represents a smaller top-1 error which indicates the model performs better than the light area.<br />
<br />
Compared to the complete graph which has parameter <math> L = 1 </math> and <math> C = 1 </math>, the best performing relational graph can outperform the complete graph baseline by 1.4% top-1 error for MLP on CIFAR-10, and 0.5% to 1.2% for models on ImageNet. Hence it is an indicator that the predictive performance of the neural networks highly depends on the graph structure, or equivalently that the completed graph does not always have the best performance.<br />
<br />
<br />
'''Sweet spot where performance is significantly improved'''<br />
<br />
It had been recognized that training noises often results in inconsistent predictive results. In the paper, the 3942 graphs in the sample had been grouped into 52 bins, each bin had been colored based on the average performance of graphs that fall into the bin. By taking the average, the training noises had been significantly reduced. Based on the heat map '''Figure - Results from Experiments (f)''', the well-performing graphs tend to cluster into a special spot that the paper called “sweet spot” shown in the red rectangle, the rectangle is approximately included clustering coefficient in the range <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math>.<br />
<br />
<br />
'''Relationship between neural network’s performance and parameters'''<br />
<br />
When we visualize the heat map, we can see that there is no significant jump of performance that occurred as a small change of clustering coefficient and average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed in a small range, it is observed that a second-degree polynomial is a good visualization tool for the overall trend ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and average path length are highly related to neural network performance by a U-shape. <br />
<br />
<br />
'''Consistency among many different tasks and datasets'''<br />
<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper presents consistency uses two perspectives, one is qualitative consistency and another one is quantitative consistency.<br />
<br />
(1) '''Qualitative Consistency'''<br />
It is observed that the results are consistent from different points of view. Among multiple architecture dataset, it is observed that the clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Among different dataset with the network that has similar clustering coefficient and average path length, the results are correlated, The paper mentioned that ResNet-34 is much more complex than 5-layer MLP but a fixed set relational graph would perform similarly in both settings, with Pearson correlation of <math>0.658</math>, the p-value for the Null hypothesis is less than <math>10^{-8}</math>.<br />
<br />
<br />
'''Top architectures can be identified efficiently'''<br />
<br />
The computation cost of finding top architectures can be significantly reduced without training the entire data set for a large value of epoch or a relatively large sample. To achieve the top architectures, the number of graphs and training epochs need to be identified. For the number of graphs, a heatmap is a great tool to demonstrate the result. In the 5-layer MLP on CIFAR-10 example, taking a sample of the data around 52 graphs would have a correlation of 0.9, which indicates that fewer samples are needed for a similar analysis in practice. When determining the number of epochs, correlation can help to show the result. In ResNet34 on ImageNet example, the correlation between the variables is already high enough for future computation within 3 epochs. This means that good relational graphs perform well even at the<br />
initial training epochs.<br />
<br />
<br />
'''Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks'''<br />
<br />
The way we define relational graphs and average length in the graph is similar to the way information is exchanged in network science. The biological neural network also has a similar relational graph representation and graph measure with the best-performing relational graph.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Conclusions=<br />
Our works provided a new viewpoint about combining the fields of graph neural networks (GNNs) and general architecture design by establishing graph structures. The authors believe that the model helps to capture the diverse neural network architectures under a unified framework. In particular, GNNs are instances of general neural architectures where graph structures are seen as input rather than part of the architecture and message functions are shared across all edges of the graph. We found that the graph theories and techniques implemented in other disciplines have the capability to provide a good reference for understanding and designing the structures and functions of neural networks. This could provide effective help and inspiration for the study of neural networks.<br />
<br />
= Critique =<br />
<br />
1. The experiment is only measuring on a single data set which might not be representative enough. As we can see in the whole paper, the "sweet spot" we talked about might be a special feature for the given data set only which is the CIFAR-10 data set. If we change the data set to another imaging data set like CK+, whether we are going to get a similar result is not shown by the paper. Hence, the result that is being concluded from the paper might not be representative enough. <br />
<br />
2. When we are fitting the model in practice, we will fit the model with more than one epoch. The order of the model fitting should be randomized since we should create more random jumps to avoid staying inside a local minimum. With the same order within each epoch, the data might be grouped by different classes or levels, the model might result in a better performance with certain classes and worse performance with other classes. In this particular example, without randomization of the training data, the conclusion might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is a theoretical justification for why these graph properties may (or may not) contribute to the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLP's?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What would be considered a "best graph structure " in the discussion and conclusion part? It seems that the intermediate result of getting an accurate result was by binning graphs into smaller bins, what should we do if the graphs can not be binned into significantly smaller bins in order to proceed with the methodologies mentioned in the paper. Both CIFAR - 10 and ImageNet seem too trivial considering the amount of variation and categories in the dataset. What would the generalizability be to other presentations of images?<br />
<br />
6. There is an interesting insight that the idea of the relational graph is similar to applying causal graphs in neuro networks, which is also closer to biology and neuroscience because human beings learning things based on causality. This new approach may lead to higher prediction accuracy but it needs more assumptions, such as correct relations and causalities.<br />
<br />
7. This is an interesting topic that uses the knowledge in graph theory to introduce this new structure of Neural Networks. Using more data sets to discuss this approach might be more interesting, such as the MNIST dataset. We think it is interesting to discuss whether this structure will provide a better performance compare to the "traditional" structure of NN in any type of Neural Networks.<br />
<br />
8. The key idea is to treat a neural network as a relational graph and investigate the messages exchange over the graphs. It will be interesting to discuss how much improvement a sweet spot can bring to the network.<br />
<br />
9. It is interesting to see how figure "b" in the plots showing the experiment results has a U-shape. This shows a correlation between message exchange efficiency and capability of learning distributed representations, giving an idea about the tradeoff between them, as indicated in the paper.<br />
<br />
10. From Results and Discussion part, we can find learning at a very abstract level of artificial neurons and biological neurons is similar. But the learning method is different. Artificial neural networks use gradient descent to minimize the loss function and reach the global minimum. Biological neural networks have another learning method.<br />
<br />
11. It would be interesting to see if the insights from this paper can be used to create new kinds of neural network architectures which were previously unknown, or to create a method of deciding which neural network to use for a given situation or data set. It may be interesting to compare the performance of neural networks by a property of their graph (like average path length as described in this paper) versus a property of the data (such as noisiness).<br />
<br />
12. It would be attractive if the paper dig more into the “sweet pot” and give us a brief introduction of it rather than just slightly mention it. It would be a little bit confused if this technical term shows up and without any previous mention or introduction<br />
<br />
13. The paper shows graphical results how well performing graphs cluster into ‘sweet spots’. And the ‘sweet spot’ leads to significant performance improvement of the graph. It would be interesting if the author could provide more justification for the experimental setup chosen and also explore if known graph theorems have any relation to this improved performance.<br />
<br />
14. GNN is really interesting, one application I can think of is it can be used for recommending system like Linkedin and Facebook. Since people who knows each other is really easy represented as graph.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=48294User:J46hou2020-11-30T04:39:18Z<p>D26ma: /* Critiques/Insights */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this paper, the “one-class” classification, whose goal is to obtain accurate discriminators for a special class, has been studied. Popular uses of this technique include anomaly detection, which is widely used for anomaly detection. Anomaly detection is a well-studied area of research that aims to learn a model that accurately describes "normality". It has many applications, such as risk assessment for security purposes in many fields, health, and medical risk. However, the conventional approach of modeling with typical data using a simple function falls short when it comes to complex domains such as vision or speech. Another case where this would be useful is when recognizing “wake-word” while waking up AI systems such as Alexa. <br />
<br />
Deep learning based on anomaly detection methods attempts to learn features automatically but has some limitations. One approach is based on extending the classical data modeling techniques over the learned representations, but in this case, all the points may be mapped to a single point, making the layer look "perfect". The second approach is based on learning the salient geometric structure of data and training the discriminator to predict the applied transformation. The result could be considered anomalous if the discriminator fails to predict the transformation accurately.<br />
<br />
Thus, in this paper, a new approach called Deep Robust One-Class Classification (DROCC) was presented to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low-dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data and exploits the negative examples to learn a Mahalanobis distance.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include one-class SVM (Scholkopf et al., 1999) and Isolation Forest (Liu et al., 2008)[9]. One drawback of these approaches is that they involve careful feature engineering when applied to structured domains like images. The current state of the art methodologies to tackle these kinds of problems are: <br />
<br />
1. Approach based on prediction transformations (Golan & El-Yaniv, 2018; Hendricks et al.,2019a) [1]. This work is based on learning the salient geometric structure of typical data by applying specific transformations to the input data and training the discriminator to predict the applied transformation. This approach has some shortcomings in the sense that it depends heavily on an appropriate domain-specific set of transformations that are in general hard to obtain. <br />
<br />
2. Approach of minimizing a classical one-class loss on the learned final layer representations such as DeepSVDD. (Ruff et al.,2018)[2] This such work has proposed some heuristics to mitigate issues like setting the bias to zero but it is often insufficient in practice. This method suffers from the fundamental drawback of representation collapse, where the learned transformation might map all the points to a single point (like the origin), leading to a degenerate solution and poor discrimination between normal points and the anomalous points.<br />
<br />
3. Approach based on balancing unbalanced training datasets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers: points which are not following a typical distribution. The following image provides a visual representation of an outlier/anomaly. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. They broadly fall into three categories: AD via generative modeling, Deep Once Class SVM, Transformations based methods, and Side-information based AD. While these techniques are well-suited when the input is featured appropriately, they struggle on complex domains like vision and speech, where hand-designing features are difficult.<br />
<br />
'''AD via Generative Modeling:''' involves deep autoencoders and GAN based methods and have been deeply studied. But, this method solves a much harder problem than required and reconstructs the entire input during the decoding step.<br />
<br />
'''Deep One-Class SVM:''' Deep SVDD attempts to learn a neural network which maps data into a hypersphere. Mappings which fall within the hypersphere are considered "normal". It was the first method to introduce deep one-class classification for the purpose of anomaly detection, but is impeded by representation collapse.<br />
<br />
'''Transformations based methods:''' Are more recent methods that are based on self-supervised training. The training process of these methods applies transformations to the regular points and training the classifier to identify the transformations used. The model relies on the assumption that a point is normal iff the transformations applied to the point can be identified. Some proposed transformations are as simple as rotations and flips, or can be handcrafted and much more complicated. The various transformations that have been proposed are heavily domain dependent and are hard to design.<br />
<br />
'''Side-information based AD:''' incorporate labelled anomalous data or out-of-distribution samples. DROCC makes no assumptions regarding access to side-information.<br />
<br />
Another related problem is the one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that would not misfire close negatives so that the false positive rate will be low. <br />
<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">'''Figure 1'''</div><br />
<br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. We observe from here that DROCC is able to capture the manifold accurately; whereas the classical methods, OC-SVM and DeepSVDD perform poorly as they both try to learn a minimum enclosing ball for the whole set of positive data points. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
<br />
The algorithm that was used to train the model is laid out below in pseudocode.<br />
<center><br />
[[File:DROCCtrain.png]]<br />
</center><br />
<br />
For a DNN <math>f_\theta: \mathbb{R}^d \to \mathbb{R}</math> that is parameterized by a set of parameters <math>\theta</math>, DROCC estimates <math>\theta^{dr} = \min_\theta\ell^{dr}(\theta)</math> where <br />
$$\ell^{dr}(\theta) = \lambda\|\theta\|^2 + \sum_{i=1}^n[\ell(f_\theta(x_i),1)+\mu\max_{\tilde{x}_i \in N_i(r)}\ell(f_\theta(\tilde{x}_i),-1)]$$<br />
Here, <math>N_i(r) = \{\|\tilde{x}_i-x_i\|_2\leq\gamma\cdot r; r \leq \|\tilde{x}_i - x_j\|, \forall j=1,2,...n\}</math> contains all the points that are at least distance <math>r</math> from the training points. The <math>\gamma \geq 1</math> is a regularization term, and <math>\ell:\mathbb{R} \times \mathbb{R} \to \mathbb{R}</math> is a loss function. The <math>x_i</math> are normal points that should be classified as positive and the <math>\tilde{x}_i</math> are anomalous points that should be classified as negative. This formulation is a saddle point problem.<br />
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], DROCC–LF, an outlier-exposure style extension of DROCC was proposed. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">'''Figure 2: AUC result'''</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in table 1. DROCC outperforms baselines on most classes, with gains as high as 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">'''Figure 3: F1-Score'''</div><br />
<br />
Figure 3 shows F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets from the UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07 which is about an 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">'''Figure 4: Sample positives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).'''</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on the MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During the evaluation, in addition to samples from training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of 54% only at a fixed FPR of 3%. DROCC–OE gives a recall value of 98:16% outperforming DeepSAD by a margin of 7%, which gives a recall value of 90:91%. DROCC–LF provides further improvement with a recall of 99:4% at 3% FPR. <br />
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">'''Figure 5: OCLN on Audio Commands.'''</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin” from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. However, for evaluation, we sample positives from the train distribution, but negatives contain a few challenging OOD points as well. Sampling challenging negatives itself is a hard task and is the key motivating reason for studying the problem. So, we manually list close-by keywords to Marvin such as Mar, Vin, Marvelous, etc. We then generate audio snippets for these keywords via a speech synthesis tool 2 with a variety of accents.<br />
Figure 5 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword: Seven, and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced DROCC method for deep anomaly detection. It models normal data points using a low-dimensional sub-manifold inside the feature space, and the anomalous points are characterized via their Euclidean distance from the sub-manifold. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to OCLN problem where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines, in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., M¨uller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Sch¨olkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https: //arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 shows a good representation on how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where there are valid points that lie outside of the line, especially new data that the model has never see before. An explanation on how this is avoided would be good.<br />
<br />
3.In the introduction part, it should first explain what is "one class", and then make a detailed application. Moreover, special definition words are used in many places in the text. No detailed explanation was given. In the end, the future application fields of DROCC and the research direction of the group can be explained.<br />
<br />
4. It will also be interesting to see if one change from using <math>\ell_{2}</math> Euclidean distance to other distances. When the low-dimensional manifold is highly non-linear, using the local linear distance to characterize anomalous points might fail.<br />
<br />
5. This is a nice summary and the authors introduce clearly on the performance of DROCC. It is nice to use Alexa as an example to catch readers' attention. I think it will be nice to include the algorithm of the DROCC or the architecture of DROCC in this summary to help us know the whole view of this method. Maybe it will be interesting to apply DROCC in biomedical studies? since one-class classification is often used in biomedical studies.<br />
<br />
6. The training method resembles adversarial learning with gradient ascent, however, there is no evaluation of this method on adversarial examples. This is quite unusual considering the paper proposed a method for robust one-class classification, and can be a security threat in real life in critical applications.<br />
<br />
7. The underlying idea behind OCLN is very similar to how neural networks are implemented in recommender systems and trained over positive/negative triplet models. In that case as well, due to the nature of implicit and explicit feedback, positive data tends to dominate the system. It would be interesting to see if insights from that area could be used to further boost the model presented in this paper.<br />
<br />
8. The paper shows the performance of DROCC being evaluated for time series data. It is interesting to see high AUC scores for DROCC against baselines like nearest neighbours and REBMs.Because detecting abnormal data in time series datasets is not common to practice.<br />
<br />
9. Figure1 presented results on a simple 2-D sine wave dataset to visualize the kind of classifiers learnt by DROCC. And the 1a is the positive data lies on a 1-D manifold. We can see from 1b that DROCC is able to capture the manifold accurately.<br />
<br />
10. In the MNIST 0 vs. 1 Classification dataset, why is 1 the only digit that is considered an anomoly? Couldn't all of the non-0 digits be left in the dataset to serve as "anomolies"?<br />
<br />
11. For future work the authors suggest considering DROCC for a low curvature manifold but do not motivate the benefits of such a direction.<br />
<br />
12. One of the problems is that in this model we might need to map all the points to one point to make the layer looks "perfect". However, this might not be a good choice since each point is distinct and if we map them together to one point, then this point cannot tell everything. If authors can specify more details on this it would be better.<br />
<br />
13. This project introduced DROCC for “one-class” classification. It will be interesting if such kind of classification can be compared with any other classification such as binary classification, etc. If “one-class” classification would be more speedy than the others.<br />
<br />
14. The dimensions and feature values must be so different across datasets in different domains. I would love to see how this algorithm is performing so well applied on different domains as it is mentioned that it could be used on datasets including images, audio, time-series, etc.<br />
<br />
15. It would be interesting to show the performance of DROCC against popular models used for outlier prediction such as PCA, EVA, etc. Perhaps show their accuracy scores so we can better compare.<br />
<br />
16. It would be greater if an visualization of how much performance DROCC improved compare to traditional binary classifier like SVM, isolation Forest.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=48289Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-30T04:32:47Z<p>D26ma: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology can help researchers quickly detect genetic information and is widely used to analyze genetic diseases. Researchers use this technology to compare normal and abnormal cancerous tissues to gain insights into the pathology of cancer. <br />
<br />
However, due to the high dimensionality of the gene expressions, the accuracy, and computation time of the model might be affected. The following two approaches are adopted to cope with this problem: the feature selection method and the feature creating method. The formal one, similar to the well-known principal component analysis method, aims to focus on key features and ignore minor noises. On the other hand, the latter one, similar to scale-invariant feature transformations, aims to combine existing features or map them to a new low-dimensional space.<br />
<br />
Compared to neural network models presented by others, this paper is specifically designed for predicting cancer using gene expression data. Thus, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions.<br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function such as sigmoid or rectified linear activation functions. To train the network, the inputs are fed forward and the activation function value is calculated at every neuron. The difference between the output of the neural network and the actual value is what we call an error.<br />
The backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and a different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
Cancer prediction models often contain more than 1 method to achieve high prediction accuracy with a more accurate prognosis and it also aims to reduce the cost of patients.<br />
<br />
High dimensionality and spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
<br />
The first is called the ROC curve. ROC curves, receiver operating characteristic curves, are graphs that show the true positive rate against the false-positive rate [4]. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is considered acceptable when its ROC is greater than 0.7. <br />
<br />
A different machine learning problem is predicting the survival time. The performance of a model that predicts survival time can be measured using two metrics. The problem of predicting survival time can be seen as a ranking problem, where survival times of different subjects are ranked against each other. CI (Concordance Index) is a measure of how good a model ranks survival times and explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. We can express the ordering of survival times in an order graph <math display="inline">G = (V, E)</math> where the vertices <math display="inline">V</math> are the individual survival times, and the edges <math display="inline">E_{ij}</math> from individual <math display="inline">i</math> to <math display="inline">j</math> indicate that <math display="inline">T_i < T_j</math>, where <math display="inline">T_i, T_j</math> are the survival times for individuals <math display="inline">i,j</math> respectively. Then we can write <math display="inline">CI = \frac{1}{|E|}\sum_{i,j}I(f(i) < f(j))</math> where <math display="inline">|E|</math> is the number of edges in the order graph (ie. the total number of comparable pairs), and <math display="inline">I(f(i) < f(j)) = 1</math> if the predictor <math display="inline">f</math> correctly ranks <math display="inline">T_i < T_j</math> [3].<br />
<br />
Another metric is the Brier score, which measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy. It is defined as <math display="inline">\frac{1}{n}\sum_{i=1}^n(f_i - o_i)^2</math> where <math display="inline">f_i</math> is the predicted survival rate, and <math display="inline">o_i</math> is the observed survival rate [2].<br />
<br />
== Neural network-based cancer prediction models ==<br />
An extensive search relevant to neural network-based cancer prediction was performed using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, and only articles between 2013 and 2018 with available accessibility were considered. The chosen papers covered cancer classification, discovery, survivability prediction, and statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive, and clustering for chosen papers. <br />
<br />
[[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p-value > <math>10^{-5}</math> to remove some unwanted technical variation and <math>\log_2</math> transformations. Statistical methods, neural networks, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the dataset's features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcome the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to a sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups, and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that neural network methods are used for filtering, predicting, and clustering in cancer prediction. <br />
<br />
*''filtering'': Filter the gene expressions to eliminate noise or reduce dimensionality. Then use the resulted features with statistical methods or machine learning classification and clustering tools as figure 2 indicates.<br />
<br />
*''predicting'': Extract features and improve the accuracy of prediction (classification).<br />
<br />
*''clustering'': Divide the gene expressions or samples based on similarity.<br />
<br />
[[File:filtering gane.png]]<br />
<br />
All the neurons in the neural network work together as feature detectors to learn the features from the input. In order to categorize a neural network as filtering, predicting, or clustering method, we looked at the overall role that network provided within the framework of cancer prediction. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithms such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy; meanwhile, clustering methods are unsupervised, which group similar samples or genes together. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group for a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering, and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
<br />
$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$<br />
<br />
$$ Logloss = \sum{(I(x)\log(O(x)) + (1 - I(x))\log(1 - O(x)))} $$<br />
<br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders, and variational autoencoders. The architecture of the networks varies in many parameters, such as depth and loss function. Each example of an autoencoder mentioned above has a different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient descent optimization, Adam optimizer).<br />
<br />
Overfitting is a major problem that most autoencoders need to deal with to achieve high efficiency of the extracted features. Regularization, dropout, and sparsity are common solutions.<br />
<br />
The neural network filtering methods were used by different statistical methods and classifiers. The conventional methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoost or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
<br />
The codeword is a binary string <math>C'k</math> of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to map the input to the codeword iteratively, whose objective function is minimized in each iteration.<br />
<br />
Such cancer classifiers were applied to identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as a classifier is better than those of SVM, logistic regression, naïve Bayes, classification trees, and KNN.<br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without a backpropagation mechanism, is one of the traditional model-based techniques to be applied to gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment or both. The high dimensionality of gene expression samples poses a problem for traditional clustering algorithms such as k-means clustering, which uses a distance function to separate samples. Such an approach is not viable for high dimensional datasets. To solve the high dimensionality problem, there are two methods: clustering ensembles by running a single clustering algorithm several times, each of which has different initialization or number of parameters; and projective clustering by only considering a subset of the original features.<br />
<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were not easy to be obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken the gene-to-cluster assignment into consideration.<br />
<br />
Also, the paper provides a double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a high mortality rate that kills millions of people every year, and it’s essential to analyze gene expression for discovering gene abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce the dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures in comparison to shallow architectures for best practice, as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In the binary case, the network architecture has only one or two output neurons that diagnose a given sample as cancerous or non-cancerous. In comparison, the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model proved efficient capability in predicting cancer subtypes, as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the assembling, clustering and projective clustering is more accurate than using single-point clustering algorithms, such as SOM, since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. (4) Augmentation of the dataset to produce more "observations".<br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: In Braga-Neto and Dougherty's research, they have investigated several model evaluation methods: cross-validation, substitution and bootstrap methods. The cross-validation was found to be unreliable for small size data since it displayed excessive variance. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms, data and methodology. Hence, the query used for getting the dataset should be stated.<br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. The authors showed that Neural Network filtering methods are a way of reducing the dimensionality of the gene expressions, as well as removing their noise for better model fitting. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches. One of the biggest challenges for cancer prediction modelers is deciding on the network architecture (i.e. the number of hidden layers and neurons), as there are currently no guidelines to follow to obtain high prediction accuracy. The authors discovered that there is no algorithm available to concretely determine an optimal number of hidden layers or nodes and found that many papers simply implemented a trial and error method to reduce loss in the model.<br />
<br />
==Critiques==<br />
<br />
While results indicate that the functionality of the neural network determines its general architecture, the decision on the number of hidden layers, neurons, hypermeters, and learning algorithms is made using trial-and-error techniques. Therefore improvements in this area of the model might need to be explored in order to obtain better results and in order to make more convincing statements.<br />
<br />
An issue that one must be mindful of is the underlying distribution of data. Cancer is an extremely complex genetic disease and the predictions would depend on so many variables, a number of which will not even be present in the dataset as they might have been collected. So there is a need for extensive validation when it comes to applying deep learning methods to cancer-related data.<br />
<br />
In the field of medical sciences and molecular biology, interpretability of results is imperative as often experts seek not just to solve the issue at hand but to understand the causal relationships. Having a high ROC value may not necessarily convince other experts on the validity of the finding because the underlying details of cancer symptoms have been abstracted in a complex neural network as a black box. However, the neural network clustering method suggested in this paper does offer a good compromise because it enables humans to visual low-level features but still gives experts the control on making various predictions using well-studied traditional techniques.<br />
<br />
With high dimensionality features, kernel SVM is another option for cancer prediction. Jiang et. al. developed a Hadamard Kernel for predicting breast cancer using gene expression data, and it utilizes the Kernel trick to avoid high computational efforts (link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5763304/). Compared against linear, quadratic, RBF and correlation kernels, Hadamard Kernel performs best with the highest averaged area under the ROC curve (AUC) value. It may be interesting to compare the performance and accuracy between the Hadamard Kernel and cancer prediction models with various number of hidden layers and neurons.<br />
<br />
Although the authors presented technical details about data processing, training approaches evaluation metric, and addressed many practical issues that can be considered for cancer prediction, no novel methods or models are proposed. It's more like a proof-of-concept about the feasibility of different models on cancer prediction.<br />
<br />
As the authors indicate neural networks would be a useful tool for cancer prediction models, the article is lacking an example for implementing neural networks to provide persuasive support for their arguments.<br />
<br />
The inheritance of cancer is complex and changeable. The predicted variables are therefore very complicated, so for the model of the learning data set, a more adequate training set is needed to learn. And multi-party verification of the learned model.<br />
<br />
The authors mentioned many different neural network models and compared them. It would be better if more details of a commonly applied model with relatively high accuracy could be given such as how the model is built. An article named Convolutional neural network models for cancer type prediction based on gene expression gives explanations of CNN in detail.<br />
<br />
The authors briefly discussed methods and algorithms being used in the presented paper in their summary. However, a very little amount of technical details were provided to the readers. The summary itself is lacking specific examples for the aforementioned algorithms, and datasets which were used in the original analysis were only introduced in one or two sentences. As a result, the summary and conclusion appear to be unconvincing to the readers.<br />
<br />
PCA can still be used as an initial preprocessing step even if it is used in a neural network whose data dimension is reduced. By merging the PCA component with a random number of original functions, some good techniques can be adopted to enable the network to capture more useful relationships<br />
<br />
The key part of this model is to extract the features from the model. However, cancer may depends on many explanatory variables. Thus how do we know which feature should we extract in the data preprocessing. Since there are correlations between each variables. Authors did not specify this situation.<br />
<br />
From the biology side of view, gene expression is really complicated. Thus reducing dimension may or may not be the best way of predicting cancer, and this should be a controversial topic.<br />
<br />
==Reference==<br />
[1] Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.<br />
<br />
[2] Brier GW. 1950. Verification of forecasts expressed in terms of probabilities. Monthly Weather Review 78: 1–3<br />
<br />
[3] Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar. On ranking in survival analysis: Bounds on the concordance index. In Advances in Neural Information Processing Systems (2008), pp. 1209-1216<br />
<br />
[4] Google Developers. (2020 February 10). ''Classification: ROC Curve and AUC.'' Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc<br />
<br />
[5] Mostavi, M., Chiu, YC., Huang, Y. et al. Convolutional neural network models for cancer type prediction based on gene expression. ''BMC Med Genomics'' '''13''', 44 (2020). https://doi.org/10.1186/s12920-020-0677-2</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=48283Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-30T04:27:01Z<p>D26ma: /* Neural network-based cancer prediction models */</p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology can help researchers quickly detect genetic information and is widely used to analyze genetic diseases. Researchers use this technology to compare normal and abnormal cancerous tissues to gain insights into the pathology of cancer. <br />
<br />
However, due to the high dimensionality of the gene expressions, the accuracy, and computation time of the model might be affected. The following two approaches are adopted to cope with this problem: the feature selection method and the feature creating method. The formal one, similar to the well-known principal component analysis method, aims to focus on key features and ignore minor noises. On the other hand, the latter one, similar to scale-invariant feature transformations, aims to combine existing features or map them to a new low-dimensional space.<br />
<br />
Compared to neural network models presented by others, this paper is specifically designed for predicting cancer using gene expression data. Thus, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions.<br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve non-linear complex problems. It is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is related to an activation function such as sigmoid or rectified linear activation functions. To train the network, the inputs are fed forward and the activation function value is calculated at every neuron. The difference between the output of the neural network and the actual value is what we call an error.<br />
The backpropagation mechanism is one of the most commonly used algorithms in solving neural network problems. By using this algorithm, we optimize the objective function by propagating back the generated error through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm but with different network architectures and a different number of neurons to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
Cancer prediction models often contain more than 1 method to achieve high prediction accuracy with a more accurate prognosis and it also aims to reduce the cost of patients.<br />
<br />
High dimensionality and spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
<br />
The first is called the ROC curve. ROC curves, receiver operating characteristic curves, are graphs that show the true positive rate against the false-positive rate [4]. It reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it with the confidence interval. Usually, a model is considered acceptable when its ROC is greater than 0.7. <br />
<br />
A different machine learning problem is predicting the survival time. The performance of a model that predicts survival time can be measured using two metrics. The problem of predicting survival time can be seen as a ranking problem, where survival times of different subjects are ranked against each other. CI (Concordance Index) is a measure of how good a model ranks survival times and explains the concordance probability of the predicted and observed survival. The closer its value to 0.7, the better the model is. We can express the ordering of survival times in an order graph <math display="inline">G = (V, E)</math> where the vertices <math display="inline">V</math> are the individual survival times, and the edges <math display="inline">E_{ij}</math> from individual <math display="inline">i</math> to <math display="inline">j</math> indicate that <math display="inline">T_i < T_j</math>, where <math display="inline">T_i, T_j</math> are the survival times for individuals <math display="inline">i,j</math> respectively. Then we can write <math display="inline">CI = \frac{1}{|E|}\sum_{i,j}I(f(i) < f(j))</math> where <math display="inline">|E|</math> is the number of edges in the order graph (ie. the total number of comparable pairs), and <math display="inline">I(f(i) < f(j)) = 1</math> if the predictor <math display="inline">f</math> correctly ranks <math display="inline">T_i < T_j</math> [3].<br />
<br />
Another metric is the Brier score, which measures the average difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy. It is defined as <math display="inline">\frac{1}{n}\sum_{i=1}^n(f_i - o_i)^2</math> where <math display="inline">f_i</math> is the predicted survival rate, and <math display="inline">o_i</math> is the observed survival rate [2].<br />
<br />
== Neural network-based cancer prediction models ==<br />
An extensive search relevant to neural network-based cancer prediction was performed using Google scholar and other electronic databases namely PubMed and Scopus with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”, and only articles between 2013 and 2018 with available accessibility were considered. The chosen papers covered cancer classification, discovery, survivability prediction, and statistical analysis models. The following figure 1 shows a graph representing the number of citations including filtering, predictive, and clustering for chosen papers. <br />
<br />
[[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. There are a few techniques used in processing dataset including removing the genes that have zero expression across all samples, Normalization, filtering with p-value > <math>10^{-5}</math> to remove some unwanted technical variation and <math>\log_2</math> transformations. Statistical methods, neural networks, were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principle Component Analysis (PCA) can also be used as an initial preprocessing step to extract the dataset's features. The PCA method linearly transforms the dataset features into lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcome the class imbalance problem. In that case, one research used Synthetic Minority Class Over Sampling method to generate synthetic minority class samples, which may lead to a sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk, low-risk groups, and so on. <br />
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that neural network methods are used for filtering, predicting, and clustering in cancer prediction. <br />
<br />
*''filtering'': Filter the gene expressions to eliminate noise or reduce dimensionality. Then use the resulted features with statistical methods or machine learning classification and clustering tools as figure 2 indicates.<br />
<br />
*''predicting'': Extract features and improve the accuracy of prediction (classification).<br />
<br />
*''clustering'': Divide the gene expressions or samples based on similarity.<br />
<br />
[[File:filtering gane.png]]<br />
<br />
All the neurons in the neural network work together as feature detectors to learn the features from the input. In order to categorize a neural network as filtering, predicting, or clustering method, we looked at the overall role that network provided within the framework of cancer prediction. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction, therefore its objective functions measure how accurately the network is able to predict the class of input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller or equal to N, where N is the dimensionality of the input. Some other machine learning algorithms such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised, which find the best classification accuracy; meanwhile, clustering methods are unsupervised, which group similar samples or genes together. <br />
The goal of training prediction is to enhance the classification capability, and the goal of training classification is to find the optimal group for a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
In the preprocessing step to classification, clustering, and statistical analysis, the autoencoders are more and more commonly-used, to extract generic genomic features. An autoencoder is composed of the encoder part and the decoder part. The encoder part is to learn the mapping between high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder part is to learn the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
<br />
$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$<br />
<br />
$$ Logloss = \sum{(I(x)\log(O(x)) + (1 - I(x))\log(1 - O(x)))} $$<br />
<br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders, and variational autoencoders. The architecture of the networks varies in many parameters, such as depth and loss function. Each example of an autoencoder mentioned above has a different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient descent optimization, Adam optimizer).<br />
<br />
Overfitting is a major problem that most autoencoders need to deal with to achieve high efficiency of the extracted features. Regularization, dropout, and sparsity are common solutions.<br />
<br />
The neural network filtering methods were used by different statistical methods and classifiers. The conventional methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM or AdaBoost or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
<br />
The codeword is a binary string <math>C'k</math> of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to map the input to the codeword iteratively, whose objective function is minimized in each iteration.<br />
<br />
Such cancer classifiers were applied to identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and GA-ANN model are all used for solving cancer issues mentioned above. This paper indicates that the performance of neural networks with MLP architecture as a classifier is better than those of SVM, logistic regression, naïve Bayes, classification trees, and KNN.<br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and without a backpropagation mechanism, is one of the traditional model-based techniques to be applied to gene expression data. The measurement of its accuracy could be Rand Index (RI), which can be improved to Adjusted Random Index (ARI) and Normalized Mutation Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
<br />
In general, gene expression clustering considers either the relevance of samples-to-cluster assignment or that of gene-to-cluster assignment or both. The high dimensionality of gene expression samples poses a problem for traditional clustering algorithms such as k-means clustering, which uses a distance function to separate samples. Such an approach is not viable for high dimensional datasets. To solve the high dimensionality problem, there are two methods: clustering ensembles by running a single clustering algorithm several times, each of which has different initialization or number of parameters; and projective clustering by only considering a subset of the original features.<br />
<br />
SOM was applied on discriminating future tumor behavior using molecular alterations, whose results were not easy to be obtained by classic statistical models. Then this paper introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they have not taken the gene-to-cluster assignment into consideration.<br />
<br />
Also, the paper provides a double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a high mortality rate that kills millions of people every year, and it’s essential to analyze gene expression for discovering gene abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce the dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures in comparison to shallow architectures for best practice, as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In the binary case, the network architecture has only one or two output neurons that diagnose a given sample as cancerous or non-cancerous. In comparison, the number of the output neurons in multi-class problems is equal to the number of classes. The authors suggested that the deep architecture with convolution layers which was the most recently used model proved efficient capability in predicting cancer subtypes, as it captures the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the assembling, clustering and projective clustering is more accurate than using single-point clustering algorithms, such as SOM, since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, it would be likely to properly fits the training data but not accurate for test samples due to the lack of generalization capability. The ways to avoid overfitting can be: (1). adding weight penalties using regularization; (2). using the average predictions from many models trained on different datasets; (3). dropout. (4) Augmentation of the dataset to produce more "observations".<br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: In Braga-Neto and Dougherty's research, they have investigated several model evaluation methods: cross-validation, substitution and bootstrap methods. The cross-validation was found to be unreliable for small size data since it displayed excessive variance. The bootstrap method proved more accurate predictability.<br><br />
<br />
4. Study producibility: A study needs to be reproducible to enhance research reliability so that others can replicate the results using the same algorithms, data and methodology. Hence, the query used for getting the dataset should be stated.<br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. The authors showed that Neural Network filtering methods are a way of reducing the dimensionality of the gene expressions, as well as removing their noise for better model fitting. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches. One of the biggest challenges for cancer prediction modelers is deciding on the network architecture (i.e. the number of hidden layers and neurons), as there are currently no guidelines to follow to obtain high prediction accuracy. The authors discovered that there is no algorithm available to concretely determine an optimal number of hidden layers or nodes and found that many papers simply implemented a trial and error method to reduce loss in the model.<br />
<br />
==Critiques==<br />
<br />
While results indicate that the functionality of the neural network determines its general architecture, the decision on the number of hidden layers, neurons, hypermeters, and learning algorithms is made using trial-and-error techniques. Therefore improvements in this area of the model might need to be explored in order to obtain better results and in order to make more convincing statements.<br />
<br />
An issue that one must be mindful of is the underlying distribution of data. Cancer is an extremely complex genetic disease and the predictions would depend on so many variables, a number of which will not even be present in the dataset as they might have been collected. So there is a need for extensive validation when it comes to applying deep learning methods to cancer-related data.<br />
<br />
In the field of medical sciences and molecular biology, interpretability of results is imperative as often experts seek not just to solve the issue at hand but to understand the causal relationships. Having a high ROC value may not necessarily convince other experts on the validity of the finding because the underlying details of cancer symptoms have been abstracted in a complex neural network as a black box. However, the neural network clustering method suggested in this paper does offer a good compromise because it enables humans to visual low-level features but still gives experts the control on making various predictions using well-studied traditional techniques.<br />
<br />
With high dimensionality features, kernel SVM is another option for cancer prediction. Jiang et. al. developed a Hadamard Kernel for predicting breast cancer using gene expression data, and it utilizes the Kernel trick to avoid high computational efforts (link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5763304/). Compared against linear, quadratic, RBF and correlation kernels, Hadamard Kernel performs best with the highest averaged area under the ROC curve (AUC) value. It may be interesting to compare the performance and accuracy between the Hadamard Kernel and cancer prediction models with various number of hidden layers and neurons.<br />
<br />
Although the authors presented technical details about data processing, training approaches evaluation metric, and addressed many practical issues that can be considered for cancer prediction, no novel methods or models are proposed. It's more like a proof-of-concept about the feasibility of different models on cancer prediction.<br />
<br />
As the authors indicate neural networks would be a useful tool for cancer prediction models, the article is lacking an example for implementing neural networks to provide persuasive support for their arguments.<br />
<br />
The inheritance of cancer is complex and changeable. The predicted variables are therefore very complicated, so for the model of the learning data set, a more adequate training set is needed to learn. And multi-party verification of the learned model.<br />
<br />
The authors mentioned many different neural network models and compared them. It would be better if more details of a commonly applied model with relatively high accuracy could be given such as how the model is built. An article named Convolutional neural network models for cancer type prediction based on gene expression gives explanations of CNN in detail.<br />
<br />
The authors briefly discussed methods and algorithms being used in the presented paper in their summary. However, a very little amount of technical details were provided to the readers. The summary itself is lacking specific examples for the aforementioned algorithms, and datasets which were used in the original analysis were only introduced in one or two sentences. As a result, the summary and conclusion appear to be unconvincing to the readers.<br />
<br />
PCA can still be used as an initial preprocessing step even if it is used in a neural network whose data dimension is reduced. By merging the PCA component with a random number of original functions, some good techniques can be adopted to enable the network to capture more useful relationships<br />
<br />
The key part of this model is to extract the features from the model. However, cancer may depends on many explanatory variables. Thus how do we know which feature should we extract in the data preprocessing. Since there are correlations between each variables. Authors did not specify this situation.<br />
<br />
==Reference==<br />
[1] Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.<br />
<br />
[2] Brier GW. 1950. Verification of forecasts expressed in terms of probabilities. Monthly Weather Review 78: 1–3<br />
<br />
[3] Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar. On ranking in survival analysis: Bounds on the concordance index. In Advances in Neural Information Processing Systems (2008), pp. 1209-1216<br />
<br />
[4] Google Developers. (2020 February 10). ''Classification: ROC Curve and AUC.'' Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc<br />
<br />
[5] Mostavi, M., Chiu, YC., Huang, Y. et al. Convolutional neural network models for cancer type prediction based on gene expression. ''BMC Med Genomics'' '''13''', 44 (2020). https://doi.org/10.1186/s12920-020-0677-2</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=48279User:J46hou2020-11-30T04:24:19Z<p>D26ma: /* Introduction */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this paper, the “one-class” classification, whose goal is to obtain accurate discriminators for a special class, has been studied. Popular uses of this technique include anomaly detection, which is widely used for anomaly detection. Anomaly detection is a well-studied area of research that aims to learn a model that accurately describes "normality". It has many applications, such as risk assessment for security purposes in many fields, health, and medical risk. However, the conventional approach of modeling with typical data using a simple function falls short when it comes to complex domains such as vision or speech. Another case where this would be useful is when recognizing “wake-word” while waking up AI systems such as Alexa. <br />
<br />
Deep learning based on anomaly detection methods attempts to learn features automatically but has some limitations. One approach is based on extending the classical data modeling techniques over the learned representations, but in this case, all the points may be mapped to a single point, making the layer look "perfect". The second approach is based on learning the salient geometric structure of data and training the discriminator to predict the applied transformation. The result could be considered anomalous if the discriminator fails to predict the transformation accurately.<br />
<br />
Thus, in this paper, a new approach called Deep Robust One-Class Classification (DROCC) was presented to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low-dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data and exploits the negative examples to learn a Mahalanobis distance.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include one-class SVM (Scholkopf et al., 1999) and Isolation Forest (Liu et al., 2008)[9]. One drawback of these approaches is that they involve careful feature engineering when applied to structured domains like images. The current state of the art methodologies to tackle these kinds of problems are: <br />
<br />
1. Approach based on prediction transformations (Golan & El-Yaniv, 2018; Hendricks et al.,2019a) [1]. This work is based on learning the salient geometric structure of typical data by applying specific transformations to the input data and training the discriminator to predict the applied transformation. This approach has some shortcomings in the sense that it depends heavily on an appropriate domain-specific set of transformations that are in general hard to obtain. <br />
<br />
2. Approach of minimizing a classical one-class loss on the learned final layer representations such as DeepSVDD. (Ruff et al.,2018)[2] This such work has proposed some heuristics to mitigate issues like setting the bias to zero but it is often insufficient in practice. This method suffers from the fundamental drawback of representation collapse, where the learned transformation might map all the points to a single point (like the origin), leading to a degenerate solution and poor discrimination between normal points and the anomalous points.<br />
<br />
3. Approach based on balancing unbalanced training datasets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers: points which are not following a typical distribution. The following image provides a visual representation of an outlier/anomaly. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. They broadly fall into three categories: AD via generative modeling, Deep Once Class SVM, Transformations based methods, and Side-information based AD. While these techniques are well-suited when the input is featured appropriately, they struggle on complex domains like vision and speech, where hand-designing features are difficult.<br />
<br />
'''AD via Generative Modeling:''' involves deep autoencoders and GAN based methods and have been deeply studied. But, this method solves a much harder problem than required and reconstructs the entire input during the decoding step.<br />
<br />
'''Deep One-Class SVM:''' Deep SVDD attempts to learn a neural network which maps data into a hypersphere. Mappings which fall within the hypersphere are considered "normal". It was the first method to introduce deep one-class classification for the purpose of anomaly detection, but is impeded by representation collapse.<br />
<br />
'''Transformations based methods:''' Are more recent methods that are based on self-supervised training. The training process of these methods applies transformations to the regular points and training the classifier to identify the transformations used. The model relies on the assumption that a point is normal iff the transformations applied to the point can be identified. Some proposed transformations are as simple as rotations and flips, or can be handcrafted and much more complicated. The various transformations that have been proposed are heavily domain dependent and are hard to design.<br />
<br />
'''Side-information based AD:''' incorporate labelled anomalous data or out-of-distribution samples. DROCC makes no assumptions regarding access to side-information.<br />
<br />
Another related problem is the one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that would not misfire close negatives so that the false positive rate will be low. <br />
<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">'''Figure 1'''</div><br />
<br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. We observe from here that DROCC is able to capture the manifold accurately; whereas the classical methods, OC-SVM and DeepSVDD perform poorly as they both try to learn a minimum enclosing ball for the whole set of positive data points. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
<br />
The algorithm that was used to train the model is laid out below in pseudocode.<br />
<center><br />
[[File:DROCCtrain.png]]<br />
</center><br />
<br />
For a DNN <math>f_\theta: \mathbb{R}^d \to \mathbb{R}</math> that is parameterized by a set of parameters <math>\theta</math>, DROCC estimates <math>\theta^{dr} = \min_\theta\ell^{dr}(\theta)</math> where <br />
$$\ell^{dr}(\theta) = \lambda\|\theta\|^2 + \sum_{i=1}^n[\ell(f_\theta(x_i),1)+\mu\max_{\tilde{x}_i \in N_i(r)}\ell(f_\theta(\tilde{x}_i),-1)]$$<br />
Here, <math>N_i(r) = \{\|\tilde{x}_i-x_i\|_2\leq\gamma\cdot r; r \leq \|\tilde{x}_i - x_j\|, \forall j=1,2,...n\}</math> contains all the points that are at least distance <math>r</math> from the training points. The <math>\gamma \geq 1</math> is a regularization term, and <math>\ell:\mathbb{R} \times \mathbb{R} \to \mathbb{R}</math> is a loss function. The <math>x_i</math> are normal points that should be classified as positive and the <math>\tilde{x}_i</math> are anomalous points that should be classified as negative. This formulation is a saddle point problem.<br />
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], DROCC–LF, an outlier-exposure style extension of DROCC was proposed. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">'''Figure 2: AUC result'''</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in table 1. DROCC outperforms baselines on most classes, with gains as high as 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">'''Figure 3: F1-Score'''</div><br />
<br />
Figure 3 shows F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets from the UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07 which is about an 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">'''Figure 4: Sample positives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).'''</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on the MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During the evaluation, in addition to samples from training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 2). BCE performs poorly, with a recall of 54% only at a fixed FPR of 3%. DROCC–OE gives a recall value of 98:16% outperforming DeepSAD by a margin of 7%, which gives a recall value of 90:91%. DROCC–LF provides further improvement with a recall of 99:4% at 3% FPR. <br />
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">'''Figure 5: OCLN on Audio Commands.'''</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin” from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. However, for evaluation, we sample positives from the train distribution, but negatives contain a few challenging OOD points as well. Sampling challenging negatives itself is a hard task and is the key motivating reason for studying the problem. So, we manually list close-by keywords to Marvin such as Mar, Vin, Marvelous, etc. We then generate audio snippets for these keywords via a speech synthesis tool 2 with a variety of accents.<br />
Figure 5 shows that for 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR=3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword: Seven, and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced DROCC method for deep anomaly detection. It models normal data points using a low-dimensional sub-manifold inside the feature space, and the anomalous points are characterized via their Euclidean distance from the sub-manifold. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to OCLN problem where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines, in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., M¨uller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Sch¨olkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https: //arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 shows a good representation on how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where there are valid points that lie outside of the line, especially new data that the model has never see before. An explanation on how this is avoided would be good.<br />
<br />
3.In the introduction part, it should first explain what is "one class", and then make a detailed application. Moreover, special definition words are used in many places in the text. No detailed explanation was given. In the end, the future application fields of DROCC and the research direction of the group can be explained.<br />
<br />
4. It will also be interesting to see if one change from using <math>\ell_{2}</math> Euclidean distance to other distances. When the low-dimensional manifold is highly non-linear, using the local linear distance to characterize anomalous points might fail.<br />
<br />
5. This is a nice summary and the authors introduce clearly on the performance of DROCC. It is nice to use Alexa as an example to catch readers' attention. I think it will be nice to include the algorithm of the DROCC or the architecture of DROCC in this summary to help us know the whole view of this method. Maybe it will be interesting to apply DROCC in biomedical studies? since one-class classification is often used in biomedical studies.<br />
<br />
6. The training method resembles adversarial learning with gradient ascent, however, there is no evaluation of this method on adversarial examples. This is quite unusual considering the paper proposed a method for robust one-class classification, and can be a security threat in real life in critical applications.<br />
<br />
7. The underlying idea behind OCLN is very similar to how neural networks are implemented in recommender systems and trained over positive/negative triplet models. In that case as well, due to the nature of implicit and explicit feedback, positive data tends to dominate the system. It would be interesting to see if insights from that area could be used to further boost the model presented in this paper.<br />
<br />
8. The paper shows the performance of DROCC being evaluated for time series data. It is interesting to see high AUC scores for DROCC against baselines like nearest neighbours and REBMs.Because detecting abnormal data in time series datasets is not common to practice.<br />
<br />
9. Figure1 presented results on a simple 2-D sine wave dataset to visualize the kind of classifiers learnt by DROCC. And the 1a is the positive data lies on a 1-D manifold. We can see from 1b that DROCC is able to capture the manifold accurately.<br />
<br />
10. In the MNIST 0 vs. 1 Classification dataset, why is 1 the only digit that is considered an anomoly? Couldn't all of the non-0 digits be left in the dataset to serve as "anomolies"?<br />
<br />
11. For future work the authors suggest considering DROCC for a low curvature manifold but do not motivate the benefits of such a direction.<br />
<br />
12. One of the problems is that in this model we might need to map all the points to one point to make the layer looks "perfect". However, this might not be a good choice since each point is distinct and if we map them together to one point, then this point cannot tell everything. If authors can specify more details on this it would be better.<br />
<br />
13. This project introduced DROCC for “one-class” classification. It will be interesting if such kind of classification can be compared with any other classification such as binary classification, etc. If “one-class” classification would be more speedy than the others.<br />
<br />
14. The dimensions and feature values must be so different across datasets in different domains. I would love to see how this algorithm is performing so well applied on different domains as it is mentioned that it could be used on datasets including images, audio, time-series, etc.<br />
<br />
15. It would be interesting to show the performance of DROCC against popular models used for outlier prediction such as PCA, EVA, etc. Perhaps show their accuracy scores so we can better compare.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence&diff=48095Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence2020-11-30T01:37:02Z<p>D26ma: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Guanting(Tony) Pan, Zaiwei(Zara) Zhang, Haocheng(Beren) Chang<br />
<br />
== Introduction == <br />
With the development of mobile devices and location-acquisition technologies, accessing real-time location information is becoming easier and more efficient. Precisely because of this development, Location-based Social Networks (LBSNs) like Yelp and Foursquare have become an important part of human’s life. People can share their experiences in locations, such as restaurants and parks, on the Internet. These locations can be seen as a Point-of-Interest (POI) in software such as Maps on our phone. These large amounts of user-POI interaction data can provide a service, which is called personalized POI recommendation, to give recommendations to users that the location they might be interested in. These large amounts of data can be used to train a model through Machine Learning methods(i.e., Classification, Clustering, etc.) to predict a POI that users might be interested in. The POI recommendation system still faces some challenging issues: (1) the difficulty of modeling complex user-POI interactions from sparse implicit feedback; (2) the difficulty of incorporating geographic background information. In order to meet these challenges, this paper will introduce a novel autoencoder-based model to learn non-linear user-POI relations, which is called SAE-NAD. SAE stands for the self-attentive encoder, while NAD stands for the neighbor-aware decoder. Autoencoder is an unsupervised learning technique that we implement in the neural network model for representation learning, meaning that our neural network will contain a "bottleneck" layer that produces a compressed knowledge representation of the original input. This method will include machine learning knowledge that we learned in this course.<br />
<br />
== Previous Work == <br />
<br />
In the previous works, the method is just equally treating users checked in POIs. The drawback of equally treating users checked in POIs is that valuable information about the similarity between users is not utilized, thus reducing the power of such recommenders. However, the SAE adaptively differentiates the user preference degrees in multiple aspects.<br />
<br />
There are some other personalized POI recommendation methods that can be used. Some famous software (e.g., Netflix) uses model-based methods that are built on matrix factorization (MF). For example, ranked based Geographical Factorization Method in [1] adopted weighted regularized MF to serve people on POI. Machine learning is popular in this area. POI recommendation is an important topic in the domain of recommender systems [4]. This paper also described related work in Personalized location recommendation and attention mechanism in the recommendation.<br />
<br />
== Motivation == <br />
This paper reviews encoder and decoder. A single hidden-layer autoencoder is an unsupervised neural network, which consists of two parts: an encoder and a decoder. The encoder has one activation function that maps the input data to the latent space. The decoder also has one activation function mapping the representations in the latent space to the reconstruction space. And here is the formula:<br />
<br />
[[File: formula.png|center]](Note: a is the activation function)<br />
<br />
The proposed method uses a two-layer neural network to compute the score matrix in the architecture of the SAE. The NAD adopts the RBF kernel to make checked-in POIs exert more influence on nearby unvisited POIs. To train this model, Network training is required.<br />
<br />
This paper will use the datasets in the real world, which are from Gowalla[2], Foursquare [3], and Yelp[3]. These datasets would be used to train by using the method introduced in this paper and compare the performance of SAE-NAD with other POI recommendation methods. Three groups of methods are used to compare with the proposed method, which are traditional MF methods for implicit feedback, Classical POI recommendation methods, and Deep learning-based methods. Specifically, the Deep learning-based methods contain a DeepAE which is a three-hidden-layer autoencoder with a weighted loss function, we can connect this to the material in this course.<br />
<br />
Autoencoder (AE) due to their ability to represent complex data have become very useful in recommendation systems. The primary reason it is used is because with deep neural network and with non-linear activation function it effectively captures the non-linear and non-trivial relationships between users and POI's.<br />
<br />
== Methodology == <br />
<br />
=== Notations ===<br />
<br />
Here are the notations used in this paper. It will be helpful when trying to understand the structure and equations in the algorithm.<br />
[[File:notations.JPG|500px|x300px|center]]<br />
<br />
=== Structure ===<br />
<br />
The structure of the network in this paper includes a self-attentive encoder as the input layer(yellow), and a neighbor-aware decoder as the output layer(green).<br />
<br />
[[File:1.JPG|1200px|x600px]]<br />
<br />
=== Self-Attentive Encoder ===<br />
<br />
The self-attentive encoder is the input layer. It transfers the preference vector <math>x_u</math> to hidden representation <math>A_u</math> using weight matrix <math>W^1</math> and the activation function <math>softmax</math> and <math>tanh</math>. The 0's and 1's in <math>x_u</math> indicates whether the user has been to a certain POI. The weight matrix <math>W_a</math> assigns different weights on various features of POIs. <math>A_u</math> is the importance score matrix, where each column is a POI, and each row is the importance level of the <math>n</math> checked in POIs in one aspect.<br />
<br />
[[File:encoder.JPG|center]]<br />
<br />
<math>\mathbf{Z}_u^{(1)} = \mathbf{A}_u \cdot (\mathbf{W}^{(1)}[L_u])^\top</math> is the multiplication of the importance score matrix and the POI embeddings, and represents the user from <math>d_a</math> aspects. <math>\mathbf{Z}_u^{(1)}</math> is a <math>d_a \times H_1</math> matrix. The aggregation layer then combines multiple aspects that represent the user into one, so that it fits into the Neighbor-Aware decoder: <math>\mathbf{z}_u^{(1)} = a_t(\mathbf{Z}_u^{(1)\top}\mathbf{w}_t + \mathbf{b}_t)</math>.<br />
<br />
=== Neighbor-Aware Decoder ===<br />
<br />
POI recommendation uses the geographical clustering phenomenon, which increases the weight of the unvisited POIs that surrounds the visited POIs. Also, an aggregation layer is added to the network to aggregate users’ representations from different aspects into one aspect. This means that a person who has visited a location are very likely to return to this location again in the future, so the user is recommended POIs surrounding this area. An example would be someone who has been to the UW plaza and bought Lazeez are very likely to return to the plaza, therefore the person is recommended to try Mr. Panino's Beijing House.<br />
<br />
[[File:decoder.JPG|center]]<br />
<br />
=== Objective Function ===<br />
<br />
By minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with backpropagation. After that, the training is complete.<br />
<br />
[[File:objective_function.JPG|center]]<br />
<br />
<br />
== Comparative analysis ==<br />
<br />
=== Metrics introduction ===<br />
To obtain a comprehensive evaluation of the effectiveness of the model, the authors performed a thorough comparison between the proposed model and the existing major POI recommendation methods. These methods can be further broken down into three categories: traditional matrix factorization methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Here, three key evaluation metrics were introduced as Precison@k, Recall@k, and MAP@k. Through comparing all models on three datasets using the above metrics, it is concluded that the proposed model achieved the best performance.<br />
<br />
To better understand the comparison results, it is critical for one to understand the meanings behind each evaluation metrics. Suppose the proposed model generated k recommended POIs for the user. The first metric, Precison@k, measures the percentage of the recommended POIs which the user has visited. Recall@k is also associated with the user’s behavior. However, it will measure the percentage of recommended POIs in all POIs which have been visited by the user. Lastly, MAP@k represents the mean average precision at k, where average precision is the average of precision values at all k ranks, where relevant POIs are found.<br />
<br />
=== Model Comparison ===<br />
Among all models in the comparison group, RankGeoFM, IRNMF, and PACE produced the best results. Nonetheless, these models are still incomparable to our proposed model. The reasons are explained in details as follows:<br />
<br />
Both RankGeoFM and IRNMF incorporate geographical influence into their ranking models, which is significant for generating POI recommendations. However, they are not capable of capturing non-linear interactions between users and POIs. In comparison, the proposed model, while incorporating geographical influence, adopts a deep neural structure which enables it to measure non-linear and complex interactions. As a result, it outperforms the two methods in the comparison group.<br />
<br />
Moreover, compared to PACE, which is a deep learning-based method, the proposed model offers a more precise measurement of geographical influence. Though PACE is able to capture complex interactions, it models the geographical influence by a context graph, which fails to incorporate user reachability into the modeling process. In contrast, the proposed model is able to capture geographical influence directly through its neighbor-aware decoder, which allows it to achieve better performance than the PACE model.<br />
<br />
[[File:model_comparison.JPG|center]]<br />
<br />
== Conclusion ==<br />
In summary, the proposed model, namely SAE-NAD, clearly showed its advantages compared to many state-of-the-art baseline methods. Its self-attentive encoder effectively discriminates user preferences on check-in POIs, and its neighbor-aware decoder measures geographical influence precisely through differentiating user reachability on unvisited POIs. By leveraging these two components together, it can generate recommendations that are highly relevant to its users.<br />
<br />
== Critiques ==<br />
Besides developing the model and conducting a detailed analysis, the authors also did very well in constructing this paper. The paper is well-written and has a highly logical structure. Definitions, notations, and metrics are introduced and explained clearly, which enables readers to follow through the analysis easily. Last but not least, both the abstract and the conclusion of this paper are strong. The abstract concisely reported the objectives and outcomes of the experiment, whereas the conclusion is succinct and precise.<br />
<br />
This idea would have many applications, such as suggesting new restaurants to customers in the food delivery service app. Would the improvement in accuracy outweigh the increased complexity of the model when it comes to use in industry?<br />
<br />
It would be nice if the author could if the authors can describe the extensive experiments on the real-world datasets, with different baseline methods and evaluation metrics, to demonstrate the effectiveness of the proposed model. Moreover, show the comparison result in tables vs other methodologies, both in terms of accuracy and time-efficiency. In addition, the drawbacks of this new methodology are unknown to the readers. In other words, how does this compare to the already established recommendation systems found in large scale applications utilize by companies like Netflix? Why use the proposed method over something simpler such as matrix factorization or collaborative filtering? <br />
<br />
It would also be nice if the authors provided some more ablation on the various components of the proposed method. Even after reading some of their experiments, we do not have a clear understanding of how important each component is to the recommendation quality.<br />
<br />
It is recommended that the author can present how the encoder and the decoder help in the process by comparing different models with or without encoder and decoder.<br />
<br />
It would have been better if the author included all figures to explain the sensitivity of parameters more elaborately. In the paper he only included plots showing effects on Gowalla and Foursquare datasets but not other datasets.<br />
<br />
== References ==<br />
[1] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. 2014. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In KDD. ACM, 831–840.<br />
<br />
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In KDD. ACM, 1082–1090.<br />
<br />
[3] Yiding Liu, Tuan-Anh Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. PVLDB 10, 10 (2017), 1010–1021.<br />
<br />
[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed F. Mokbel. 2015. Recommendations in location-based social networks: a survey. GeoInformatica 19, 3 (2015), 525–565.<br />
<br />
[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.<br />
<br />
[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263–272.<br />
<br />
[7] Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In KDD. ACM, 659–667. [12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).<br />
<br />
[8] Yong Liu,WeiWei, Aixin Sun, and Chunyan Miao. 2014. Exploiting Geographical<br />
Neighborhood Characteristics for Location Recommendation. In CIKM. ACM,<br />
739–748<br />
<br />
[9] Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham, and Shonali Krishnaswamy.<br />
2015. Rank-GeoFM: A Ranking based Geographical Factorization<br />
Method for Point of Interest Recommendation. In SIGIR. ACM, 433–442.<br />
<br />
[10] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. 2017. Bridging<br />
Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for<br />
POI Recommendation. In KDD. ACM, 1245–1254.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48093Mask RCNN2020-11-30T01:35:29Z<p>D26ma: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision. <br />
Mask R-CNN, extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection. <br />
<br />
<br />
== Visual Perception tasks == <br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification <br />
and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and <br />
regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, a complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stage 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
<br />
Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts <br />
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
<br />
The network architecture utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabelled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
An interesting application of Mask RCNN would be on face recognization from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48090Mask RCNN2020-11-30T01:31:51Z<p>D26ma: /* Model Architecture */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision. <br />
Mask R-CNN, extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection. <br />
<br />
<br />
== Visual Perception tasks == <br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section.<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification <br />
and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and <br />
regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, a complete R-CNN pipeline stages involve: 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stage 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, which makes this formula the key for good instance segmentation results.<br />
<br />
Another important concept involved is called RoIAlign. This concept is useful in stage 2 where the RoIPool extracts <br />
features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to pixel-to-pixel correspondence. The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
<br />
The network architecture utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details that should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabelled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=48086Surround Vehicle Motion Prediction2020-11-30T01:28:12Z<p>D26ma: /* Critiques */</p>
<hr />
<div>DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data ===<br />
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target <br />
vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base <br />
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm.<br />
<br />
== Future works ==<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=48082Surround Vehicle Motion Prediction2020-11-30T01:23:35Z<p>D26ma: /* Introduction */</p>
<hr />
<div>DROCC: Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that less research is focused on predicting the trajectory of intersections. Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Data ===<br />
The real dataset is captured on urban roads in Seoul. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network, suitable for use with sequential data. It can also be used for time-series data, where the pattern of the data depends on the time flow. Also, it can contain feedback loops that allow activations to flow alternately in the loop.<br />
An LSTM avoids the problem of vanishing gradients by making errors flow backward without a limit on the number of virtual layers. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading and speed.<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target <br />
vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
In order to compare the similarities between the results form the proposed algorithm and human driving decisions, <br />
we introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than did the base <br />
algorithms. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm.<br />
<br />
== Future works ==<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle company such as Tesla, Tesla nows already have a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a hot very topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman&diff=48069User:Bsharman2020-11-30T01:16:22Z<p>D26ma: /* Critiques */</p>
<hr />
<div>'''Risk prediction in life insurance industry using supervised learning algorithms'''<br />
<br />
'''Presented By'''<br />
<br />
Bharat Sharman, Dylan Li, Leonie Lu, Mingdao Li<br />
<br />
<br />
==Introduction==<br />
<br />
Risk assessment lies at the core of the life insurance industry. It is extremely important for companies to assess the risk of an application accurately to ensure that those submitted by actual low risk applicants are accepted, whereas applications submitted by high risk applicants are either rejected or placed aside for further review. The types of risks include but are not limited to, a person's smoking status, and family history (such as history of heart disease). If this is not the case, individuals with an extremely high risk profile might be issued policies, often leading to high insurance payouts and large losses for the company. Such a situation is called ‘Adverse Selection’, where individuals who are most likely to be issued a payout are in fact given a policy and those who are not as likely are not, thus causing the company to suffer losses as a result.<br />
<br />
Traditionally, the process of underwriting (deciding whether or not to insure the life of an individual) has been done using actuarial calculations and judgements from underwriters. Actuaries group customers according to their estimated levels of risk determined from historical data. (Cummins J, 2013) However, these conventional techniques are time-consuming and it is not uncommon to take a month to issue a policy. They are expensive as a lot of manual processes need to be executed and a lot of data needs to be imported for the purpose of calculation. <br />
<br />
Predictive analysis is an effective technique that simplifies the underwriting process to reduce policy issuance time and improve the accuracy of risk prediction. In life insurance, this approach is widely used in mortality rates modeling to help underwriters make decisions and improve the profitability of the business. In this paper, the authors used data from Prudential Life Insurance company and investigated the most appropriate data extraction method and the most appropriate algorithm to assess risk.<br />
<br />
==Literature Review==<br />
<br />
<br />
Before a life insurance company issues a policy, it must execute a series of underwriting related tasks (Mishr, 2016). These tasks involve gathering extensive information about the applicants. The insurer has to analyze the employment, medical, family and insurance histories of the applicants and factor all of them into a complicated series of calculations to determine the risk rating of the applicants. On basis of this risk rating, premiums are calculated (Prince, 2016).<br />
<br />
In a competitive marketplace, customers need policies to be issued quickly and long wait times can lead to them switch to other providers (Chen 2016). In addition, the costs of data gathering and analysis can be expensive. The insurance company bears the expenses of the medical examinations and if a policy lapses, then the insurer has to bear the losses of all these costs (J Carson, 2017). If the underwriting process uses predictive analytics, then the costs and time associated with many of these processes will be reduced significantly. <br />
<br />
The key importance of a strong underwriting process for a life insurer is to avoid the risk of adverse selection. Adverse selection occurs in situations when an applicant has more knowledge about their health or conditions than the insurance company, and insurers can incur significant losses if they are systematically issuing policies to high-risk applicants due to this asymmetry of information. In order to avoid adverse selection, a strong classification system that correctly groups applicants into their appropriate risk levels is needed, and this is the motivation for this research. <br />
<br />
<br />
==Methods and Techniques==<br />
<br />
<br />
In Figure 1, the process flow of the analytics approach has been depicted. These stages will now be described in the following sections.<br />
<br />
[[File:Data_Analytics_Process_Flow.PNG]]<br />
<br />
<br />
'''Description of the Dataset'''<br />
<br />
The data is obtained from the Kaggle competition hosted by the Prudential Life Insurance company. It has 59381 applications with 128 attributes. The attributes are continuous and discrete as well as categorical variables. <br />
The data attributes, their types and the description is shown in Table 1 below:<br />
<br />
[[File:Data Attributes Types and Description.png]]<br />
<br />
<br />
'''Data Pre-Processing'''<br />
<br />
In the data preprocessing step, missing values in the data are either imputed or dropped and some of the attributes are either transformed to make the subsequent processing of data easier. This decision is made after determining the mechanism of missingness, that is if the data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR). <br />
<br />
'''Data Exploration using Visual Analytics'''<br />
<br />
The Exploratory Data Analysis (EDA) is composed by uni-variate and bivariate analyses, which allows the researchers to understand different distributions' features. Visual analytics aims to gain insights of data structures by creating charts and graphs for the prediction models. <br />
<br />
'''Dimensionality Reduction'''<br />
<br />
In this paper, there are two methods that have been used for dimensionality reduction – <br />
<br />
1.Correlation based Feature Selection (CFS): This is a feature selection method in which a subset of features from the original features is selected. In this method, the algorithm selects features from the dataset that are highly correlated with the output but are not correlated with each other. The user does not need to specify the number of features to be selected. The correlation values are calculated based on measures such as Pearson’s coefficient, minimum description length, symmetrical uncertainty, and relief. <br />
<br />
2.Principal Components Analysis (PCA): PCA is a feature extraction method that transforms existing features into new sets of features such that the correlation between them is zero and these transformed features explain the maximum variability in the data.<br />
<br />
PCA creates new features based on existing ones, while CFS only selects the best attributes based on the predictive power. Although PCA performs some feature engineering on attributes in data sets, the resulting new features are more complex to explain because it is difficult to derive meanings from principal components. On the other hand, CFS is easier to understand and interpret because the original features have not been merged or modified.<br />
<br />
==Supervised Learning Algorithms==<br />
<br />
<br />
The four algorithms that have been used in this paper are the following:<br />
<br />
1.Multiple Linear Regression: In MLR, the relationship between the dependent and the two or more independent variables is predicted by fitting a linear model. The model parameters are calculated by minimizing the sum of squares of the errors, which captures the average distance between the predicted data points and observed data. The significance of the variables is determined by tests like the F test and the p-values. <br />
<br />
2.REPTree: REPTree stands for reduced error pruning tree. It can build both classification and regression trees, depending on the type of the response variable. In this case, it uses regression tree logic and creates many trees across several iterations. This algorithm develops these trees based on the principles of information gain and variance reduction. At the time of pruning the tree, the algorithm uses the lowest mean square error to select the best tree. <br />
<br />
3.Random Tree (Also known as the Random Forest): A random tree selects some of the attributes at each node in the decision tree and builds a tree based on random selection of data as well as attributes. Random Tree does not do pruning. Instead, it estimates class probabilities based on a hold-out set.<br />
<br />
4.Artificial Neural Network: In a neural network, the inputs are transformed into outputs via a series of layered units where each of these units transforms the input received via a function into an output that gets further transmitted to units down the line. The weights that are used to weigh the inputs are improved after each iteration via a method called backpropagation in which errors propagate backward in the network and are used to update the weights to make the computed output closer to the actual output.<br />
<br />
==Experiments and Results==<br />
<br />
<br />
'''Missing Data Mechanism'''<br />
<br />
Attributes, where more than 30% of Data was missing, were dropped from the analysis. The data were tested for Missing Completely at Random (MCAR), one form of the nature of missing values using the Little Test. The null hypothesis that the missing data was completely random had a p-value of 0 meaning, MCAR was rejected. Then, all the variables were plotted to check how many missing values that they had, and the results are shown in the figure below:<br />
<br />
[[File: Missing Value Plot of Training Data.png]]<br />
<br />
The variables that have the most number of missing variables are plotted at the top and that have the least number of missing variables are plotted at the bottom of the y-axis in the figure above. Missing variables do not seem to have obvious patterns and therefore they are assumed to be Missing at Random (MAR), meaning the tendency for the variables to be missing is not related to the missing data but to the observed data.<br />
<br />
<br />
'''Missing Data Imputation'''<br />
<br />
Assuming that missing data follows a MAR pattern, multiple imputations are used as a technique to fill in the values of missing data. The steps involved in multiple imputation are the following: <br />
<br />
Imputation: The imputation of the missing values is done over several steps and this results in a number of complete data sets. Imputation is usually done via a predictive model like linear regression to predict these missing values based on other variables in the data set.<br />
<br />
Analysis: The complete data sets that are formed are analyzed and parameter estimates and standard errors are calculated.<br />
<br />
Pooling: The analysis results are then integrated to form a final data set that is then used for further analysis.<br />
<br />
<br />
'''Comparison of Feature Selection and Feature Extraction'''<br />
<br />
The Correlation-based Feature Selection (CFS) method was performed using the Waikato Environment for Knowledge Analysis. It was implemented using a Best-first search method on a CfsSubsetEval attribute evaluator. 33 variables were selected out of the total of 117 features. <br />
PCA was implemented via a Ranker Search Method using a Principal Components Attributes Evaluator. Out of the 117 features, those that had a standard deviation of more than 0.5 times the standard deviation of the first principal component were selected and this resulted in 20 features for further analysis. <br />
After dimensionality reduction, this reduced data set was exported and used for building prediction models using the four machine learning algorithms discussed before – REPTree, Multiple Linear Regression, Random Tree, and ANNs. The results are shown in the table below: <br />
<br />
[[File: Comparison of Results between CFS and PCA.png]]<br />
<br />
For CFS, the REPTree model had the lowest MAE and RMSE. For PCA, the Multiple Linear Regression Model had the lowest MAE as well as RMSE. So, for this dataset, it seems that overall, Multiple Linear Regression and REPTree Models are the two best ones with lowest error rates. In terms of dimensionality reduction, it seems that CFS is a better method than PCA for this data set as the MAE and RMSE values are lower for all ML methods except ANNs.<br />
<br />
==Conclusion and Further Work==<br />
<br />
<br />
Predictive Analytics in the Life Insurance Industry is enabling faster customer service and lower costs by helping automate the process of Underwriting, thereby increasing satisfaction and loyalty. <br />
In this study, the authors analyzed data obtained from Prudential Life Insurance to predict risk scores via Supervised Machine Learning Algorithms. The data was first pre-processed to first replace the missing values. Attributes having more than 30% of missing data were eliminated from the analysis. <br />
Two methods of dimensionality reduction – CFS and PCA were used and the number of attributes used for further analysis was reduced to 33 and 20 via these two methods. The Machine Learning Algorithms that were implemented were – REPTree, Random Tree, Multiple Linear Regression, and Artificial Neural Networks. Model validation was performed via ten-fold cross-validation. The performance of the models was evaluated using MAE and RMSE measures. <br />
Using the PCA method, Multiple Linear Regression showed the best results with MAE and RMSE values of 1.64 and 2.06 respectively. With CFS, REPTree had the highest accuracy with MAE and RMSE values of 1.52 and 2.02 respectively. <br />
Further work can be directed towards dealing with all the variables rather than deleting the ones where more than 30% of the values are missing. Customer segmentation, i.e. grouping customers based on their profiles can help companies come up with a customized policy for each group. This can be done via unsupervised algorithms like clustering. Work can also be done to make the models more explainable especially if we are using PCA and ANNs to analyze data. We can also get indirect data about the prospective applicant like their driving behavior, education record, etc to see if these attributes contribute to better risk profiling than the already available data.<br />
<br />
<br />
==Critiques==<br />
<br />
The project built multiple models and had utilized various methods to evaluate the result. They could potentially ensemble the prediction, such as averaging the result of the different models, to achieve a better accuracy result. Another method is model stacking, we can input the result of one model as input into another model for better results. However, they do have some major setbacks: sometimes, the result could be effect negatively (ie: increase the RMSE). In addition, if the improvement is not prominent, it would make the process much more complex thus cost time and effort. In a research setting, stacking and ensembling are definitely worth a try. In a real-life business case, it is more of a trade-off between accuracy and effort/cost. <br />
<br />
In this application, it is not essentially the same thing as classify risky as non-risky or vice versa, if the model misclassified a risky data as non-risky, it may create a large loss to the insurance companies. It is recommended that this issue could be carefully taken care of during the model selection. <br />
<br />
This project only used unsupervised dimensionality reduction techniques to analyze the data and did not explore supervised methods such as linear discriminant analysis, quadratic discriminant analysis, Fisher discriminant analysis, or supervised PCA. The supervisory signal could've been the normalized risk score, which determines whether someone is granted the policy. The supervision might have provided insight on factors contributing to risk that would not have been captured with unsupervised methods.<br />
<br />
In addition, the dimensionality reduction techniques seem to contradict each other. If we do select features based on their correlation, there is no need to do further PCA on the dataset. Even if we perform PCA, the resulting columns would have a similar number of columns compared to the original ones and it would lose interpretability of the features, which is crucial in real-life practices.<br />
<br />
Some of the models mentioned in this article require a large amount of data to perform well (e.g. neural network and tree-based algorithm). In a practical sense, a direct insurer may not have enough data (for certain lines of businesses) to train these models, overfitting could be a serious problem. The computational cost could be another critical issue. On a small dataset, Bayesian methods can be used to incorporate prior information into statistical models to obtain more refined inference. <br />
<br />
Attributes that have more than 30% missing data were dropped. The disadvantage of this is that the dropped variable might affect other variables in the dataset. An alternative approach to dealing with missing data would be the use of Neural Networks, eliminating the need for deletion or imputation depending on the mechanism missingness (Smieja et al 2018).<br />
<br />
The project contains multiple models and requires a lot of data to execute the algorithm well, but the data set is entirely from an insurance company. These data will have incompleteness and one-sided factors leading to inaccurate results.<br />
In the summary part, no specific conclusions and future work are mentioned, and the structure is not obvious and requires specific explanation. <br />
<br />
The project provides an idea for the insurance industry to analyze risks. For the model selection part, I am thinking that it might be an idea to involve more complicated models to detect the baseline for the datasets. Besides, applying more complicated models might work as a justification for the feature selection correctness.<br />
<br />
For sections titled "Missing Data Imputation" and "Missing Data Mechanism", it would be highly significant for the authors to provide more details regarding the experimental results. For instance, the authors mentioned that the null hypothesis which the missing data was completely random had a p-value of 0 meaning, MCAR was rejected. However, no relevant calculation or explanation has been provided to justify such conclusion.<br />
<br />
In the Data pre-processing part, the author did not specify what are MCAR, MAR, and MNAR. Technically speaking, MCAR stands for the missing values are like a random sample of all the cases in the feature. MAR stands for missing values can be predicted using the observed data. And MNAR means the missing values cannot be explained by the observed part of the data.<br />
<br />
This project used supervised Machine learning to predict risked score. In the prepossessing data step, the author used MCAR, MAR, MNAR methods and drop roughly 30% missing data, which might leads the insufficient training data for the model. Meanwhile, all the data came from the insurance company which is subjective. Moreover, it would be better if the data pre-processing part to indicate what will be the main difference among MCAR, MAR and MNAR methods.<br />
<br />
As mentioned in the previous critique, having risky and non-risky as the output is not neccessarily true, not just for the company, but also for the consumer as us.<br />
<br />
==References==<br />
<br />
Chen, T. (2016). Corporate reputation and financial performance of Life Insurers. Geneva Papers Risk Insur Issues Pract, 378-397.<br />
<br />
Cummins J, S. B. (2013). Risk classification in Life Insurance. Springer 1st Edition.<br />
<br />
J Carson, C. E. (2017). Sunk costs and screening: two-part tariffs in life insurance. SSRN Electron J, 1-26.<br />
<br />
Jayabalan, N. B. (2018). Risk prediction in life insurance industry using supervised learning algorithms. Complex & Intelligent Systems, 145-154.<br />
<br />
Mishr, K. (2016). Fundamentals of life insurance theories and applications. PHI Learning Pvt Ltd.<br />
<br />
Prince, A. (2016). Tantamount to fraud? Exploring non-disclosure of genetic information in life insurance applications as grounds for policy recession. Health Matrix, 255-307.<br />
<br />
Śmieja, M., Struski, Ł., Tabor, J., Zieliński, B., & Spurek, P. (2018). Processing of missing data by neural networks. In Advances in Neural Information Processing Systems (pp. 2719-2729).</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=48025User:Gtompkin2020-11-30T00:59:43Z<p>D26ma: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of data by imputation or other commonly used methods in the existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. The neuron's response at the first hidden layer is generalized by taking its expected value to process this probabilistic representation. This process is essentially calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous as it has the ability to train neural networks using incomplete observations from datasets, which are ubiquitous in practice. This approach also requires minimal adjustments and modifications to existing architectures. Theoretical results of this study show that this process does not lead to a loss of information, while experimental results showed the practical uses of this methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and <math>k</math>-nearest neighbors (k-NN) imputation. In the former (mean imputation), the missing value is replaced by the mean of all available values of that feature in the dataset. In the latter (k-NN imputation), the missing value is also replaced by the mean, however, it is now computed using only the k "closest" samples in the dataset. If the dataset is numerical, Euclidean distance could be used as a measure of "closeness", for example. Other methods for dealing with missing data involve training separate neural networks and extreme learning machines. Probabilistic models of incomplete data can also be built depending on the mechanism missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Further, the decision function can also be trained using available/visible inputs alone. Previous work using neural networks for missing data includes a paper by Bengio and Gringras [1] where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et. al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine that could perform classification on data with missingness in the inputs.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset \{1,...,D\} </math> is a set of attributes with missing data. <math>(x,J)</math> therefore represents an "incomplete" data point for which <math>|J|</math>-many entries are unknown - examples of this could be a list of daily temperature readings over a week where temperature was not recorded on the third day (<math>x\in \mathbb{R}^7, J= \{3\}</math>), an audio transcript that goes silent for certain timespans, or images that are partially masked out (as discussed in the examples).<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,…,N\}/J</math>: <br />
<br />
<center><math>S=Aff[x,J]=span(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_iNR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}}\big)</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}+b, 0)</math>, <math>w \in \mathbb{R}^D </math> and <math> b \in \mathbb{R}</math> is the bias.<br />
<br />
'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.<br />
<br />
<math> </math><br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, which are summarized below, show that using generalized neuron's activation at the first layer does not lead to the loss of information. <br />
<br />
Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> which is given by <br />
<center><math>n(\mu) := \int n(x)d\mu(x)</math></center><br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>v</math> be probabilistic measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If <br />
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for } w \in \mathbb{R}^D, b \in \mathbb{R}</math></center> then <math>\nu = \mu</math>.<br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units is able to identify any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP), and is summarized as follows. <br />
<br />
''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>). <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, as <math>cos(\cdot), sin(\cdot) \in F_w</math>, the function <center><math>exp(ir) = cos(r) + i sin(r) \in F_w</math></center> and we have the equality <center><math>\int exp(iw^Tx)d\mu(x) = \int exp(iw^Tx)d\nu(x).</math></center> Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> as the characteristic functions of the two measures coincide. <br />
<br />
A result analogous to Theorem 4.1 for RBF can also be obtained.<br />
<br />
'''Theorem 2.1''' Let <math>\mu, \nu</math> be probabilistic measures. If<br />
$$ RBF_{m,\alpha}(\mu) = RBF_{m,\alpha}(\nu) \text{ for every } m \in \mathbb{R}^D, \alpha > 0,$$<br />
then <math>\nu = \mu</math>.<br />
<br />
More general results can be obtained making stronger assumptions on the probability measures. For example, if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.<br />
<br />
'''Theorem 2.2''' Let <math>\mu, \nu</math> be probabilistic measures with compact support. Let <math>\mathcal{N}</math> be a family of functions having UAP. If<br />
$$n(\mu) = n(\nu) \text{ for every } n \in \mathcal{N},$$<br />
then <math>\nu = \mu</math>.<br />
<br />
A detailed proof Theorems 2.1 and 2.2 can be found in section 2 of the Supplementary Materials, which can be downloaded [https://proceedings.neurips.cc/paper/2018/hash/411ae1bf081d1674ca6091f8c59a266f-Abstract.html here].<br />
<br />
== Experimental Results ==<br />
The model was applied to three types of algorithms: an Autoencoder (AE), a multilayer perceptron, and a radial basis function network.<br />
<br />
'''Autoencoder'''<br />
<br />
Corrupted images were restored as a part of this experiment. Grayscale handwritten digits were obtained from the MNIST database. A 13 by 13 (169 pixels) square was removed from each 28 by 28 (784 pixels) image. The location of the square was uniformly sampled for each image. The autoencoder used included 5 hidden layers. The first layer used ReLU activation functions while the subsequent layers utilized sigmoids. The loss function was computed using pixels from outside the mask. <br />
<br />
Popular imputation techniques were compared against the conducted experiment:<br />
<br />
''k-nn:'' Replaced missing features with the mean of respective features calculated using K nearest training samples. Here, K=5. <br />
<br />
''mean:'' Replaced missing features with the mean of respective features calculated using all incomplete training samples.<br />
<br />
''dropout:'' Dropped input neutrons with missing values. <br />
<br />
Moreover, a context encoder (CE) was trained by replacing missing features with their means. Unlike mean imputation, the complete data was used in the training phase. The method under study performed better than the imputation methods inside and outside the mask. Additionally, the method under study outperformed CE based on the whole area and area outside the mask. <br />
<br />
The mean square error of reconstruction is used to test each method. The errors calculated over the whole area, inside and outside the mask are shown in Table 1, which indicates the method introduced in this paper is the most competitive.<br />
<br />
[[File:Group3_Table1.png |center]]<br />
<div align="center">Table 1: Mean square error of reconstruction on MNIST incomplete images scaled to [0, 1]</div><br />
<br />
'''Multilayer Perceptron'''<br />
<br />
A multilayer perceptron with 3 ReLU hidden layers was applied to a multi-class classification problem on the Epileptic Seizure Recognition (ESR) data set taken from [3]. Each 178-dimensional vector (out of 11500 samples) is the EEG recording of a given person for 1 second, categorized into one of 5 classes. To generate missing attributes, 25%, 50%, 75%, and 90% of observations were randomly removed. The aforementioned imputation methods were used in addition to Multiple Imputation by Chained Equation (mice) and a mixture of Gaussians (gmm). The former utilizes the conditional distribution of data by Markov chain Monte Carlo techniques to draw imputations. The latter replaces missing features with values sampled from GMM estimated from incomplete data using the EM algorithm. <br />
<br />
Double 5-fold cross-validation was used to report classification results. The classical accuracy measure is usually being used for accessing the results.The model under study outperformed classical imputation methods, which give reasonable results only for a low number of missing values. The method under study performs nearly as well as CE, even though CE had access to complete training data. <br />
<br />
'''Radial Basis Function Network'''<br />
<br />
RBFN can be considered as a minimal architecture implementing our model, which contains only one hidden layer. A cross-entropy function was applied to a softmax in the output layer. Two-class data sets retrieved from the UCI repository [4] with internally missing attributes were used. Since the classification is binary, two additional SVM kernel models which work directly with incomplete data without performing any imputations were included in the experiment:<br />
<br />
''geom:'' The objective function is based on the geometric interpretation of the margin and aims to maximize the margin of each sample in its own subspace [5].<br />
<br />
''karma:'' This algorithm iteratively tunes kernel classifier under low-rank assumptions [6].<br />
<br />
The above SVM methods were combined with RBF kernel function. The number of RBF units was selected in the inner cross-validation from the range {25, 50, 75, 100}. Initial centers of RBFNs were randomly selected from training data while variances were samples from <math>N(0,1)</math> distribution. For SVM methods, the margin parameter <math>C</math> and kernel radius <math>\gamma</math> were selected from <math>\{2^k :k=−5,−3,...,9\}</math> for both parameters. For karma, additional parameter <math>\gamma_{karma}</math> was selected from the set <math>\{1, 2\}</math>.<br />
<br />
The model under study outperformed imputation techniques in almost all cases. It partially confirms that the use of raw incomplete data in neural networks is a usually better approach than filling missing attributes before the learning process. Moreover, it obtained more accurate results than modified kernel methods, which directly work on incomplete data. The performance of the model was once again comparable to, and in some cases better than CE, which had access to the complete data.<br />
<br />
== Conclusion ==<br />
<br />
The results with these experiments along with the theoretical results conclude that this novel approach for dealing with missing data through a modification of a neural network is beneficial and outperforms many existing methods. This approach, which utilizes representing missing data with a probability density function, allows a neural network to determine a more generalized and accurate response of the neuron.<br />
<br />
== Critiques ==<br />
<br />
- A simulation study where the mechanism of missingness is known will be interesting to examine. Doing this will allow us to see when the proposed method is better than existing methods, and under what conditions.<br />
<br />
- This method of imputing incomplete data has many limitations: in most cases when we have a missing data point we are facing a relatively small amount of data that does not require training of a neural network. For a large dataset, missing records does not seem to be very crucial because obtaining data will be relatively easier, and using an empirical way of imputing data such as a majority vote will be sufficient.<br />
<br />
- An interesting application of this problem is in NLP. In NLP, especially Question Answering, there is a problem where a query is given and an answer must be retrieved, but the knowledge base is incomplete. There is therefore a requirement for the model to be able to infer information from the existing knowledge base in order to answer the question. Although this problem is a little more contrived than the one mentioned here, it is nevertheless similar in nature because it requires the ability to probabilistically determine some value which can then be used as a response.<br />
<br />
- For the first sentence in the introduction section, it might be better to use "to deal with missing and incomplete data" rather than use "dealing with missing and incomplete data"<br />
<br />
- Based on my R experience, there is one useful function "knnImputation" under the package DMwR, which is very helpful when dealing with missing and incomplete data. This function uses the k-nearest neighbors to fill in the unknown (NA) values in a data set. The users can modify the number of neighbors required in the calculation of each missing value.<br />
<br />
- The experiments in this paper evaluate this method against low amounts of missing data. It would be interesting to see the properties of this imputation when a majority of the data is missing, and see if this method can outperform dropout training in this setting (dropout is known to be surprisingly robust even at high drop levels).<br />
<br />
- This problem can possibly be applied to face recognition where given a blurry image of a person's face, the neural network can make the image clearer such that the face of the person would be visible for humans to see and also possible for the software to identify who the person is.<br />
<br />
- This novel approach can also be applied to restoring damaged handwritten historical documents. By feeding in a damaged document with portions of unreadable texts, the neural network can add missing information utilizing a trained context encoder<br />
<br />
- It will be interesting to see how this method performs with audio data, i.e. say if there are parts of an audio file that are missing, whether the neural network will be able to learn the underlying distribution and impute the missing sections of speech.<br />
<br />
- In general, data are usually missing in a specific part of the content. For example, old books usually have first couple page or last couple pages that are missing. It would be interesting to see that how the distribution of "missing data" will be applied in those cases.<br />
<br />
- In this paper, the researchers were able to outperform existing imputation methods using neural networks. It would be really nice to see how does the usage of neural networks impact the need for amount of data, and how much more training is required in comparison to the other algorithms provided in this paper. <br />
<br />
- It might be an interesting approach to investigate how the size of missing data may influence the training. For example, in the MNIST AutoEncoder, some algorithms involve masks to generate more general images and avoid overfitting. The approach could compare the result by changing the size of the "missing" part to illustrate to what degree can we ignore the missing data and view them as assistance.<br />
<br />
- It would be nice to see how the method under study outperforms both the Multilayer Perceptron and Radial Basis Function Network methods mentioned in the summary. Since these two methods are placed under the Experimental Results section, it is expected that they consist more justifications and supporting evidence (analytical results, tables, graphs, etc.) then what have been presented, so that it is more convincing for the readers.<br />
<br />
- Both KNN imputation and mean imputation are valid techniques to solve the problem, one possible future study on this topic is to explore which one of the two methods above will perform better given the sparsity of the dataset. Another possible study is that if there are methods that can make better inferences on original input, an example is described in the paper (https://cs.uwaterloo.ca/~ilyas/papers/WuMLSys2020.pdf), where imputation is performed based on learning structural properties of data distributions.<br />
<br />
- This project have many applications in the real world. One of the example is to forecast the average height in Canada. We know that people are born and die every second. Also, many people are not accessible which means we cannot obtain the exactly true data. Thus, we can just feed the observed data to the neural network then we will get the predicted result. This can also be applied to the investment world since the data are changing every second.<br />
<br />
- It is interesting to see how we can fill missing data using machine learning. The standard way to fill missing numeric data is to use mean, and using KNN is really a talent way of doing this.<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous<br />
data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.<br />
<br />
[3] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907, 2001.<br />
<br />
[4] Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007.<br />
<br />
[5] Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1–21, 2008.<br />
<br />
[6] Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. In Proceedings of The 32nd International Conference on Machine Learning, pages 257–266, 2015.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=45992Semantic Relation Classification——via Convolution Neural Network2020-11-23T03:04:51Z<p>D26ma: </p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE.<br />
<br />
Data comes from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for two tasks. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Network.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8(Hendrickx et al., 2010) studied the 9 relations between word pairs. However, it is not specially for scientific text analysis. Xu et al. (2015a) and Santos et al. (2015) , both of them applied CNN with negative sampling to finish task7.<br />
<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px]]<br />
<br />
This is the architecture of the CNN. We first transform a sentence via Feature embeddings. Basically we transform each sentence into continuous word embeddings:<br />
<br />
[[File:WordPosition.png]]<br />
<br />
<br />
And word position embeddings:<br />
<br />
[[File:Position.png]]<br />
<br />
In the word embeddings, we got a vocabulary ‘V’, and we will make an embedding word matrix based on the position of the word in the vocabulary. This matrix and trainable and need to be initialized by pre-trained embedding vectors.<br />
In the word position embeddings, we first need to input some words named ‘entities’ and they are the key for the machine to determinate sentence’s relation. During this process, if we have two entities, we will used the relative position of them in the sentence to make the<br />
embeddings. We will output two vectors and one of them keep track of the first entity relative position in the sentence ( we will make the entity recorded as 0, the former word recorded as -1 and the next one 1, etc. ). And the same procedure for the second entity. Finally we will get two vectors concatenated as the position embedding.<br />
<br />
<br />
After the embeddings, the model will transform the embedded sentence to a fix-sized representation of the whole sentence via the convolution layer, finally after the max pooling to reduce the dimension of the output of the layers, we will get a score for each relation class via a linear transformation.<br />
<br />
<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=tanh(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor <br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
is produced.<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=log(1+e^{\gamma(m^{+}-s(x)_{y})})+log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))})$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Results ==<br />
In the machine learning, the most important part is to tune the hyperparameters. Unlike the traditional hyperparameters optimization, there are some<br />
modifications to the model in order to increase the performance on test set. There are 5 modifications:<br />
<br />
1. Merged Training Sets. It combined two training sets to increase the data set<br />
size and it improves the equality between classes to get better predictions.<br />
<br />
2. Reversal Indicate Features. It added binary feature.<br />
<br />
3. Custom ACL Embeddings. It embedded word vector to an ACL-specific<br />
corps.<br />
<br />
4. Context words. Within the sentence, it varies size on a context window<br />
around the entity-enclosed text.<br />
<br />
5. Ensembling. It used different early stop and random initializations to improve<br />
the predictions.<br />
<br />
These modifications performances well on the training data and they are shown<br />
in the table 3.<br />
<br />
[[File:table3.PNG]]<br />
<br />
<br />
<br />
As we can see the best choice for this model is ensembling. Because the random initialization made the data more nature and avoided the overfit.<br />
During the training process, there are some methods such that they can only<br />
increases the score on the cross validation test sets but hurt the performance on<br />
the overall macro-F1 score. Thus, these methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG]]<br />
<br />
There are six submissions in total. Three for each training set and the result<br />
is shown in figure 2.<br />
<br />
The best submission for training set 1.1 is the third submission which did not<br />
use the cross validation as the test set. Instead, it runs a constant number of<br />
training epochs and based on the training data it can be chosen by cross validation. The best submission for training set 1.2 is the first submission which<br />
extracted 10% of the training data as validation accuracy on the test set predictions.<br />
All in all, early stop cannot always based on the accuracy of the validation set<br />
since it cannot guarantee to get better performance on the real test set. Thus,<br />
we have to try new approaches and combine them together to see the prediction<br />
results. Also, doing stratification will certainly to improve the performance on<br />
the test data.<br />
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM and CNN models are tested. Variations are applied to the models. Among all variations, vanilla CNN with negative sampling and ACL-embedding has significant better performance than all others. Attention based pooling, up-sampling and data augmentation are also tested, but they barely perform positive incresement on the behaviour.<br />
<br />
== References ==<br />
Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
DragomirR. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.<br />
<br />
Kata Gbor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Hafa Zargayouna,<br />
and Thierry Charnois. 2018. Semeval-2018 task 7:Semantic relation extraction and classification in scientific papers. <br />
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval2018), New Orleans, LA, USA, June 2018.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=45990Semantic Relation Classification——via Convolution Neural Network2020-11-23T03:00:42Z<p>D26ma: </p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE.<br />
<br />
Data comes from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for two tasks. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Network.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8(Hendrickx et al., 2010) studied the 9 relations between word pairs. However, it is not specially for scientific text analysis. Xu et al. (2015a) and Santos et al. (2015) , both of them applied CNN with negative sampling to finish task7.<br />
<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px]]<br />
<br />
This is the architecture of the CNN. We first transform a sentence via Feature embeddings. Basically we transform each sentence into continuous word embeddings:<br />
<br />
[[File:WordPosition.png]]<br />
<br />
<br />
And word position embeddings:<br />
<br />
[[File:Position.png]]<br />
<br />
In the word embeddings, we got a vocabulary ‘V’, and we will make an embedding word matrix based on the position of the word in the vocabulary. This matrix and trainable and need to be initialized by pre-trained embedding vectors.<br />
In the word position embeddings, we first need to input some words named ‘entities’ and they are the key for the machine to determinate sentence’s relation. During this process, if we have two entities, we will used the relative position of them in the sentence to make the<br />
embeddings. We will output two vectors and one of them keep track of the first entity relative position in the sentence ( we will make the entity recorded as 0, the former word recorded as -1 and the next one 1, etc. ). And the same procedure for the second entity. Finally we will get two vectors concatenated as the position embedding.<br />
<br />
<br />
After the embeddings, the model will transform the embedded sentence to a fix-sized representation of the whole sentence via the convolution layer, finally after the max pooling to reduce the dimension of the output of the layers, we will get a score for each relation class via a linear transformation.<br />
<br />
<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=tanh(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor <br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
is produced.<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=log(1+e^{\gamma(m^{+}-s(x)_{y})})+log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))})$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Results ==<br />
In the machine learning, the most important part is to tune the hyperparameters. Unlike the traditional hyperparameters optimization, there are some<br />
modifications to the model in order to increase the performance on test set. There are 5 modifications:<br />
<br />
1. Merged Training Sets. It combined two training sets to increase the data set<br />
size and it improves the equality between classes to get better predictions.<br />
<br />
2. Reversal Indicate Features. It added binary feature.<br />
<br />
3. Custom ACL Embeddings. It embedded word vector to an ACL-specific<br />
corps.<br />
<br />
4. Context words. Within the sentence, it varies size on a context window<br />
around the entity-enclosed text.<br />
<br />
5. Ensembling. It used different early stop and random initializations to improve<br />
the predictions.<br />
<br />
These modifications performances well on the training data and they are shown<br />
in the table 3.<br />
<br />
[[File:table3.PNG]]<br />
<br />
<br />
<br />
As we can see the best choice for this model is ensembling. Because the random initialization made the data more nature and avoided the overfit.<br />
During the training process, there are some methods such that they can only<br />
increases the score on the cross validation test sets but hurt the performance on<br />
the overall macro-F1 score. Thus, these methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG]]<br />
<br />
There are six submissions in total. Three for each training set and the result<br />
is shown in figure 2.<br />
<br />
The best submission for training set 1.1 is the third submission which did not<br />
use the cross validation as the test set. Instead, it runs a constant number of<br />
training epochs and based on the training data it can be chosen by cross validation. The best submission for training set 1.2 is the first submission which<br />
extracted 10% of the training data as validation accuracy on the test set predictions.<br />
All in all, early stop cannot always based on the accuracy of the validation set<br />
since it cannot guarantee to get better performance on the real test set. Thus,<br />
we have to try new approaches and combine them together to see the prediction<br />
results. Also, doing stratification will certainly to improve the performance on<br />
the test data.<br />
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM and CNN models are tested. Variations are applied to the models. Among all variations, vanilla CNN with negative sampling and ACL-embedding has significant better performance than all others. Attention based pooling, up-sampling and data augmentation are also tested, but they barely perform positive incresement on the behaviour.<br />
<br />
== References ==<br />
Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
DragomirR. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=45769Semantic Relation Classification——via Convolution Neural Network2020-11-22T21:12:06Z<p>D26ma: /* Introduction */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE.<br />
<br />
Data comes from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for two tasks. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Network.<br />
<br />
== Algorithm ==<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=tanh(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor <br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
is produced.<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=log(1+e^{\gamma(m^{+}-s(x)_{y})}+log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))}$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM and CNN models are tested. Variations are applied to the models. Among all variations, vanilla CNN with negative sampling and ACL-embedding has significant better performance than all others. Attention based pooling, up-sampling and data augmentation are also tested, but they barely perform positive incresement on the behaviour.<br />
<br />
== References ==</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=45759Semantic Relation Classification——via Convolution Neural Network2020-11-22T21:00:45Z<p>D26ma: /* Conclusions */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE.<br />
Data comes from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for two tasks. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made.<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Network.<br />
<br />
== Algorithm ==<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=tanh(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor <br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
is produced.<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=log(1+e^{\gamma(m^{+}-s(x)_{y})}+log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))}$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM and CNN models are tested. Variations are applied to the models. Among all variations, vanilla CNN with negative sampling and ACL-embedding has significant better performance than all others. Attention based pooling, up-sampling and data augmentation are also tested, but they barely perform positive incresement on the behaviour.<br />
<br />
== References ==</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=45758Semantic Relation Classification——via Convolution Neural Network2020-11-22T20:59:51Z<p>D26ma: /* Introduction */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE.<br />
Data comes from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for two tasks. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made.<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Network.<br />
<br />
== Algorithm ==<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=tanh(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor <br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
is produced.<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=log(1+e^{\gamma(m^{+}-s(x)_{y})}+log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))}$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Conclusions ==<br />
<br />
== References ==</div>D26mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=42636stat441F212020-10-08T03:47:14Z<p>D26ma: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks] || ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| || || ||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| || || ||<br />
|-<br />
|Week of Nov 16 || || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || || 8|| || || ||<br />
|-<br />
|Week of Nov 16 || || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || || 10|| || || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Netorks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 13|| || || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| || || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| || || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai, Leyan Cheng || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr] || ||<br />
|-<br />
|Week of Nov 30 || || 21|| || || ||<br />
|-<br />
|Week of Nov 30 || || 22|| || || ||<br />
|-<br />
|Week of Nov 30 || || 23|| || || ||<br />
|-<br />
|Week of Nov 30 || || 24|| || || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi Will Thibault Jan Lau Jiwon Yang || 25|| Loss Function Search for Face Recognition<br />
|| [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf] paper || ||<br />
|-<br />
|Week of Nov 30 || || 26|| || || ||<br />
|-<br />
|Week of Nov 30 || || 27|| || || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, DanMeng Cui, ZiJie Jiang, MingKang Jiang || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classifcation systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-</div>D26ma