Music Recommender System Based using CRNN (statwiki, last edited 2020-12-07 by Y2587wan)
<hr />
<div>==Introduction and Objective:==<br />
<br />
In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data?<br />
<br />
The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.<br />
<br />
This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.<br />
<br />
The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.<br />
<br />
The following two-method approach for building the recommendation system was followed:<br />
#Make recommendations including genre information extracted from classification algorithms.<br />
#Make recommendations without genre information.<br />
<br />
The authors used a convolutional recurrent neural network (CRNN), which is a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN), as their main classification model.<br />
<br />
==Methods and Techniques:==<br />
Generally, a music recommender can be divided into three main parts: (i) users, (ii) items, and (iii) user-item matching algorithms. First, a model of a user's music taste is generated from their profile. Second, item profiling based on editorial, cultural, and acoustic metadata is exploited to increase listener satisfaction. Third, a matching algorithm is employed to recommend personalized music to the listener. Two main approaches are currently available:<br />
<br />
1. Collaborative filtering<br />
<br />
It is based on users' historical listening data and depends on user ratings. Nearest neighbour is the standard method used for collaborative filtering and can be broken into two classes of methods: (i) user-based neighbourhood methods and (ii) item-based neighbourhood methods. <br />
<br />
User-based neighbourhood methods calculate the similarity between the target user and other users and select the k most similar users. A weighted average of those users' song ratings is then computed to predict how the target user would rate each song, and songs with a high predicted rating are recommended. In contrast, item-based neighbourhood methods recommend songs by calculating similarities between songs that the target user has rated well and songs they have not yet listened to.<br />
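The weighted-average prediction in the user-based method can be sketched as follows (a minimal illustration on a hypothetical toy ratings matrix, not the paper's code; treating unrated songs as 0 in the similarity computation is a simplification):<br />

```python
import numpy as np

# Hypothetical ratings matrix: rows = users, columns = songs, 0 = not rated.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def predict_rating(R, target_user, song, k=2):
    """User-based neighbourhood prediction: weighted average of the
    k most similar users' ratings for the given song."""
    target = R[target_user]
    sims = []
    for u in range(R.shape[0]):
        if u == target_user:
            continue
        # cosine similarity between rating vectors (unrated entries = 0)
        num = np.dot(R[u], target)
        den = np.linalg.norm(R[u]) * np.linalg.norm(target)
        sims.append((num / den, u))
    sims.sort(reverse=True)
    top = [(s, u) for s, u in sims[:k] if R[u, song] > 0]
    if not top:
        return 0.0
    return sum(s * R[u, song] for s, u in top) / sum(s for s, u in top)

pred = predict_rating(R, target_user=0, song=2, k=2)
```

Here user 0's taste resembles user 1's, so the prediction for song 2 is pulled toward user 1's low rating despite user 2's high one.<br />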
<br />
That being said, collaborative filtering faces many challenges. For example, given that each user sees only a small portion of the full music library, sparsity and scalability become issues; however, these can be dealt with using matrix factorization. A more difficult challenge to overcome is the fact that users often do not rate songs while they are listening to music. <br />
<br />
2. Content-based filtering<br />
<br />
Content-based recommendation systems base their recommendations on the similarity between an item's features and the features of items the user has enjoyed. There are two steps: (i) extract audio content features and (ii) predict user preferences.<br />
<br />
However, content-based filtering has to overcome the challenge of only being able to predict based on users' existing interests; the model cannot effectively adapt to a user's ever-changing music taste.<br />
<br />
In this work, the authors take a content-based approach, comparing the similarity of audio signal features to make recommendations. To classify music, the original audio signal is converted into a spectrogram image using the Short Time Fourier Transform (STFT), and the spectrogram is then mapped onto the Mel scale before being used in the CNN and CRNN models. <br />
=== Mel Scale: === <br />
A perceptual scale of pitches judged by listeners to be equally spaced from one another.<br />
<br />
[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]<br />
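A common formula for converting frequency in hertz to mels is sketched below (one of several variants found in the literature; the exact constants here are a standard convention, not taken from the paper):<br />

```python
import numpy as np

# A widely used mel-scale formula: m = 2595 * log10(1 + f / 700)
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

m_1k = hz_to_mel(1000.0)   # 1000 Hz is close to 1000 mels by construction
```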
<br />
=== Short Time Fourier Transform (STFT): ===<br />
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math><br />
<br />
where <math>w(\tau)</math> is the Hanning window function. The STFT is applied over a specified window length at a given time, allowing the frequency content to be represented for that window rather than for the entire signal, as a standard Fourier transform would.<br />
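The windowed-FFT computation described above can be sketched in a few lines (a minimal illustration; the window length and hop size here are arbitrary choices, not the paper's settings):<br />

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Discrete STFT: slide a Hanning window over the signal and take
    the FFT of each windowed frame (rows = frequency bins, columns = frames)."""
    w = np.hanning(win_len)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        frames.append(np.fft.rfft(x[start:start + win_len] * w))
    return np.array(frames).T  # shape: (win_len // 2 + 1, n_frames)

# a 100 Hz sine sampled at 1 kHz: its energy should land near bin 100/(fs/win_len)
fs = 1000
t = np.arange(0, 1.0, 1.0 / fs)
S = stft(np.sin(2 * np.pi * 100 * t))
```

The magnitude of `S` is what gets rendered as the spectrogram image.<br />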
<br />
=== Convolutional Neural Network (CNN): ===<br />
A Convolutional Neural Network is a neural network that uses convolution in place of general matrix multiplication in at least one of its layers. During training, the kernel weights are updated to pick out the features most relevant to classification. Convolutional layers slide small kernels over local groups of data and look for patterns that reveal features in the overall input; these features are then used for classification. Padding extends the pixels on the edge of the original image so that the kernel can more accurately capture the borderline pixels, and is also used when one wishes the convolved output image to have a certain size. The image on the left gives the mathematical expression of the convolution operation, while the right image demonstrates the application of a kernel to the data.<br />
<br />
[[File:Convolution.png|thumb|400px|left|Convolution Operation]]<br />
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]<br />
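The padding-and-kernel operation shown in the figures can be sketched as follows (a minimal single-channel illustration; the edge-detecting kernel here is an arbitrary example, not one learned by the paper's networks):<br />

```python
import numpy as np

def conv2d(img, kernel, pad=1, stride=1):
    """2D cross-correlation (as used in CNN layers) with zero padding."""
    img = np.pad(img, pad)  # surround the image with `pad` rows/cols of zeros
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh,
                        j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# a 3x3 Laplacian-style kernel on a flat 5x5 image; padding keeps the output 5x5
img = np.ones((5, 5))
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
out = conv2d(img, kernel, pad=1, stride=1)
```

On the flat interior the response is zero; only the zero-padded border produces nonzero values, illustrating how padding affects borderline pixels.<br />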
<br />
=== Convolutional Recurrent Neural Network (CRNN): === <br />
The CRNN has a similar architecture to a CNN, with the addition of a Gated Recurrent Unit (GRU), a form of Recurrent Neural Network (RNN). An RNN is used to handle sequential data, reusing the activations of previous nodes to update the output. The GRU stores longer-term memory and helps train the early hidden layers. GRUs can be thought of as simplified LSTMs: they merge the forget and input gates into a single update gate and have fewer parameters than an LSTM. These gates determine how much information from the past should be passed along to the future. They were originally designed to mitigate the vanishing gradient problem, since deeper networks result in smaller and smaller gradients at each layer; a GRU can choose to copy over all the information from the past, eliminating the risk of vanishing gradients.<br />
<br />
[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]<br />
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]<br />
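The GRU update described above can be sketched as follows (a minimal illustration with randomly initialized weights; bias terms are omitted for brevity, and gate conventions vary slightly between references):<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step: the update gate z interpolates between the previous
    hidden state and a candidate state; the reset gate r controls how much
    of the past enters the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # keep past vs. write new

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
          else rng.standard_normal((d_h, d_h)) for i in range(6)]

h = np.zeros(d_h)
for _ in range(5):
    h = gru_step(rng.standard_normal(d_in), h, params)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden state stays bounded, and setting z near 0 copies the past through unchanged.<br />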
<br />
==Data Screening:==<br />
<br />
The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archive, covering 16 different genres. The data was cleaned by removing songs with low audio quality, wrongly labelled genres, or multiple genre labels. To ensure a balanced dataset, only 1,000 songs from each of the genres classical, electronic, folk, hip-hop, instrumental, jazz, and rock were used in the final model. <br />
<br />
[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]<br />
<br />
==Implementation:==<br />
<br />
=== Modeling Neural Networks ===<br />
<br />
As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy. <br />
<br />
In implementing the neural networks, the Mel-spectrogram data was split into training, validation, and test sets at a ratio of 8:1:1 and labelled via one-hot encoding, which allowed the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use ADAM optimization to update the weights in the training phase, with parameters <math>\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999</math>. Binary cross-entropy was used as the loss function. <br />
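The ADAM update with the stated hyperparameters can be sketched as follows (a minimal illustration on a toy one-dimensional objective, not the paper's training loop):<br />

```python
import numpy as np

def adam_update(w, grad, m, v, t,
                alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step with the paper's stated hyperparameters."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# minimise f(w) = w^2 as a toy check; the gradient is 2w
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_update(w, 2 * w, m, v, t)
```

Because ADAM's effective step size is roughly `alpha` regardless of gradient magnitude, the iterate drifts steadily toward the minimum at 0.<br />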
Input spectrogram images are 96x1366. In both the CNN and CRNN models, the data was trained over 100 epochs with a batch size of 50 (limited by computing power). Notable model-specific details are below:<br />
<br />
'''CNN'''<br />
* Five convolutional layers with 3x3 kernel, stride 1, padding, batch normalization, and ReLU activation<br />
* Max pooling layers <br />
* The sigmoid function was used as the output layer<br />
<br />
'''CRNN'''<br />
* Four convolutional layers with 3x3 kernels, stride 1, padding, batch normalization, ReLU activation, and a dropout rate of 0.1, followed by two layers of RNNs with Gated Recurrent Units that model the 2D temporal pattern<br />
* The resulting feature maps of size Nx1x15 (N = number of feature maps, 68 in this case) are fed into the RNN layers<br />
* Four max pooling layers, one per convolutional layer, with kernels (2x2)-(3x3)-(4x4)-(4x4) and matching strides<br />
* The sigmoid function was used as the output layer<br />
<br />
The CNN and CRNN architectures are also given in the charts below.<br />
<br />
[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]<br />
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]<br />
<br />
=== Music Recommendation System ===<br />
<br />
The recommendation system computes the cosine similarity of features extracted from the neural network. Each genre has one song that acts as a centre point for its class. The feature variables are the outputs of the trained networks taken just before the classification layer, and cosine similarity between these feature vectors is used to find the best recommendations. <br />
<br />
Cosine similarity values lie in [-1,1], with larger values indicating songs with more similar features. When the user inputs five songs, those songs become new inputs to the neural networks, and their extracted features are compared via cosine similarity against the rest of the music. The five songs with the largest cosine similarities are recommended.<br />
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]<br />
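The similarity-ranking step can be sketched as follows (a minimal illustration with random vectors standing in for the network's extracted features):<br />

```python
import numpy as np

def recommend(query_feats, library_feats, top_n=5):
    """Rank library songs by cosine similarity to the query's feature
    vector (the features taken before the classification layer)."""
    q = query_feats / np.linalg.norm(query_feats)
    L = library_feats / np.linalg.norm(library_feats, axis=1, keepdims=True)
    sims = L @ q  # cosine similarities, each in [-1, 1]
    order = np.argsort(sims)[::-1][:top_n]
    return order, sims[order]

rng = np.random.default_rng(1)
library = rng.standard_normal((100, 32))              # 100 songs, 32-dim features
query = library[7] + 0.01 * rng.standard_normal(32)   # near-duplicate of song 7
idx, sims = recommend(query, library)
```

As expected, the near-duplicate song ranks first with similarity close to 1.<br />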
<br />
== Evaluation Metrics ==<br />
=== Precision: ===<br />
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)<br />
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre<br />
<br />
=== Recall: ===<br />
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)<br />
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre<br />
<br />
=== F1-Score: ===<br />
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:<br />
<br />
[[File:F1441.png|frame|100px|none|F1-Score]]<br />
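All three metrics can be computed directly from the confusion-matrix counts (a minimal illustration on hypothetical labels, treating "is this song Classical?" as a binary problem):<br />

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN);
    F1 = harmonic mean of precision and recall."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # 1 = actually Classical
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # 1 = predicted Classical
p, r, f1 = precision_recall_f1(y_true, y_pred)
```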
<br />
=== Receiver operating characteristics (ROC): ===<br />
* A graphical metric that is used to assess a classification model at different classification thresholds <br />
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k<br />
* Plots the true positive rate versus false positive rate as the classification threshold is varied<br />
<br />
[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]<br />
<br />
=== Area Under the Curve (AUC) ===<br />
AUC is the area under the ROC curve; by aggregating over all possible classification thresholds, it provides a single summary measure of the classifier's performance.<br />
<br />
In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.<br />
<br />
[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]<br />
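The ranking interpretation of AUC can be checked directly by comparing every positive/negative score pair (a minimal illustration with hypothetical model scores):<br />

```python
def auc_rank(scores_pos, scores_neg):
    """AUC equals the probability that a randomly chosen positive is
    scored higher than a randomly chosen negative (ties count 1/2)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

pos = [0.9, 0.8, 0.4]  # scores for actual Classical songs
neg = [0.7, 0.3, 0.2]  # scores for non-Classical songs
auc = auc_rank(pos, neg)
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9.<br />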
<br />
== Results ==<br />
=== Accuracy Metrics ===<br />
The table below shows the accuracy metrics at a classification threshold of 0.5.<br />
<br />
[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]<br />
On average, CRNN outperforms CNN in true positive and false positive cases. In addition, it is very apparent that false positives are much more frequent for songs in the Instrumental genre, perhaps indicating that more pre-processing needs to be done for songs in this genre or that it should be excluded from the analysis completely since most music incorporates instrumental components.<br />
<br />
<br />
[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]<br />
On average, CRNN outperforms CNN in F1-score. <br />
<br />
<br />
[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]<br />
On average, CRNN also outperforms CNN in AUC metric.<br />
<br />
<br />
CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC compared to the CNN classifier.<br />
<br />
=== Evaluation of Music Recommendation System: ===<br />
<br />
* A listening experiment was performed with 30 participants to assess user responses to given music recommendations.<br />
* Participants choose 5 pieces of music they enjoy and the recommender system generates 5 new recommendations. The participants then evaluate the recommendation by recording whether they liked or disliked the music recommendation<br />
* The recommendation system takes two approaches to the recommendation:<br />
** Method one uses only the value of cosine similarity.<br />
** Method two uses the value of cosine similarity and information on music genre.<br />
*A test of significance of the difference in average user likes between the two methods was performed using a t-statistic:<br />
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]<br />
<br />
Comparing the two methods under <math> H_0: \mu_1 - \mu_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math>, which demonstrates that the increase in average user likes with the addition of music genre information is statistically significant.<br />
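A two-sample t-statistic of this kind can be sketched as follows (a minimal illustration with hypothetical per-participant like counts; the paper's actual data is not reproduced here, and equal variances are assumed):<br />

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled two-sample t-statistic for H0: mu_a - mu_b = 0
    (equal-variance assumption, as in a standard two-sample test)."""
    na, nb = len(a), len(b)
    sp2 = (((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
           / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / np.sqrt(sp2 * (1 / na + 1 / nb))

# hypothetical "liked recommendations out of 5" per participant
method1 = np.array([2, 3, 2, 1, 3, 2, 2, 3])  # cosine similarity only
method2 = np.array([4, 3, 4, 5, 3, 4, 4, 5])  # similarity + genre information
t_stat = two_sample_t(method1, method2)
```

A large negative value, as in the paper, indicates that method two's mean like count is significantly higher.<br />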
<br />
== Conclusion: ==<br />
<br />
The two main conclusions obtained from this paper are:<br />
<br />
* The music genre should be a key feature to increase the predictive capabilities of the music recommendation system.<br />
<br />
* To extract a song's genre from its audio signal and achieve better overall performance, CRNNs are superior to CNNs, as they consider both the frequency features and the time sequence patterns of audio signals. <br />
<br />
To improve the accuracy of the recommender system, the authors suggested adding other music features, such as a tempogram for capturing local tempo.<br />
<br />
== Critiques/ Insights: ==<br />
# It would be helpful if authors bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.<br />
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities. <br />
# It would be better if they go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network.<br />
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie. <br />
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive. A practice that might reduce bias is by coming back to the participants after a couple of days and asking whether they liked the music that was recommended. Often times music "grows" on people and their opinion of a new song may change after some time has passed. <br />
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.<br />
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.<br />
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.<br />
# Is it possible for CNN and CRNN to identify individual songs? The model would be harder to train; based on my experience, the efficiency of CNNs in R is not very high, which could be improved in future work.<br />
# According to the authors, the recommender system calculates the cosine similarity between features extracted from one piece of music and another. Would it be possible to use Euclidean distance or other p-norm distances instead?<br />
# In real-life applications, most music software can recommend music to the listener and then ask whether they liked the recommendation. It would be a nice extension to incorporate this new feedback from the listener.<br />
# Actual music listeners do not listen to one genre of music, and in fact listening to the same track or the same genre would be somewhat unusual. Could this method be used to make recommendations not on genre, but based on other categories? (Such as the theme of the lyrics, the pitch of the singer, or the date published). Would this model be able to differentiate between tracks of varying "lyric vocabulation difficulty"? Or would NLP algorithms be needed to consider lyrics?<br />
# This model can be applied to many other fields, such as recommending news articles in a news app, products on Amazon, or videos on YouTube, based on user information.<br />
# Looks like for the most genres, CRNN outperforms CNN, but CNN did do better on a few genres (like Jazz), so it might be better to mix them together or might use CNN for some genres and CRNN for the rest.<br />
# Cosine similarity is used to find songs with similar patterns as the input ones from users. That is, feature variables are extracted from the trained neural network model before the classification layer, and used as the basis to find similar songs. One potential problem of this approach is that if the neural network classifies an input song incorrectly, the extracted feature vector will not be a good representation of the input song. Thus, a song that is in fact really similar to the input song may have a small cosine similarity value, i.e. not be recommended. In conclusion, if the first classification is wrong, future inferences based on that is going to make it deviate further from the true answer. A possible future improvement will be how to offset this inference error.<br />
# In the tables when comparing performance and accuracies of the CNN and CRNN models on different genres of music, the researchers claimed that CRNN had superior performance to CNN models. This seemed intuitive, especially in the cases when the differences in accuracies were large. However, maybe the researchers should consider including some hypothesis testing statistics in such tables, which would support such claims in a more rigorous manner.<br />
# A music recommender system that doesn't use the song's meta data such as artist and genre and rather tries to classify genre itself seems unproductive. I also believe that the specific artist matters much more than the genre since within a genre you have many different styles. It just seems like the authors hamstring their recommender system by excluding other relevant data.<br />
# The genres posed in the paper are very broad and may not be specific enough to distinguish a listener's actual tastes (e.g., someone may like rock and roll but not punk rock, both of which could fall in the "rock" category). It would be interesting to run similar experiments with more concrete and specific genres to study the possibility of improving the model's accuracy.<br />
# This summary is well organized, with a detailed explanation of the music recommendation algorithm. However, since the data used in this paper was cleaned to improve the efficiency of the recommendation, there should be a section evaluating the impact of noise on the performance of this algorithm and how to minimize that impact.<br />
# This method would work better if users chose some music genres they like during the sign-up process, similar to how Twitter recommends articles.<br />
# I have some feedback for the "Evaluation of Music Recommendation System" section. Firstly, there can be a brief mention of the participants' background information. Secondly, the summary mentions that "participants choose 5 pieces of music they enjoyed". Are they free to choose any music they like, or are they choosing from a pool of selections? What are the lengths of these music pieces? Lastly, method one and method two are compared against each other. It's intuitive that method two will outperform method one, since method two makes use of both cosine similarity and information on music genre, whereas method one only makes use of cosine similarity. Thus, saying method two outperforms method one is not necessarily surprising. I would like to see more explanation on why these methods are chosen, and why comparing them directly is considered to be fair.<br />
# It would be better to have more comparison with other existing music recommender system.<br />
# In the Collecting Music Data section, the author has indicated that for maintaining the balance of data for each genre that they are choosing to omit some genres and a portion of the dataset. However, how this was done was not explained explicitly which can be a concern for results replication. It would be better to describe the steps and measures taken to ensure the actions taken by the teams are reproducible. <br />
# For cleaning data, for training purposes, the team is choosing to omit the ones with lower music quality. While this is a sound option, it can be adjusted that the ratings for the music are deducted to adjust the balance. This could be important since a poor music quality could mean either equipment failure or corrupt server storage or it was a recording of a live performance that often does not have a perfect studio quality yet it would be loved by many real-life users. This omission is not entirely justified and feels like a deliberate adjustment for later results.<br />
# It would be more convincing if the author could provide more comparison between CRNN and CNN.<br />
# How is the result used to recommend songs within genres? It looks like it only predicts what genre the user likes to listen and recommends one of the songs from that genre. How can this recommender system be used to recommend songs within the same genre?<br />
# This [https://arxiv.org/pdf/2006.15795.pdf paper] implements CRNN differently; the CNN and RNN are separate and their resulting matrices and combined later. Would using this version of the CRNN potentially improve the accuracy?<br />
# This kind of approach can be used in implementing other recommender systems for, like movies, articles, news, websites etc. It would be helpful if the author could explain and generalize the implementation on other forms of recommender systems.<br />
# The accuracy of the genre classifier seemed really low, considering how distinct the genres sound to humans. The authors recommend adding features to the data but these could likely be extracted from the audio signal. Extra preprocessing would likely go a long way to improve the accuracy.<br />
# Since it was mentioned that different genres were used, it would be interesting to know if the model can classify different languages and how it performs with songs in different languages.<br />
# It is possible to extend this application to classifying baroque, classical, and romantic genre music. This can be beneficial for students (and frankly, people of all ages) who are learning about music. What's even more interesting to see is if this algorithm can distinguish music pieces written by classical musicians such as Beethoven, Haydn, and Mozart. Of course, it would take more effort in distinguishing features across the music pieces of these three artists, but it's an area worth exploring.<br />
# In contrast to the mel spectrogram method, the popular streaming app Spotify allows you to view data they collect that includes features about the nature of the song, such as acousticness, danceability, loudness, tempo, etc.<br />
# The authors introduced a good recommendation system. It might be helpful to evaluate just based on how similar users are. Meanwhile, there might be some bias need to be considered in the dataset. PCA might be a good way to analyze the existing pattern.<br />
<br />
== References: ==<br />
Nilashi, M., et.al. ''Collaborative Filtering Recommender Systems''. Research Journal of Applied Sciences, Engineering and Technology 5(16):4168-4182, 2013.<br />
Adiyansjah, Alexander A S Gunawan, Derwin Suhartono. ''Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks''. Procedia Computer Science, https://doi.org/10.1016/j.procs.2019.08.146.</div>

Efficient kNN Classification with Different Numbers of Nearest Neighbors (statwiki, last edited 2020-12-07 by Y2587wan)
<hr />
<div>== Presented by == <br />
Cooper Brooke, Daniel Fagan, Maya Perelman<br />
<br />
== Introduction == <br />
Traditional model-based approaches to classification require training a model on training observations before predicting test samples. In contrast, the model-free k-Nearest Neighbors (kNN) method classifies observations with a majority-rule approach, labeling each test point based on its k closest training observations (neighbors). This method has become very popular due to its relatively robust performance given how simple it is to implement. It is robust because the prediction depends only on the labels of the closest observations and is therefore not significantly affected by outliers.<br />
<br />
There are two main approaches to conducting kNN classification with respect to the choice of k. The first uses a fixed k value to classify all test samples, while the second uses a different k value each time, either a different k for each test sample or a different k for each class. The former, while easy to implement, has been shown to be impractical in real-world machine learning applications. It is more reasonable and practical to select a unique value of k for each test sample to allow for a better fit of the data. It is therefore of immense interest to develop an efficient way to determine the optimal k value for each test sample. The authors of this paper present the kTree and k*Tree methods to address this research question.<br />
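The basic kNN majority-vote rule that both fixed-k and varied-k approaches build on can be sketched as follows (a minimal illustration on toy data, not the paper's kTree implementation):<br />

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k):
    """Classify one test point by majority vote among its k nearest
    training neighbours (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 0, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.15, 0.1]), k=3)
```

Here the choice of k is passed per call, which is exactly the degree of freedom the kTree and k*Tree methods aim to set efficiently per test sample.<br />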
<br />
== Previous Work and Motivation== <br />
<br />
The problem of finding an optimal fixed k value for all test samples is well studied. Lall and Sharma [9] incorporated a certainty factor measure to solve for an optimal fixed k, concluding that k should be <math>\sqrt{n}</math> (where n is the number of training samples) when n > 100. The method Song et al. [2] explored involves selecting a subset of the most informative samples from neighbourhoods. Vincent and Bengio [3] took the unique approach of designing a k-local hyperplane distance to solve for k. Premachandran and Kakarala [4] selected a robust k using the consensus of multiple rounds of kNN. These fixed-k methods are valuable but impractical for data mining and machine learning applications. <br />
<br />
Finding an efficient approach to assigning varied k values has also been previously studied. Tuning approaches such as those taken by Zhu et al. and Sahugara et al. have been popular. Zhu et al. [5] determined that optimal k values should be chosen using cross-validation, while Sahugara et al. [6] proposed using Monte Carlo validation to select varied k parameters. Other learning approaches, such as those taken by Zheng et al. and Góra and Wojna, also show promise. Zheng et al. [7] applied a reconstruction framework to learn suitable k values. Góra and Wojna [8] proposed using rule induction and instance-based learning to learn optimal k values for each test sample. While all these methods are valid, their processes of either learning an optimal k value for each test sample or scanning all training samples to find nearest neighbours are time-consuming. It is challenging to simultaneously address these issues of the kNN method: learning optimal k values for different samples, reducing time cost, and improving performance.<br />
<br />
Due to the previously mentioned drawbacks of fixed-k and current varied-k kNN classification, the paper’s authors sought to design a new approach to solve for different k values. The kTree and k*Tree approach seek to calculate optimal values of k while avoiding computationally costly steps such as cross-validation.<br />
<br />
A secondary motivation of this research was to ensure that the kTree method would perform better than kNN using fixed values of k given that running costs would be similar in this instance.<br />
<br />
== Approach == <br />
<br />
<br />
=== kTree Classification ===<br />
<br />
The proposed kTree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_1.png | center | 800x800px]]<br />
<br />
==== Reconstruction ====<br />
<br />
The first step is to use the training samples to reconstruct themselves. The goal of this is to find the matrix of correlations between the training samples themselves, <math>\textbf{W}</math>, such that the distance between an individual training sample and the corresponding correlation vector multiplied by the entire training set is minimized. This least square loss function where <math>\mathbf{X}\in \mathbb{R}^{d\times n} = [x_1,...,x_n]</math> represents the training set which can be written as:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2<br />
\end{aligned}$$<br />
<br />
In addition, a regularization term can be added to avoid the issue of singularity and increase the robustness of the reconstruction:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho||\textbf{W}||^2_2<br />
\end{aligned}$$<br />
<br />
<math>l_2</math> regularization is most commonly used to reduce overfitting in regression; the resulting problem is ridge regression, which has the closed-form solution <math>W = (X^TX+\rho I)^{-1}X^TX</math>. However, the objective function with <math>l_2</math> regularization does not produce a sparse result. With the goal of increasing computational efficiency, we follow the literature and adopt <math>l_1</math> regularization instead:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1, \quad \textbf{W}\geq 0<br />
\end{aligned}$$<br />
<br />
Generally, the larger the value of <math>\rho_1</math>, the sparser the weight matrix <math>\textbf{W}</math>. The least square loss function is then further modified to account for the intuition that samples with similar values for certain features should yield similar results. This is penalized with the function: <br />
<br />
$$\frac{1}{2} \sum^{d}_{i,j} s_{ij}||x^i\textbf{W}-x^j\textbf{W}||^2_2$$<br />
<br />
where <math>s_{ij}</math> denotes the relation between the ith and jth feature vectors, calculated using a radial basis function kernel. After some transformations, this second regularization term, with tuning parameter <math>\rho_2</math>, becomes:<br />
<br />
$$\begin{aligned}<br />
R(W) = Tr(\textbf{W}^T \textbf{X}^T \textbf{LXW})<br />
\end{aligned}$$<br />
<br />
where <math>\mathbf{L}</math> is a Laplacian matrix that indicates the relationship between features. The Laplacian matrix, also called the graph Laplacian, is a matrix representation of a graph. <br />
<br />
This gives a final objective function of:<br />
<br />
$$\begin{aligned}<br />
\mathop{min}_{\textbf{W}} \sum_{i=1}^n ||Xw_i - x_i||^2 + \rho_1||\textbf{W}||_1 + \rho_2R(\textbf{W})<br />
\end{aligned}$$<br />
<br />
Since this is a convex function, an iterative method can be used to find the optimal solution <math>\mathbf{W^*}</math>.<br />
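<br />
For intuition, the per-sample subproblems of the <math>l_1</math>-regularized reconstruction (ignoring the Laplacian term for brevity) can be solved with a simple projected proximal-gradient (ISTA) loop. This is only an illustrative numpy sketch under those simplifying assumptions, not the paper's iterative solver; names such as <code>reconstruct_W</code> are our own:<br />

```python
import numpy as np

def nonneg_lasso(A, b, rho, n_iter=500):
    """Projected ISTA for min ||A x - b||^2 + rho * sum(x), x >= 0.
    (For non-negative x the l1 norm reduces to the sum of the entries.)"""
    step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2 * A.T @ (A @ x - b)
        x = np.maximum(0.0, x - step * (grad + rho))
    return x

def reconstruct_W(X, rho1=0.05):
    """X has shape (d, n): d features, n training samples, as in the paper.
    Column i of the returned matrix holds the weights reconstructing x_i."""
    d, n = X.shape
    W = np.zeros((n, n))
    for i in range(n):
        mask = np.arange(n) != i               # a sample does not reconstruct itself
        W[mask, i] = nonneg_lasso(X[:, mask], X[:, i], rho1)
    return W

rng = np.random.default_rng(0)
W = reconstruct_W(rng.standard_normal((5, 8)))
print(W.shape)                                  # → (8, 8)
```

Larger values of <code>rho1</code> drive more entries of <math>\textbf{W}</math> to exactly zero, matching the sparsity discussion above.<br />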
<br />
==== Calculate ''k'' for training set ====<br />
<br />
Each element <math>w_{ij}</math> in <math>\textbf{W*}</math> represents the correlation between the ith and jth training samples. If a value is 0, the jth training sample has no effect on the ith training sample, which means it should not be used in predicting the ith training sample. Consequently, only the training samples with non-zero weights in <math>w_i</math> are useful in predicting the ith training sample, so the number of non-zero elements for each sample equals the optimal ''k'' value for that sample.<br />
<br />
For example, if there was a 4x4 training set where <math>\textbf{W*}</math> had the form:<br />
<br />
[[File:Approach_Figure_2.png | center | 300x300px]]<br />
<br />
The optimal ''k'' value for training sample 1 would be 2 since the correlation between training sample 1 and both training samples 2 and 4 are non-zero.<br />
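<br />
In code, this counting step is trivial. The matrix below is hypothetical, mirroring the 4×4 example above, and assumes column i holds the weights reconstructing sample i (following <math>Xw_i \approx x_i</math>):<br />

```python
import numpy as np

# Hypothetical 4x4 W* mirroring the example: column i holds the reconstruction
# weights for training sample i, with zero self-correlation on the diagonal.
W_star = np.array([
    [0.0, 0.3, 0.0, 0.5],
    [0.2, 0.0, 0.4, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.6, 0.0, 0.0, 0.0],
])

# Optimal k for each sample = number of non-zero weights in its column.
optimal_k = np.count_nonzero(W_star, axis=0)
print(optimal_k)   # → [2 2 1 1]; samples 2 and 4 contribute to sample 1, so its k = 2
```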
<br />
==== Train a Decision Tree using ''k'' as the label ====<br />
<br />
A decision tree is trained using the traditional ID3 method:<br />
(1) calculate the entropy of every feature in the data set,<br />
(2) split the data set on the feature whose entropy is minimized after splitting (in the example below, this was feature a'),<br />
(3) make a decision tree node based on that feature,<br />
(4) repeat steps (1)-(3) recursively on the resulting subsets using the remaining features,<br />
with the label replaced by the previously learned optimal ''k'' value for each sample. More specifically, whereas in a normal decision tree the target data are the labels themselves, in the kTree method the target data are the optimal ''k'' values for each sample solved for in the previous step. As a result, the decision tree formed by the kTree method has the following form:<br />
<br />
[[File:Approach_Figure_3.png | center | 300x300px]]<br />
<br />
==== Making Predictions for Test Data ====<br />
<br />
The optimal ''k'' values for each testing sample are easily obtainable using the kTree solved for in the previous step. The only remaining step is to predict the labels of the testing samples by finding the majority class of the optimal ''k'' nearest neighbors across '''all''' of the training data.<br />
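<br />
The last two steps can be sketched end to end. The decision tree itself is standard (any ID3-style implementation trained with the ''k'' values as targets works); as a hypothetical stand-in for the tree lookup, the sketch below reads the optimal ''k'' off the nearest training sample, then classifies with kNN over all training data. All names and data here are illustrative:<br />

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

def ktree_predict(X_train, y_train, k_train, X_test):
    """Sketch of the kTree test stage. The real method reads the optimal k for a
    test sample off a decision tree trained on (X_train, k_train); here, as a
    hypothetical stand-in, we take the k of the single nearest training sample,
    then run kNN over ALL training data with that k."""
    preds = []
    for x in X_test:
        i = np.argmin(np.linalg.norm(X_train - x, axis=1))
        preds.append(knn_predict(X_train, y_train, x, k_train[i]))
    return np.array(preds)

# Tiny illustration with two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
y_train = np.array([0, 0, 0, 1, 1])
k_train = np.array([2, 2, 2, 1, 1])   # per-sample optimal k from the previous step
print(ktree_predict(X_train, y_train, k_train,
                    np.array([[0.05, 0.05], [5.05, 5.0]])))   # → [0 1]
```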
<br />
=== k*Tree Classification ===<br />
<br />
The proposed k*Tree method is illustrated by the following flow chart:<br />
<br />
[[File:Approach_Figure_4.png | center | 1000x1000px]]<br />
<br />
Clearly, this is a very similar approach to kTree: the k*Tree method sacrifices very little predictive power in return for a substantial decrease in complexity when running traditional kNN on the testing data once the optimal ''k'' values have been found.<br />
<br />
While all previous steps are exactly the same, the difference comes from the additional data stored in the leaf nodes. The k*Tree method stores not only the optimal ''k'' value but also the following information:<br />
<br />
* The training samples that have the same optimal ''k''<br />
* The ''k'' nearest neighbours of the previously identified training samples<br />
* The nearest neighbor of each of the previously identified ''k'' nearest neighbours<br />
<br />
The data stored in each node is summarized in the following figure:<br />
<br />
[[File:Approach_Figure_5.png | center | 800x800px]]<br />
<br />
When testing, the constructed k*Tree is searched for a test sample's optimal ''k'' value as well as its nearest neighbours in the leaf node. It then selects the nearest neighbours from this subset of training samples and assigns the test sample the majority label of these nearest neighbours.<br />
<br />
In the kTree method, predictions were made based on all of the training data, whereas in the k*Tree method, predicting the test labels will only be done using the samples stored in the applicable node of the tree.<br />
<br />
== Experiments == <br />
<br />
In order to assess the performance of the proposed method against existing methods, a number of experiments were performed to measure classification accuracy and run time. The experiments were run on twenty public datasets from the UCI Repository of Machine Learning Data, containing a mix of data types that varied in size, dimensionality, number of classes, and degree of class imbalance. Ten-fold cross-validation was used to measure classification accuracy, and the following methods were compared against:<br />
<br />
# k-Nearest Neighbor: The classical kNN approach with k set to 1, 5, 10, 20, and the square root of the sample size [9]; the best result was reported.<br />
# kNN-Based Applicability Domain Approach (AD-kNN) [11]<br />
# kNN Method Based on Sparse Learning (S-kNN) [10]<br />
# kNN Based on Graph Sparse Reconstruction (GS-kNN) [7]<br />
# Filtered Attribute Subspace-based Bagging with Injected Randomness (FASBIR) [12], [13]<br />
# Landmark-based Spectral Clustering kNN (LC-kNN) [14]<br />
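<br />
The ten-fold cross-validation protocol can be sketched for the fixed-k baseline (method 1 above) with synthetic data standing in for a UCI set; the function names and data are illustrative, not the paper's code:<br />

```python
import numpy as np
from collections import Counter

def knn_accuracy_cv(X, y, k, n_folds=10, seed=0):
    """Ten-fold cross-validation accuracy of fixed-k kNN (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    correct = 0
    for f in range(n_folds):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        for i in test:
            d = np.linalg.norm(X[train] - X[i], axis=1)
            nearest = train[np.argsort(d)[:k]]
            pred = Counter(y[nearest]).most_common(1)[0][0]
            correct += pred == y[i]
    return correct / len(X)

# Two Gaussian blobs standing in for a UCI dataset.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
for k in (1, 5, 10):
    print(k, knn_accuracy_cv(X, y, k))
```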
<br />
The experimental results were then assessed based on classification tasks that focused on different sample sizes, and tasks that focused on different numbers of features. <br />
<br />
<br />
'''A. Experimental Results on Different Sample Sizes'''<br />
<br />
The running cost and (cross-validation) classification accuracy based on experiments on ten UCI datasets can be seen in Table I below.<br />
<br />
[[File:Table_I_kNN.png | center | 1000x1000px]]<br />
<br />
The following key results are noted:<br />
* Regarding classification accuracy, the proposed methods (kTree and k*Tree) outperformed kNN, AD-KNN, FASBIR, and LC-kNN on all datasets by 1.5%-4.5%, but had no notable improvements compared to GS-kNN and S-kNN.<br />
* Classification methods which involved learning optimal k-values (for example the proposed kTree and k*Tree methods, or S-kNN, GS-kNN, AD-kNN) outperformed the methods with predefined k-values, such as traditional kNN.<br />
* The proposed k*Tree method had the lowest running cost of all methods. It was still outperformed in terms of classification accuracy by GS-kNN and S-kNN, but ran on average 15,000 times faster than either method. In addition, the kTree had the highest accuracy and its running cost was lower than that of any other method except the k*Tree method.<br />
<br />
<br />
'''B. Experimental Results on Different Feature Numbers'''<br />
<br />
The goal of this section was to evaluate the robustness of all methods under differing numbers of features; results can be seen in Table II below. The Fisher score, an algorithm that solves maximum likelihood equations numerically [15], was used to rank and select the most informative features in the datasets. <br />
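<br />
For intuition, the Fisher score used for feature selection is commonly computed per feature as the ratio of between-class variance of the class means to the pooled within-class variance (a sketch of that common variant; the paper's exact formulation may differ, and the data below are illustrative):<br />

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: between-class variance of the class means
    divided by the pooled within-class variance. Higher = more informative."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / den

# Feature 0 separates the classes; feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.r_[rng.normal(0, 1, 50), rng.normal(3, 1, 50)],
                     rng.normal(0, 1, 100)])
y = np.r_[np.zeros(50), np.ones(50)]
scores = fisher_score(X, y)
ranking = np.argsort(scores)[::-1]
print(ranking)   # feature 0 is ranked first
```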
<br />
[[File:Table_II_kNN.png | center | 1000x1000px]]<br />
<br />
From Table II, the proposed kTree and k*Tree approaches outperformed kNN, AD-kNN, FASBIR and LC-KNN when tested for varying feature numbers. The S-kNN and GS-kNN approaches remained the best in terms of classification accuracy, but were greatly outperformed in terms of running cost by k*Tree. The cause for this is that k*Tree only scans a subsample of the training samples for kNN classification, while S-kNN and GS-kNN scan all training samples.<br />
<br />
== Conclusion == <br />
<br />
This paper introduced two novel approaches to kNN classification that determine optimal k values for each test sample. The proposed kTree and k*Tree methods classify test samples efficiently and effectively by introducing a training step that reduces the run time of the test stage and thus enhances performance. Based on the experimental results for varying sample sizes and feature numbers, the proposed methods outperformed existing ones in terms of running cost while achieving similar or better classification accuracies. Future work could focus on improving kTree and k*Tree for high-dimensional data.<br />
<br />
== Critiques == <br />
<br />
*The paper only assessed classification accuracy through cross-validation accuracy. However, it would be interesting to investigate how the proposed methods perform using different metrics, such as AUC, precision-recall curves, or in terms of holdout test data set accuracy. <br />
* The authors addressed that some of the UCI datasets contained imbalanced data (such as the Climate and German data sets) while others did not. However, the nature of the class imbalance was not extreme, and the effect of imbalanced data on algorithm performance was not discussed or assessed. Moreover, it would have been interesting to see how the proposed algorithms performed on highly imbalanced datasets in conjunction with common techniques to address imbalance (e.g. oversampling, undersampling, etc.). <br />
*While the authors contrast their kTree and k*Tree approach with different kNN methods, the paper could contrast its results with more of the approaches discussed in the Related Work section. For example, it would be interesting to see how the kTree and k*Tree results compare to Góra and Wojna's varied optimal k method.<br />
<br />
* The paper conducted an experiment on kNN, AD-kNN, S-kNN, GS-kNN,FASBIR and LC-kNN with different sample sizes and feature numbers. It would be interesting to discuss why the running cost of FASBIR is between that of kTree and k*Tree in figure 21.<br />
<br />
* A different [https://iopscience.iop.org/article/10.1088/1757-899X/725/1/012133/pdf paper] also discusses optimizing the K value for the kNN algorithm in clustering. However, this paper suggests using the expectation-maximization algorithm as a means of finding the optimal k value.<br />
<br />
* It would be nice to have a comparison of the running costs of the different methods to see how much faster kTree and k*Tree performed.<br />
<br />
* It would be better to show only the key results in a summary rather than stacking up all results without screening.<br />
<br />
* In the results section, it was mentioned that in the experiment on data sets with different numbers of features, the kTree and k*Tree model did not achieve GS-kNN or S-kNN's accuracies, but was faster in terms of running cost. It might be helpful here if the authors add some more supporting arguments about the benefit of this tradeoff, which appears to be a minor decrease in accuracy for a large improvement in speed. This could further showcase the advantages of the kTree and k*Tree models. More quantitative analysis or real-life scenario examples could be some choices here.<br />
<br />
* An interesting thing to notice while solving for the optimal matrix <math>W^*</math> that minimizes the loss function is that <math>W^*</math> is not necessarily a symmetric matrix. That is, the correlation between the <math>i^{th}</math> entry and the <math>j^{th}</math> entry is different from that between the <math>j^{th}</math> entry and the <math>i^{th}</math> entry, which makes the resulting W* not really semantically meaningful. Therefore, it would be interesting if we may set a threshold on the allowing difference between the <math>ij^{th}</math> entry and the <math>ji^{th}</math> entry in <math>W^*</math> and see if this new configuration will give better or worse results compared to current ones, which will provide better insights of the algorithm.<br />
<br />
* It would be interesting to see how the proposed model works with highly non-linear datasets. In the event it does not work well, it would pose the question: would replacing the k*Tree with a SVM or a neural network improve the accuracy? There could be experiments to show if this variant would prove superior over the original models.<br />
<br />
* The key results are a little misleading. For example, the claim that "the kTree had the highest accuracy and its running cost was lower than any other methods except the k*Tree method" is false: the kTree method had slightly lower accuracy than both GS-kNN and S-kNN, and kTree was also slower than LC-kNN.<br />
<br />
* I want to point to the discussion on k*Tree's structure. In order for k*Tree to work effectively, its leaf nodes need to store additional information. In addition to the optimal k value, they also store the training samples that have that optimal k and the k nearest neighbours of those training samples. How big of an impact does this structure have on storage cost? Since the number of leaf nodes can be large, the storage cost may be large as well. This could potentially make k*Tree ineffective in practice, especially for very large datasets.<br />
<br />
* It would be better if the author explained the kTree method in more detail, along with the similarities between the kTree and kNN methods.<br />
<br />
* Even though we are given a table with average accuracy and mean running cost, it would have been nice to see a direct visual comparison in the figures below. In addition to comparing against other algorithms, it would be helpful to report the average expected cost of these algorithms as a control, or standard, for accuracy and compute cost, to fully assess the overall efficacy of running such a classification algorithm.<br />
<br />
* It isn't clearly mentioned what the definitions, similarities, and differences between the kTree and kNN methods are. If the authors had put some detailed explanation at the beginning, the flow of the paper would have been much better.<br />
<br />
* It would be better to know whether the paper indicates a performance difference between small and large datasets. Would the performance increase be negligible on datasets with few features?<br />
<br />
* It would be clearer if the experiments connected more tightly with the approach section, even by just mentioning how the approach was applied to obtain these results.<br />
<br />
* It would be better if the author had provided several paragraphs discussing the complexity of these models. It seems like the highlight of kTree is that it offers similar performance at a significantly lower cost.<br />
<br />
* The author should compare the complexity of the different algorithms that obtain the optimal k value, and discuss the pros and cons of each approach.<br />
<br />
* It might be difficult to say that kTree classification is better than kNN; a larger dataset and hyperparameter tuning on both methods might be required.<br />
<br />
== References == <br />
<br />
[1] C. Zhang, Y. Qin, X. Zhu, and J. Zhang, “Clustering-based missing value imputation for data preprocessing,” in Proc. IEEE Int. Conf., Aug. 2006, pp. 1081–1086.<br />
<br />
[2] Y. Song, J. Huang, D. Zhou, H. Zha, and C. L. Giles, “IKNN: Informative K-nearest neighbor pattern classification,” in Knowledge Discovery in Databases. Berlin, Germany: Springer, 2007, pp. 248–264.<br />
<br />
[3] P. Vincent and Y. Bengio, “K-local hyperplane and convex distance nearest neighbor algorithms,” in Proc. NIPS, 2001, pp. 985–992.<br />
<br />
[4] V. Premachandran and R. Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in Proc. CVPR, Jun. 2013, pp. 1594–1601.<br />
<br />
[5] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, “Missing value estimation for mixed-attribute data sets,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110–121, Jan. 2011.<br />
<br />
[6] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013.<br />
<br />
[7] S. Zhang, M. Zong, K. Sun, Y. Liu, and D. Cheng, “Efficient kNN algorithm based on graph sparse reconstruction,” in Proc. ADMA, 2014, pp. 356–369.<br />
<br />
[8] X. Zhu, L. Zhang, and Z. Huang, “A sparse embedding and least variance encoding approach to hashing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3737–3750, Sep. 2014.<br />
<br />
[9] U. Lall and A. Sharma, “A nearest neighbor bootstrap for resampling hydrologic time series,” Water Resour. Res., vol. 32, no. 3, pp. 679–693, 1996.<br />
<br />
[10] D. Cheng, S. Zhang, Z. Deng, Y. Zhu, and M. Zong, “KNN algorithm with data-driven k value,” in Proc. ADMA, 2014, pp. 499–512.<br />
<br />
[11] F. Sahigara, D. Ballabio, R. Todeschini, and V. Consonni, “Assessing the validity of QSARS for ready biodegradability of chemicals: An applicability domain perspective,” Current Comput.-Aided Drug Design, vol. 10, no. 2, pp. 137–147, 2013. <br />
<br />
[12] Z. H. Zhou and Y. Yu, “Ensembling local learners throughmultimodal perturbation,” IEEE Trans. Syst. Man, B, vol. 35, no. 4, pp. 725–735, Apr. 2005.<br />
<br />
[13] Z. H. Zhou, Ensemble Methods: Foundations and Algorithms. London, U.K.: Chapman & Hall, 2012.<br />
<br />
[14] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient kNN classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, Jun. 2016.<br />
<br />
[15] K. Tsuda, M. Kawanabe, and K.-R. Müller, “Clustering with the fisher score,” in Proc. NIPS, 2002, pp. 729–736.</div>

Loss Function Search for Face Recognition (2020-12-07)
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face with a specific identity. The field involves two tasks: 1. identifying and classifying a face to a certain identity, and 2. verifying whether this face image and another face image map to the same identity. Loss functions play an important role in evaluating how well a prediction models the given data; in face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. A discriminative feature is one that is able to successfully discriminate the labeled data, and is typically a result of feature engineering/selection. However, the traditional softmax loss lacks the power of feature discrimination. To address this, a center loss was developed that learns centers for each identity to enhance intra-class compactness. The paper introduces a new loss function that uses a scale parameter to produce higher gradients for well-separated samples, thereby reducing the softmax probability. <br />
<br />
Margin-based (angular, additive, and additive angular margin) softmax loss functions are important for learning discriminative features in face recognition. Several hand-crafted methods requiring much effort have previously been developed, such as A-Softmax, V-Softmax, AM-Softmax, and Arc-Softmax. Li et al. proposed an AutoML loss function search method, known as AM-LFS, from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to search loss functions during the training process, though its drawback is a complex and unstable search space.<br />
<br />
'''Soft Max'''<br />
<br />
Softmax probability is the probability for each class: a vector of values between 0 and 1 that sum to 1. Cross-entropy loss is the negative sum of the target values times the logs of the predicted probabilities. When the softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k \neq y}^K{e^{w^T_kx}}}</math> [1] </center><br />
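<br />
As a quick illustrative sketch (not from the paper), the softmax loss above for a single sample can be computed in numpy, using a max-shift for numerical stability:<br />

```python
import numpy as np

def softmax_loss(logits, y):
    """Cross-entropy of the softmax probabilities for one sample.
    logits[k] = w_k^T x for each of the K classes; y is the true class index."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log softmax
    return -log_probs[y]

logits = np.array([2.0, 1.0, 0.1])
print(round(float(softmax_loss(logits, 0)), 4))   # → 0.417
```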
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{{w_k},x})}}}}</math> [1] </center><br />
<br />
where <math> \cos{(\theta_{{w_k},x})} = w^T_k x </math> is the cosine similarity and <math>\theta_{{w_k},x}</math> is the angle between <math> w_k</math> and <math>x</math>. The features learnt with this softmax loss are prone to be separable (as desired).<br />
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{{w_k},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss:''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math>, where <math>m_1 \geq 1</math> is an integer.<br />
<br />
'''Arc-Softmax Loss:''' <math>f{(m_2,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where <math>m_2 > 0</math>.<br />
<br />
'''AM-Softmax Loss:''' <math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where <math>m_1 \geq 1</math> is an integer and <math>m_2, m_3 > 0</math>.<br />
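<br />
A hedged numpy sketch of the general margin-based loss <math>L_3</math>: cosine similarities between the normalized feature and normalized class weights, with the target logit replaced by <math>s \cdot f(m,\theta)</math>. The additive cosine margin <math>f(m,\theta)=\cos\theta - m</math> (one AM-Softmax-style choice among those above) is used for concreteness, and all variable names and values are illustrative:<br />

```python
import numpy as np

def margin_softmax_loss(x, W, y, s=30.0, m=0.35):
    """Margin-based softmax loss for one sample, with the additive cosine margin
    f(m, theta) = cos(theta) - m. x: (d,) feature; W: (d, K) class weights;
    y: true class index; s: scale parameter."""
    cos = (W / np.linalg.norm(W, axis=0)).T @ (x / np.linalg.norm(x))  # (K,) cosines
    logits = s * cos
    logits[y] = s * (cos[y] - m)              # apply the margin to the target class only
    z = logits - logits.max()                 # numerical stability
    return -(z[y] - np.log(np.exp(z).sum()))

rng = np.random.default_rng(0)
x, W = rng.standard_normal(8), rng.standard_normal((8, 4))
# A positive margin can only lower the target logit, so the loss never decreases.
print(margin_softmax_loss(x, W, 0, m=0.35) >= margin_softmax_loss(x, W, 0, m=0.0))  # → True
```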
<br />
<br />
<br />
In this paper, the authors first identified that reducing the softmax probability is key to feature discrimination and designed two search spaces (a random and a reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms on nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs with metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost of these functions is high. This drawback prompted the redesign of the classical softmax loss, which cannot discriminate features well. Multiple softmax loss functions have since been developed, including margin-based formulations, which often require fine-tuning of parameters and are susceptible to instability. Researchers therefore need to put considerable effort into designing their methods within this large design space. AM-LFS takes an optimization approach to selecting hyperparameters for margin-based softmax functions, but its aforementioned drawbacks stem from the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s\left(\cos{(\theta_{w_y,x})}-f{(m,\theta_{w_y,x})}\right)}</math> and <math>a\leq 0</math></center><br />
<br />
<math>a</math> is considered a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the specific margin function <math>f</math>, it is the reduction of the softmax probability through the modulating function that drives feature discrimination.<br />
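<br />
The effect of the modulating function can be checked numerically: for any <math>a \leq 0</math> and <math>p \in (0,1]</math>, <math>h(a,p)</math> lies in <math>(0,1]</math>, so the modulated probability <math>p_m = h(a,p)\,p</math> never exceeds <math>p</math>. A quick sketch:<br />

```python
import numpy as np

def modulated_prob(p, a):
    """p_m = h(a, p) * p with h(a, p) = 1 / (a*p + (1 - a))."""
    return p / (a * p + (1.0 - a))

p = np.linspace(0.01, 1.0, 100)
for a in (0.0, -0.5, -2.0):
    pm = modulated_prob(p, a)
    assert np.all(pm <= p + 1e-12)     # p_m <= p whenever a <= 0
print("p_m <= p holds for all tested a <= 0")
```

Note that at <math>p = 1</math> the denominator equals 1, so <math>p_m = p</math>: confident samples are left untouched while less-separated ones are down-weighted.<br />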
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>), which is also constrained, whereas AM-LFS has 2M unconstrained parameters specifying the piecewise linear functions it requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative, because the modulated probability can be larger than the original softmax probability.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math>, whose impact on the training procedure is decided by the modulating factor <math>a</math>. To validate the unified formulation, a modulating factor is randomly set at each training epoch; this is denoted Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search provides no guidance for training. To solve this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model: it does not need labelled input/output pairs, nor does it need sub-optimal actions to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking actions that maximize the notion of cumulative reward [3]. The process of RL is shown in Figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is used to learn the mean <math>\mu</math> of the sampling distribution of the modulating factor for the softmax equation, using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>, and <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> is updated after each epoch using the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>g(a_i; \mu, \sigma)</math> is the PDF of a Gaussian distribution. The distribution of <math>{a}</math> is updated, and the best model is found from the <math>{B}</math> candidates for the next epoch.<br />
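<br />
The update rule for <math>\mu</math> is a standard score-function (REINFORCE) estimator: for a Gaussian, <math>\nabla_\mu \log g(a;\mu,\sigma) = (a-\mu)/\sigma^2</math>. A toy sketch with a hypothetical quadratic reward peaking at <math>a=-1</math> standing in for validation accuracy (all names and constants are illustrative, not from the paper):<br />

```python
import numpy as np

def reward(a):
    # Hypothetical stand-in for validation accuracy of a model trained with a.
    return -(a + 1.0) ** 2

def update_mu(mu, sigma=0.2, B=32, eta=0.05, rng=None):
    """One epoch of the reward-guided search: sample B candidates a_i ~ N(mu, sigma),
    then move mu along the score-function gradient weighted by the rewards."""
    a = rng.normal(mu, sigma, B)
    R = reward(a)
    R = R - R.mean()   # subtract a baseline to reduce variance (common trick)
    return mu + eta * np.mean(R * (a - mu) / sigma ** 2)

rng = np.random.default_rng(0)
mu = -3.0
for _ in range(200):
    mu = update_mu(mu, rng=rng)
print(round(mu, 2))   # drifts toward the reward peak at a = -1
```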
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem: a hierarchy of two optimization tasks, an upper-level (leader) problem and a lower-level (follower) problem, involving hyperparameters ({<math>a_1,a_2,…,a_B</math>}) where one objective function is maximized while the other is minimized simultaneously:<br />
<br />
<center><math>\max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=\arg\min_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for face recognition problems. That is, there shall be no overlapping identities between training and testing sets. As a result, there were a total of 15,414 identities removed from the testing sets. For fairness during comparison, all summarized results will be based on refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is no noticeable difference between the algorithms proposed in this paper and the other algorithms; AM-Softmax achieved higher results than Search-Softmax, while Random-Softmax achieved the highest results, by 0.03%.<br />
<br />
Random-Softmax outperforms the baseline softmax and is comparable to most of the margin-based softmax losses. Search-Softmax boosts performance further and beats most methods; in particular, when training on the CASIA-WebFace-R data set, it achieves a 0.72% average improvement over AM-Softmax. The model proposed in the paper gives better results because its optimization strategy helps boost the discrimination power, and because the candidates sampled from the paper's proposed search space can closely approximate the margin-based loss functions. More tests on more complicated protocols are needed to assess the performance further: little improvement is visible on these test sets because they are relatively simple and the performance of all methods on them is near saturation. The following table summarizes the performance of each model.<br />
<br />
<center>Table 1. Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias and consists of four subsets: Caucasian, Indian, Asian, and African. Using it as the test set, Random-Softmax and Search-Softmax performed better than the other methods. Random-Softmax outperforms the baseline softmax by a large margin, which suggests that reducing the softmax probability enhances feature discrimination for face recognition. It is also observed that the reward-guided Search-Softmax method is more likely to enhance discriminative feature learning, resulting in higher performance, as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 accuracy and the verification (Veri.) true positive rate (TPR) at a low false acceptance rate (FAR) of <math>10^{-3}</math> on MegaFace, and the identification TPR@FAR = <math>10^{-6}</math> and the verification TPR@FAR = <math>10^{-9}</math> on Trillion-Pairs, are reported in Tables 4 and 5.<br />
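For reference, the verification metric used here (TPR at a fixed FAR) can be computed from genuine-pair and impostor-pair similarity scores. A minimal sketch, assuming NumPy; the threshold is taken from the impostor score distribution:

```python
import numpy as np

def tpr_at_far(genuine_scores, impostor_scores, far=1e-3):
    """True positive rate at the similarity threshold that yields the
    target false acceptance rate on the impostor (non-match) scores."""
    impostor = np.sort(np.asarray(impostor_scores))[::-1]   # descending
    k = max(int(far * len(impostor)), 1)                    # impostors accepted at this FAR
    threshold = impostor[k - 1]
    return float(np.mean(np.asarray(genuine_scores) > threshold))
```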
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all alternative methods. On MegaFace, Search-Softmax beats its best competitor, AM-Softmax, by a large margin. It also outperforms AM-LFS thanks to the newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed for the other measures. A similar trend holds on Trillion-Pairs, where Search-Softmax loss is superior, with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both identification and verification. Based on these experiments, Search-Softmax loss performs well, especially at a low false positive rate, and shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
The paper argued that in order to enhance feature discrimination for face recognition, it is crucial to reduce the softmax probability. To achieve this goal, a unified formulation for the margin-based softmax losses was designed. Two search methods were developed, one random and one reward-guided, and they were validated to be effective against six other methods on nine different test data sets. While the developed methods were generally more effective at increasing accuracy than previous methods, there is very little difference between the two, though Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method would otherwise achieve.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data sets used to evaluate Search-Softmax and Random-Softmax are simple, and other methods already saturate on them. As a result, the proposed methods show little advantage, since all methods produce very similar results. A more complicated data set needs to be tested to establish the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of the testing sets, as only the clean versions of CASIA-WebFace and MS-Celeb-1M-v1c were used for training, rather than the original versions of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications, such as identifying disguised prisoners for police. But good data preprocessing is needed, otherwise the predicted results may be poor; the authors did not discuss data preprocessing, which is a key part of this model.<br />
* It would be better if we knew what kinds of noise were removed in the clean version. Also, simply discarding the overlapping identities is wasteful; it would be better to assign them to either the training or the test set.<br />
* This paper indicates that the new search method and loss function produce more effective face recognition results than the other six methods. But there is no mention of the increase or decrease in computational efficiency; since only a very small difference exists between the methods, and real-time evaluation is often required at the application level for face recognition, this matters.<br />
* There are loss functions that receive more than two inputs. For example, the ''triplet loss'' function, developed by Google, takes three inputs: a positive input, a negative input, and an anchor input. This makes sense because, for face recognition, we want the model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper could extend its scope to loss functions that take more than two inputs.<br />
* It would also be good to know what the training time is like for the method, specifically the "Reward-Guided Search", which uses RL. The authors also mention some data preprocessing that was performed; was the same preprocessing applied for the methods they compared against?<br />
* Sections on Data Processing and Results can be improved. About the datasets, I have some questions about why they are divided in the current fashion. It is mentioned that CASIA-WebFace and MS-Celeb-1M-v1c are used as training datasets, but the comparison of algorithms is divided into three groups: MegaFace and Trillion-Pairs, RFW, and a group of other datasets. In general, when comparing algorithms, we want a holistic view of how each algorithm compares, so I have some concerns about dividing the results into three sections; more explanation could be provided. It also seems that Random-Softmax and Search-Softmax outperform all other algorithms across all datasets, so it would make even more sense to have one big table including all the results. About data preprocessing, giving more information about which noisy data were removed would be nice.<br />
* Despite the thorough comparison of each method against the proposed one, the paper does not give a reason why each was better or worse. This need not be a mathematical explanation; an intuitive one would demonstrate how the result can be replicated and whether certain conditions are required to achieve it. <br />
* Did the paper address why the average model performs worse on African faces? Could it be a lack of data points?<br />
* We have a graph showing the training loss of Random-Softmax and Search-Softmax against the number of epochs, from which we may deduce the number of epochs used in later graphs. But since one of the main claimed features is that "our optimization strategy enables that the dynamic loss can guide the model training of different epochs, which helps further boost the discrimination power," it is imperative that the results are comparable along the same scale (for example, for 20 epochs, take the average of the losses).<br />
* The result summary is overwhelming with numbers, and the presentation of results is lacking; it would be great if the results were explained. The introduction of the model and its components is also lacking and could be expanded.<br />
* It would be better if the paper contained some face recognition visualization, i.e. showed actual face recognition examples to demonstrate the improvement.<br />
* The introduction of the data and the analysis of data processing are important because there might be some limitations. Also, it would be better to give a theoretical analysis of the effects of reducing the softmax probability and of the number of sampled models, to explain why the parameter updates give better performance.<br />
* It would be better to include time performance in the evaluation section.<br />
* The paper is missing details on the datasets. It would be better to know whether the datasets were balanced or unbalanced and how this would affect accuracy. Also, computational comparisons between the new loss function and traditional methods would be interesting to know.<br />
* The paper included a dataset that measures racial bias; however, it is widely known that the majority of face recognition models are themselves trained on biased and imbalanced datasets. For example, an AI may be biased toward classifying a Black person as a prisoner if its training set of prisoners is predominantly Black. A question that remains unanswered is how training a model with the proposed loss function helps combat racial bias in machine learning, and how these results in particular improved (or worsened) with its use.<br />
* There is too much detail in the conclusion; a brief conclusion of a few sentences should be enough to present the ideas.<br />
* The authors could add the time efficiency of face recognition to the results, to compare the models with other current facial recognition models, since many applications that use face recognition rely on fast recognition (e.g. unlocking a phone with Face ID).<br />
*It is interesting to see how the loss function impacts face recognition. It would be better to see different evaluations on different datasets; not only is accuracy important, but efficiency is also significant.<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>
Y2587wan
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_Recognition&diff=49624
Loss Function Search for Face Recognition
2020-12-07T00:58:04Z
<p>Y2587wan: /* Results and Discussion */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face with a specific identity. The field involves two tasks: 1. identifying and classifying a face to a certain identity, and 2. verifying whether a face image and another face image map to the same identity. Loss functions play an important role in evaluating how well a prediction models the given data. In face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. A discriminative feature is one that successfully discriminates the labeled data, and is typically a result of feature engineering/selection. However, the traditional softmax loss lacks the power of feature discrimination. To address this, a center loss was developed that learns a center for each identity to enhance intra-class compactness, and this paper introduces a new loss function that uses a scale parameter to produce higher gradients for well-separated samples, which reduces the softmax probability. <br />
<br />
Margin-based (angular, additive, and additive angular margin) softmax loss functions are important for learning discriminative features in face recognition. Several hand-crafted methods requiring considerable effort have previously been developed, such as A-Softmax, V-Softmax, AM-Softmax, and Arc-Softmax. Li et al. proposed an AutoML method for loss function search, known as AM-LFS, from a hyper-parameter optimization perspective [2]. It leverages reinforcement learning to search for loss functions during the training process, though its drawback is a complex and unstable search space.<br />
<br />
'''Softmax'''<br />
<br />
The softmax probability is the probability assigned to each class: a vector of values that sum to 1, each ranging between 0 and 1. Cross-entropy loss is the negative of the target values times the log of the probabilities. When the softmax probability is combined with the cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_kx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_k},x})}}}}</math> [1] </center><br />
<br />
where <math> \cos{(\theta_{{w_k},x})} = w^T_k x </math> is the cosine similarity (with <math>w_k</math> and <math>x</math> normalized) and <math>\theta_{{w_k},x}</math> is the angle between <math> w_k</math> and <math>x</math>. The features learnt with this softmax loss are prone to be separable (as desired), but not sufficiently discriminative.<br />
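As a small worked example of <math>L_2</math>: when the true-class cosine is well above the others the loss is near zero, while equal cosines give <math>\log K</math>. The function below is a plain-Python sketch, not the paper's code; the scale <math>s=30</math> is only an illustrative value.

```python
import math

def softmax_loss(cosines, y, s=30.0):
    """L_2: negative log of the scaled-softmax probability of the true
    class y, given cos(theta_{w_k, x}) for all K classes."""
    logits = [s * c for c in cosines]
    m = max(logits)                                   # subtract max for numerical stability
    denom = sum(math.exp(z - m) for z in logits)
    return -((logits[y] - m) - math.log(denom))
```

With cosines (0.8, 0.2, 0.1) and y = 0 the loss is nearly zero (well separated); with all cosines equal it is log K, as expected for a K-way uniform prediction.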
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_k},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss:''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math>, where <math>m_1 \geq 1</math> is an integer.<br />
<br />
'''Arc-Softmax Loss:''' <math>f{(m_2,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where <math>m_2 > 0</math>.<br />
<br />
'''AM-Softmax Loss:''' <math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where <math>m_1 \geq 1</math> is an integer and <math>m_2, m_3 > 0</math>.<br />
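The margin functions above can be written directly; for angles in the usual range each satisfies <math>f(m,\theta) \leq \cos\theta</math>, which is what shrinks the true-class logit in <math>L_3</math>. A sketch with illustrative (not paper-specified) hyperparameter values:

```python
import math

def a_softmax(theta, m1=2):
    """A-Softmax margin: f = cos(m1 * theta), m1 >= 1 an integer."""
    return math.cos(m1 * theta)

def arc_softmax(theta, m2=0.5):
    """Arc-Softmax margin: f = cos(theta + m2), m2 > 0."""
    return math.cos(theta + m2)

def am_softmax(theta, m1=1, m2=0.0, m3=0.35):
    """Margin of the form listed above: f = cos(m1 * theta + m2) - m3."""
    return math.cos(m1 * theta + m2) - m3
```

For a typical angle such as theta = 0.7 rad, each of these is below cos(theta), so substituting any of them for cos(theta) in the true-class logit of <math>L_3</math> lowers the softmax probability of the true class.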
<br />
<br />
<br />
In this paper, the authors first identify that reducing the softmax probability is the key to enhancing feature discrimination, and design two search spaces (a random and a reward-guided method). They then evaluate their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms on nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs trained with metric-learning loss functions such as contrastive loss or triplet loss. Without carefully designed sample mining strategies, the computational cost of these functions is high. This drawback prompted the redesign of the classical softmax loss, which cannot sufficiently discriminate features. Multiple softmax loss functions have since been developed, including margin-based formulations, which often require fine-tuning of parameters and are susceptible to instability; researchers therefore need to put considerable effort into crafting a method within the large design space. AM-LFS takes an optimization approach to selecting hyperparameters for margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}\,p</math></center><br />
<center> where <math>a=1-e^{s\left(\cos{(\theta_{{w_y},x})}-f{(m,\theta_{{w_y},x})}\right)}</math> and <math>a \leq 0</math></center><br />
<br />
<math>a</math> is considered a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the specific margin function <math>f</math>, what all margin-based losses have in common is that they reduce the softmax probability; reducing the softmax probability is the key to their success.<br />
<br />
Compared to AM-LFS, this method involves only one constrained parameter (<math>a</math>), whereas AM-LFS has 2M unconstrained parameters that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because they can exceed the softmax probability.<br />
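The modulating function is easy to check numerically: for any <math>a \leq 0</math> and <math>p \in (0,1]</math>, <math>h(a,p) \in (0,1]</math>, so the modulated probability never exceeds the original softmax probability. A minimal sketch:

```python
def h(a, p):
    """Modulating function h(a, p) = 1 / (a*p + (1 - a)); for a <= 0 and
    p in (0, 1], the denominator is >= 1, so h lies in (0, 1]."""
    return 1.0 / (a * p + (1.0 - a))

def modulated_prob(a, p):
    """p_m = h(a, p) * p: the reduced (margin-like) softmax probability."""
    return h(a, p) * p
```

For example, a = -2 and p = 0.9 give h = 1/1.2 and p_m = 0.75; a = 0 recovers the plain softmax probability.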
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-\log{(h{(a,p)}\cdot p)}</math> where <math>h \in (0,1]</math> and <math>a \leq 0</math></center><br />
<br />
This encourages a feature margin between different classes and provides the capability of feature discrimination. The search space is then defined as the choice of <math>h{(a,p)}</math>, whose impact on the training procedure is decided by the modulating factor <math>a</math>. To validate the unified formulation, a modulating factor is set randomly at each training epoch; this is denoted Random-Softmax in the paper.<br />
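Random-Softmax then amounts to drawing a fresh modulating factor at the start of every epoch and training with the unified loss <math>L_5</math>. A sketch (the sampling interval for <math>a</math> is an illustrative assumption, not a value from the paper):

```python
import math
import random

def random_softmax_loss(p, a):
    """Unified loss L_5 = -log(h(a, p) * p), with
    h(a, p) = 1 / (a*p + (1 - a)) folded in."""
    return -math.log(p / (a * p + (1.0 - a)))

def sample_factor(low=-10.0, high=0.0):
    """Draw a fresh modulating factor a <= 0 at the start of each epoch;
    the interval [-10, 0] is an illustrative choice."""
    return random.uniform(low, high)
```

At a = 0 this reduces to the ordinary softmax loss; more negative a gives a larger loss for the same p, i.e. a stronger margin-like effect.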
<br />
=== Reward-Guided Search ===<br />
Random search provides no guidance for training. To address this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioural learning model: it does not need labelled input/output pairs, and it does not need sub-optimal actions to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking actions that maximize a notion of cumulative reward [3]. The process of RL is shown in Figure 1. The cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> is the cumulative reward, <math>R_t</math> is the immediate reward at time <math>t</math>, and <math>T</math> is the end of the episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is used to learn the distribution of the softmax hyperparameter <math>a</math>, parameterized by its mean <math>\mu</math>, via the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>, and <math>B</math> models are trained, yielding rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> is updated after each epoch using the reward function: <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
where <math>g(a_i; \mu, \sigma)</math> is the PDF of a Gaussian distribution. The distribution of <math>{a}</math> is updated, and the best model is selected from the <math>{B}</math> candidates for the next epoch.<br />
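This update is a standard score-function (REINFORCE) gradient step: for a Gaussian, <math>\nabla_\mu \log g(a;\mu,\sigma) = (a-\mu)/\sigma^2</math>. A one-function sketch; the default <math>\sigma</math> and <math>\eta</math> values are illustrative, not the paper's.

```python
def update_mu(mu, samples, rewards, sigma=0.2, eta=0.05):
    """One REINFORCE step on the Gaussian mean:
    mu <- mu + eta * (1/B) * sum_i R(a_i) * (a_i - mu) / sigma**2,
    using grad_mu log g(a; mu, sigma) = (a - mu) / sigma**2."""
    B = len(samples)
    grad = sum(r * (a - mu) / sigma ** 2 for a, r in zip(samples, rewards)) / B
    return mu + eta * grad
```

Samples with high rewards pull <math>\mu</math> toward their value of <math>a</math>; low-reward samples contribute little.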
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for face recognition problems. That is, there shall be no overlapping identities between training and testing sets. As a result, there were a total of 15,414 identities removed from the testing sets. For fairness during comparison, all summarized results will be based on refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boosts the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discrimination power. Also, the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e^{-3}</math> on MegaFace, the identification TPR@FAR = <math>1e^{-6}</math> and the verification TPR@FAR = <math>1e^{-9}</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all other alternative methods. On MegaFace, Search-Softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a similar trend with Trillion-Pairs where Search-Softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
The paper discussed that in order to enhance feature discrimination for face recognition, it is crucial to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* The reported test accuracy is questionable, as only the clean versions of CASIA-WebFace and MS-Celeb-1M-v1c were used for training, rather than these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications, such as helping police identify disguised suspects. However, good data preprocessing is needed to obtain good predictions, and the authors did not describe their preprocessing, which is a key part of this model.<br />
* It would be better to know what kinds of noise were removed in the clean version. Also, simply discarding the overlapping data is wasteful; it would be better to assign it to either the training or the test sample.<br />
* The paper indicates that the new search method and loss function yield more effective face recognition than the six other methods, but there is no mention of the change in computational efficiency. Since the accuracy differences between the methods are very small, and real-time evaluation is often required in face recognition applications, this matters.<br />
* Some loss functions receive more than two inputs. For example, the ''triplet loss'' function, developed by Google, takes three inputs: a positive input, a negative input, and an anchor input. This makes sense because for face recognition we want the model to learn not only what it is supposed to predict but also what it is not supposed to predict; triplet loss typically handles false positives much better. This paper could extend its scope to such loss functions.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
* The sections on data processing and results could be improved. It is mentioned that CASIA-WebFace and MS-Celeb-1M-v1c are used as training datasets, but the comparison of algorithms is divided into three groups: MegaFace and TrillionPairs, RFW, and a group of other datasets. When comparing algorithms, we generally want a holistic view of how each performs, so dividing the results into three sections deserves more explanation. It also seems that Random-Softmax and Search-Softmax outperform all other algorithms across all datasets, so a single large table including all results would make even more sense. Regarding data preprocessing, more information about which noisy data were removed would be welcome.<br />
* Despite the thorough comparison between each method and the proposed one, the paper does not explain why each was better or worse. The explanation need not be mathematical; an intuitive one would show how the results can be replicated and whether they require particular conditions to achieve. <br />
* The graph of training loss for Random-Softmax and Search-Softmax uses the number of epochs as the independent variable, from which we may deduce the number of epochs used in later graphs. But since one of the main claimed features is that "Meanwhile, our optimization strategy enables that the dynamic loss can guide the model training of different epochs, which helps further boost the discrimination power," it is imperative that results are compared on the same scale (for example, over 20 epochs, taking the average of the losses).<br />
* Did the paper address why the average model performs worse on African faces? Could this be due to a lack of data points?<br />
* The result summary is overwhelmed with numbers, and the presentation of results is lacking; it would be helpful to explain them. The introduction of the model and its components is also thin and could be expanded.<br />
* It would be better if the paper contained some face recognition visualizations, i.e., actual recognition examples showing the improvement.<br />
* An introduction to the data and an analysis of the data processing are important, because there may be limitations there. It would also be better to give a theoretical analysis of the effects of reducing the softmax probability and of the number of sampled models, to explain how the parameter updates lead to better performance.<br />
* It would be better to include time performance in the evaluation section.<br />
* The paper is missing details on the datasets. It would be better to know whether they were balanced or unbalanced and how this would affect accuracy. Computational comparisons between the new loss function and traditional methods would also be interesting.<br />
* The paper included a dataset that measures racial bias; however, it is widely known that the majority of face recognition models are themselves trained on biased and imbalanced datasets. For example, an AI may be biased towards classifying a black person as a prisoner if its training set of prisoners is predominantly black. An unanswered question is how training a model with the proposed loss function helps combat racial bias in machine learning, and how these particular results improved (or worsened) with its use.<br />
<br />
* There is too much data in the conclusion; a brief conclusion of a few sentences should be enough to present the ideas.<br />
* The authors could add the time efficiency of face recognition to the results to compare against other current facial recognition models, since many applications that use face recognition rely on fast recognition (e.g., unlocking a phone with Face ID).<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors&diff=49617Improving neural networks by preventing co-adaption of feature detectors2020-12-07T00:06:25Z<p>Y2587wan: /* Critiques */</p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Improvement Intro =<br />
'''Drop Out Model'''<br />
<br />
In this paper, Hinton et al. introduce a novel way to improve neural networks’ performance, particularly in the case where a large feedforward neural network is trained on a small training set, which causes poor generalization and leads to overfitting. This problem can be reduced by randomly omitting half of the feature detectors on each training case. By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training, so there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and averaging their predictions on the test set. <br />
<br />
The intuition for dropout is that if neurons are randomly dropped during training, they can no longer rely on their neighbours, thus allowing each neuron to become more robust. Another interpretation is that dropout is similar to training an ensemble of models, since each epoch with randomly dropped neurons can be viewed as its own model. <br />
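As a rough sketch of the mechanism described above (a hypothetical NumPy illustration, not the authors' code), training-time dropout multiplies the hidden activations by a random binary mask; the test-time branch scales activations by the keep probability, which for the next linear layer is equivalent to the weight halving described in the "Mean Network" paragraph below:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, drop_prob=0.5, train=True):
    # Training: omit each hidden unit independently with probability drop_prob.
    if train:
        mask = rng.random(h.shape) >= drop_prob   # True = unit kept
        return h * mask
    # Test: keep every unit and scale by the keep probability.
    return h * (1.0 - drop_prob)

h = np.ones((4, 8))                               # toy hidden activations
out_train = dropout_forward(h, 0.5, train=True)   # roughly half the entries zeroed
out_test = dropout_forward(h, 0.5, train=False)   # every entry scaled to 0.5
```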
<br />
They used the standard stochastic gradient descent algorithm and separated the training data into mini-batches. An upper bound was set on the L2 norm of the incoming weight vector for each hidden neuron; the vector was renormalized whenever its norm exceeded the bound. They found that using a constraint, rather than a penalty, forced the model to do a more thorough search of the weight-space when coupled with a very large learning rate that decays during training.<br />
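The max-norm constraint can be sketched as follows (a hedged NumPy illustration assuming weights are stored with one column per hidden unit; the bound of 15 is the value reported for MNIST):

```python
import numpy as np

def renorm_weights(W, max_norm=15.0):
    # Constrain the L2 norm of each hidden unit's incoming weight vector
    # (a column of W): columns whose norm exceeds the bound are scaled back.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

W = np.random.default_rng(2).normal(0, 10, size=(784, 800))
W = renorm_weights(W, max_norm=15.0)   # applied after each weight update
```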
<br />
'''Mean Network'''<br />
<br />
At test time, the dropout model includes all of the hidden neurons, with their outgoing weights halved to account for the chance of omission. This is called the 'mean network'. It is similar to taking the geometric mean of the probability distributions predicted by all <math>2^N</math> dropout networks. The mean network assigns the correct answers a higher log probability than a typical individual dropout network, which also leads to a lower squared error. <br />
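For a single linear layer, halving the outgoing weights reproduces the expected pre-activation of the sampled dropout networks; this small NumPy check illustrates the idea (an illustrative sketch with hypothetical sizes, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(0, 0.01, size=(100, 50))   # outgoing weights of a hidden layer
h = rng.random(100)                        # hidden unit activations

# Average pre-activation over many sampled dropout networks (50% dropout)...
masks = rng.random((20000, 100)) < 0.5     # True = unit kept
mc_mean = ((h * masks) @ W).mean(axis=0)

# ...is closely matched by one pass of the "mean network" with halved weights.
mean_net = h @ (W / 2)
gap = np.max(np.abs(mc_mean - mean_net))   # small Monte Carlo error
```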
<br />
<br />
The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), with the same dropout rates: 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective function as the loss function. Weights were updated after each mini-batch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set to 10.0, multiplied by the decay factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector of each hidden neuron was constrained to have length at most <math>l</math>; cross-validation showed the results were best with <math>l</math> = 15. Initial weights were drawn from a normal distribution with mean 0 and standard deviation 0.01. To accelerate learning, a momentum variable <math>p</math> was used when updating weights. Its initial value was 0.5, increasing linearly to the final value 0.99 during the first 500 epochs and remaining unchanged thereafter; when updating weights, the learning rate was multiplied by <math>1 - p</math>. <math>L</math> denotes the gradient of the loss function.<br />
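The learning-rate and momentum schedules just described can be sketched as follows (a hedged reconstruction using the constants quoted in this section; the function name is hypothetical):

```python
def mnist_schedule(epoch, eps0=10.0, f=0.998, p0=0.5, p_final=0.99, ramp=500):
    # Exponentially decaying learning rate: multiplied by f each epoch.
    eps = eps0 * (f ** epoch)
    # Momentum ramps linearly from 0.5 to 0.99 over the first 500 epochs.
    if epoch < ramp:
        p = p0 + (p_final - p0) * epoch / ramp
    else:
        p = p_final
    # The learning rate applied to the update is multiplied by (1 - p).
    effective_lr = eps * (1 - p)
    return eps, p, effective_lr

print(mnist_schedule(0))      # (10.0, 0.5, 5.0)
print(mnist_schedule(500)[1]) # 0.99
```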
<br />
[[File:weights_mnist2.png|center|400px]]<br />
<br />
The best published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and different L2 constraints for each hidden unit input weight. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, and it was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, and it resulted in 103, 97, 94, 93 and 88 when the model was fine-tuned using standard backpropagation and was unrolled. They were reduced to 83, 79, 78, 78, and 77 when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
TIMIT dataset includes voice samples of 630 American English speakers varying across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, they pretrained their neural network as a deep belief network, with the first layer built using a Gaussian Restricted Boltzmann Machine (RBM). Visible biases were initialized to zero, and weights were sampled from a normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Learning was performed by minimizing Contrastive Divergence (CD). Momentum, used to speed up learning, was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 was applied to the average gradient, multiplied by <math>(1-momentum)</math>, and L2 weight decay was set to 0.001. With these hyperparameters, the model finished training after 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Then, <math>p</math> was set as the mean activation of a neuron in the data set and the visible bias of each neuron was initialized to <math>log(p/(1 − p))</math>. Each layer was trained for 50 epochs, with all remaining hyper-parameters the same as those for the Gaussian RBM.<br />
<br />
=== Dropout tuning ===<br />
<br />
The initial weights of the neural network were set from the pretrained RBMs. To finetune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly to 0.9 over 10 epochs. The model used a small constant learning rate of 1.0, applied to the average gradient on each minibatch, and retained all other hyperparameters from the MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison purposes, they also finetuned the same network with standard backpropagation, using a learning rate of 0.1 with the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A neural network was constructed to measure the classification error rate on the TIMIT test set. It had four fully-connected hidden layers with 4000 neurons per layer, and an output layer of 185 softmax neurons merged into 39 distinct classes. After constructing the neural network, 21 adjacent frames with an advance of 10 ms per frame were given as input.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents belonging to 103 topics. Under four major themes – corporate/industrial, economics, government/social, and markets – they belong to 63 classes. After removing 11 classes with no data and one class with insufficient data, 50 classes and 402,738 documents remain. The documents were divided equally and randomly into training and test sets, with each document represented by the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks, with size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as that in MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see the significant improvement in test set error by the model with dropout. On the right side, we see that learning with dropout also proceeds more smoothly. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons, where each neuron in a layer applies a linear filter to the input image data and passes the result on to the neurons in the next layer. A neuron’s output is computed by applying the filter plus a scalar bias and then a nonlinear activation function; the filters and biases are parameters of the network learned from training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between convolutional neural networks and ordinary neural networks; the figure above gives a visual representation of a CNN. First, a CNN’s neurons are organized topographically into banks laid out on a 2D grid, reflecting the organization of the dimensions of the input data. Second, neurons in a CNN apply filters that are local and centered at the neuron’s location in the topographic organization, meaning that useful clues for identifying the object in an input image can be found by examining local neighborhoods of the image. Third, all neurons in a bank apply the same filter at different locations in the input image. In the image example, green is the input to one neuron bank, yellow is the filter bank, and pink is the output of one neuron bank (the convolved feature). A bank of neurons in a CNN applies a convolution operation (a filter) to its input; a single layer typically has multiple banks of neurons, each performing a convolution with a different filter, and the resulting neuron banks become distinct input channels into the next layer. This weight sharing reduces the net’s representational capacity, but also reduces the capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
<br />
=== Pooling ===<br />
<br />
The pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling its output. Pooling is useful for extracting dominant features and for decreasing the computational power required to process the data through dimensionality reduction. The output from the convolutional layer is divided into sections called pooling units, laid out topographically and connected to a local neighborhood of other pooling units from the same convolutional output. Each pooling unit then computes some function of its inputs, typically the maximum or the average: max pooling returns the maximum value from the section of the image covered by the pooling unit, while average pooling returns the average of all the values inside it (see example). As a result, there are fewer pooling units than convolutional unit outputs from the previous layer, due to the larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization. With overlapping pooling, the spacing between pooling units (usually referred to as the stride) is smaller than the size of the neighborhood that they summarize; with this variant, the pooling layer produces a coarse coding of the outputs, which helps generalization. <br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
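A minimal NumPy sketch of the two pooling functions (illustrative only; calling `pool2d(x, 3, 2)` would give the overlapping 3×3, stride-2 pooling used later for CIFAR-10):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    # Slide a size x size window over the 2D feature map x with the given
    # stride, taking the max or the mean of each patch.
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6.,  7.,  8.],
              [ 9., 10., 11., 12.],
              [13., 14., 15., 16.]])
print(pool2d(x, 2, 2, "max"))   # [[ 6.  8.] [14. 16.]]
print(pool2d(x, 2, 2, "avg"))   # [[ 3.5  5.5] [11.5 13.5]]
```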
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers which are implemented in lateral form and used on neurons with unbounded activations and permits the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position in the topographic organization. The constants <math>N</math>, <math>\alpha</math> and <math>\beta</math> are hyper-parameters whose values are determined using a validation set. This technique has since been replaced by better techniques, such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>).<br />
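Since the exact formula appears only as an image above, the following is a hedged NumPy sketch of a common form of across-bank response normalization (the constant <math>k</math> and the edge handling of the window are assumptions, not taken from the paper):

```python
import numpy as np

def local_response_norm(a, N=5, k=1.0, alpha=1e-4, beta=0.75):
    # a has shape (banks, H, W). Each activity is divided by a term that
    # grows with the squared activities of N adjacent banks at the same
    # spatial position, encouraging competition among banks.
    C = a.shape[0]
    out = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - N // 2), min(C, i + N // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

a = np.ones((8, 4, 4))
b = local_response_norm(a)   # same shape, each activity slightly suppressed
```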
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons in this model use the max-with-zero nonlinearity, where the output of a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math>, where <math> z^i_{x,y} </math> is the total input to the neuron. This nonlinearity has several advantages over traditional saturating neuron models, such as a significant reduction in the training time required to reach a given error rate. It also reduces the need for contrast normalization and data pre-processing, since the neurons do not saturate: activities simply scale up gradually with large input values. As the model’s only pre-processing step, the mean activity is subtracted from each pixel, yielding centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective which is the same as minimizing the average cross-entropy across training cases between the true label and the model’s predicted label.<br />
<br />
=== Weight Initialization === <br />
<br />
It is important to note that if a neuron always receives a negative total input during training, it will not learn, because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in the model were sampled from a zero-mean normal distribution with a high enough variance that some neurons receive positive input and learning can happen. In practice, it is necessary to try several candidate variances until a working initialization is found; in their experiments, setting the biases of the hidden-layer neurons to a positive constant (1) was helpful.<br />
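A small NumPy illustration of this initialization scheme (hypothetical layer sizes; the point is that positive biases keep most max-with-zero units active at the start of training):

```python
import numpy as np

rng = np.random.default_rng(3)

# Weights drawn from a zero-mean normal; the variance must be large enough
# that a reasonable fraction of units receive positive total input.
W = rng.normal(0.0, 0.01, size=(256, 128))
b = np.ones(128)                   # positive bias constant, as described above
x = rng.random(256)                # a toy input vector (hypothetical)

z = np.maximum(0.0, x @ W + b)     # max-with-zero nonlinearity
active = (z > 0).mean()            # fraction of units that can learn
```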
<br />
=== Training ===<br />
<br />
The model is trained using stochastic gradient descent with a batch size of 128 samples and momentum of 0.9. The update rule for weight <math>w</math> is <math> v_{i+1} = 0.9 v_i + \epsilon \left\langle \frac{dE}{dw} \right\rangle_i</math>, <math>w_{i+1} = w_i + v_{i+1}</math>, where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate, and <math>\left\langle \frac{dE}{dw} \right\rangle_i</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w</math>. The whole training process takes roughly 90 minutes on CIFAR-10; ImageNet takes 4 days with dropout and two days without.<br />
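The update rule above can be written directly in NumPy (an illustrative sketch; the gradient is added rather than subtracted because, per the objective-function section, the objective is being maximized):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, eps=0.01, momentum=0.9):
    # One step of the rule above:
    #   v_{i+1} = 0.9 * v_i + eps * <dE/dw>_i
    #   w_{i+1} = w_i + v_{i+1}
    # where grad is the batch-averaged derivative of the objective.
    v = momentum * v + eps * grad
    w = w + v
    return w, v

w = np.zeros(3)
v = np.zeros(3)
w, v = sgd_momentum_step(w, v, np.array([1.0, 1.0, 1.0]))  # w -> 0.01
w, v = sgd_momentum_step(w, v, np.array([1.0, 1.0, 1.0]))  # w -> 0.029
```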
<br />
=== Learning ===<br />
To determine the learning rate for the network, one starts with an equal learning rate for each layer, chosen as the power of ten that produces the largest reduction in the objective function; it is usually on the order of <math>10^{-2}</math> or <math>10^{-3}</math>. In this case, the learning rate is reduced twice by a factor of ten before training is terminated.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
The CIFAR-10 dataset is a subset of the Tiny Images dataset, with 10 classes and incorrect labels removed. It contains 5000 training images and 1000 testing images per class. The 32 x 32 color images were gathered from the web, each labeled with the noun used to search for it.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without, were built to test the performance of dropout on CIFAR-10. Both models are CNNs with three convolutional layers, each followed by a pooling layer. All of the pooling layers use a stride of 2 and summarize a 3×3 neighborhood. The pooling layer following the first convolutional layer performs max pooling, while the remaining two perform average pooling. The first and second pooling layers are followed by response normalization layers with <math>N = 9</math>, <math>\alpha = 0.001</math>, and <math>\beta = 0.75</math>. A ten-unit softmax layer, used to output a probability distribution over class labels, is connected to the upper-most pooling layer. All convolutional layers have 64 filter banks with a filter size of 5×5.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
Thus, the neural network with 3 convolutional hidden layers and 3 max-pooling layers achieved a classification error of 16.6%, beating the best published error rate of 18.5% obtained without using transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer produced an error rate of 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of high-resolution images labeled across 1000 different categories. The data were collected from the web and manually labeled using Amazon's Mechanical Turk crowd-sourcing tool.<br />
Because this dataset has millions of labeled images in thousands of categories, perfect accuracy is very difficult to achieve even for humans, since ImageNet images may contain multiple objects and there is a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000 images): about 1.3 million training images, 50,000 validation images, and 150,000 testing images. Resized images of 256 x 256 pixels were used for the experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best score on this dataset was an error rate of 45.7%, achieved by high-dimensional signature compression for large-scale image classification (J. Sanchez, F. Perronnin, CVPR 2011). The authors achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers interleaved with max-pooling layers, followed by two globally connected layers and a final 1000-way softmax layer. Applying 50% dropout to the 6th layer brought the error rate down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They mostly focused on the model with dropout; the one without used a similar approach but had a serious overfitting issue. They used a convolutional neural network trained on 224×224 patches randomly extracted from the 256 × 256 images, which reduced the network’s capacity to overfit the training data and helped generalization as a form of data augmentation. For testing, the predictions of the net on ten 224 × 224 patches of the 256 × 256 input image were averaged: patches at the center and the four corners, plus their horizontal reflections. To maximize performance on the validation set, this complicated network architecture was used, and dropout was found to be very effective. It was also demonstrated that non-convolutional higher layers with a large number of parameters worked well with dropout, but had a negative impact on performance without it.<br />
<br />
The network contains seven weight layers. The first five are convolutional, and the last two are globally-connected. Max-pooling layers follow the layer number 1,2, and 5. And then, the output of the last globally-connected layer was fed to a 1000-way softmax output layers. Using this architecture, the authors achieved the error rate of 48.6%. When applying 50% dropout to the 6th layer, the error rate was brought down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
As with the previous datasets (MNIST, TIMIT, Reuters, and CIFAR-10), we also see a significant improvement on the ImageNet dataset. Even for complicated architectures like this one, introducing dropout makes the model generalize better and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
The authors comment on a theory that sexual reproduction limits biological function to a small number of coadapted genes. The idea is that a given organism is unlikely to receive many coordinated genes from a parent, so will likely die if it relies on many genes to perform a given task. This limits the number of genes required to perform a function, which is like a built-in evolutionary dropout.<br />
<br />
= Critiques =<br />
It is a very clever idea to drop out half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. Another paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf], gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent overfitting of LSTMs.<br />
<br />
This paper focused mostly on computer vision tasks; it would be interesting to have some discussion of NLP tasks.<br />
<br />
Classification with the "dropout" CNN method (omitting neurons in hidden layers) is a very interesting topic. If the authors briefly explained the theoretical advantages of this method for processing image data, it would be easier for readers to understand. A discussion of how the method handles the overfitting issue would also be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but that more extreme probabilities tended to be worse which is why a dropout rate of 50% was used in the paper. The authors further develop this point to mention that the method can be improved by adapting individual dropout probabilities of each hidden or input unit using validation tests. This would be an interesting area to further develop and explore, as using a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
The authors don't explain that during training, at each layer where we apply dropout, the surviving activations must be scaled by 1/(1-p), where p is the dropout rate (equivalently, the weights can be multiplied by the keep probability at test time) - this way the expected value of each layer is the same at train and test time. They may have considered another solution for this discrepancy at the time (it is an old paper), but it doesn't seem like any solution was presented here. <br />
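One common way to handle this train/test discrepancy is the "inverted dropout" formulation, which scales surviving activations by 1/(1-p) at training time so that the test-time layer is simply the identity. A minimal sketch (the function name and interface are illustrative, not from the paper):<br />

```python
import random

def inverted_dropout(values, p, training=True, rng=random):
    """Inverted dropout: drop each unit with probability p and scale
    survivors by 1/(1-p), so the expected activation is the same at
    training and test time (at test time the layer does nothing)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

With p = 0.5 each kept unit is doubled, so averaged over many units the activation level is unchanged.<br />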
<br />
Despite the advantages of using dropout to prevent overfitting and reducing errors in testing, the authors did not discuss much about the effects on the length of training time. In another [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf paper] published a few years later by the same authors, there was more discussion about this. It appears that dropout increases training time by 2-3 times compared to a standard NN with the same architecture, which is a drawback that might be worth mentioning.<br />
<br />
Dropout layers prevent overfitting by randomly dropping out a specified fraction of the neurons in each layer. Since the neurons to be dropped in each layer are selected at random, it might be the case that some important features are discarded, leading to a sudden drop in performance. Although this rarely happens, and a CNN with dropout rates of roughly 50% in each layer generally performs well, future improvements are still possible if we can select dropout neurons more cleverly.<br />
<br />
The article does a good job of analyzing the benefit of the standard dropout method, but I think it would be beneficial to look at other dropout variants. For example, the paper may have benefited from looking at DropConnect, which was introduced by L. Wan et al. and is similar to dropout but applies the drop not directly to the neurons but to the weights and biases linking the neurons. Others they could have looked at are Standout, Pooling Drop, and MaxDrop. Comparing various dropout methods would, I think, greatly add to the paper.<br />
<br />
The authors analyzed the dropout method for addressing overfitting problems. The key idea is to randomly drop units from the neural network during training, which prevents units from co-adapting too much. In addition, it also speeds up training, since fewer neurons are active at each step, which is a good idea. <br />
<br />
Random dropping was indeed quite effective on the MNIST classification challenge; however, it may be questionable when the problem has very few features to begin with.<br />
<br />
The authors mentioned that they used momentum to speed up training but didn't show the alternative or its speed. This [https://link.springer.com/article/10.1007/s11042-019-08453-9 paper] conducts an empirical study of dropout vs. batch normalization and compares different optimizers (like SGD, which uses momentum) for each technique. It finds that optimizers with momentum outperform adaptive optimizers, but at the cost of significantly longer training times.<br />
<br />
Dropout is a very popular technique to improve accuracy by avoiding overfitting. It might be interesting to see how it compares to other techniques and how the combination of techniques works.<br />
<br />
== Other Work ==<br />
<br />
In modern training, dropout is often not advised for convolutional neural networks because it does not have the same effect or interpretation on spatial feature maps as it does on dense features; features in CNNs are spatially correlated. There is an interesting paper on DropBlock [2], a dropout variant that drops entire contiguous regions of features, which has been shown to be much more effective for CNNs.<br />
<br />
== Reference ==<br />
[1] N. Srivastava et al., "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.<br />
<br />
[2] Ghiasi, Golnaz and Lin, Tsung-Yi and Le, Quoc V. "DropBlock: A regularization method for convolutional networks". NeurIPS, 2018.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining&diff=49616Describtion of Text Mining2020-12-06T23:49:12Z<p>Y2587wan: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid<br />
<br />
== Introduction ==<br />
This paper focuses on different text mining techniques and the applications of text mining in the healthcare and biomedical domain. The text mining field has become popular as a result of the amount of text data available in different forms. Text data was expected to grow roughly 50-fold between 2010 and 2020. Text is unstructured information, which is easy for humans to construct and understand but difficult for machines. Hence, there is a need to design algorithms to effectively process this avalanche of text. To further explore the text mining field, the related text mining approaches can be considered. The different text mining approaches relate to two main methods: knowledge discovery and traditional data mining methods. <br />
<br />
The authors note that knowledge discovery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge discovery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As brought out by the authors, text mining approaches focus on the exploration of information from a specific text. The information explored is in the form of structured, semi-structured, and unstructured text. It is important to note that text mining covers different sets of algorithms and topics that include information retrieval. The topics and algorithms are used for analyzing different text forms.<br />
<br />
==Text Representation and Encoding ==<br />
The authors review multiple methods of preprocessing text, including four methods used to preprocess documents and to recognize the influence and frequency of individual groups of words. In many text mining algorithms, one of the key components is preprocessing. Preprocessing consists of different tasks that include tokenization, filtering, lemmatization, and stemming. The first step is tokenization, where a character sequence is broken down into different words or phrases. After the breakdown, filtering is carried out to remove some words. The various inflected forms of words are then grouped together through lemmatization, and the roots of derived words are obtained through stemming.<br />
<br />
'''1. Tokenization'''<br />
<br />
This process splits text (i.e. a sentence) into single units of words, known as tokens, while removing unnecessary characters. Tokenization relies on identifying word boundaries - that is, the ending of one word and the beginning of the next, usually separated by a space. Characters such as punctuation are removed and the text is split at space characters. An example of this would be converting the string "This is my string" to "This", "is", "my", "string".<br />
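A minimal sketch of this in Python, using a simple regular expression (real tokenizers handle many more edge cases, such as hyphenation and abbreviations):<br />

```python
import re

def tokenize(text):
    """Break a character sequence into word tokens, splitting at
    whitespace/punctuation boundaries and discarding the punctuation."""
    return re.findall(r"\w+(?:'\w+)?", text)
```

For example, `tokenize("This is my string")` yields the four tokens from the example above.<br />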
<br />
'''2. Filtering'''<br />
<br />
Filtering is a process by which unnecessary words or characters are removed. Often these include punctuation, prepositions, and conjunctions. The resulting corpus then contains the words with maximal importance for distinguishing between classes.<br />
<br />
'''3. Lemmatization'''<br />
<br />
Lemmatization is a task where the various inflected forms of a word are converted to a single form. However, unlike in stemming (see below), we must specify the part of speech (POS) of each word, i.e. its intended meaning in the given sentence or document, which can be prone to error. For example, "geese" and "goose" have the same lemma "goose", as they have the same meaning.<br />
<br />
'''4. Stemming'''<br />
<br />
Stemming extracts the roots of words. It is a language-dependent process. The goal of stemming is to reduce inflectional and related (definition-wise) forms of a word to a common base form. An example of this would be reducing "studying" to "study" and "studies" to "studi".<br />
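A toy illustration of suffix stripping (these rules are invented for illustration; real stemmers such as the Porter stemmer use far more careful rule sets with conditions on the remaining stem):<br />

```python
def simple_stem(word):
    """Strip one common inflectional suffix, keeping at least a
    three-letter stem. A crude sketch of Porter-style stemming."""
    for suffix in ("ing", "edly", "es", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Note that a stemmer only chops suffixes, so "studies" becomes the non-word "studi" - unlike lemmatization, which would map it to the dictionary form "study".<br />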
<br />
'''Vector Space Model'''<br />
In this section of the paper, the authors explore the different ways in which the text can be represented on a large collection of documents. One common way of representing the documents is in the form of a bag of words. The bag of words considers the occurrences of different terms.<br />
In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. <br />
The authors note that the three basic models used are the vector space model, the inference network model, and probabilistic models. The vector space model represents documents by converting them into vectors, in which each term is assigned a weight indicating its importance in the document. <br />
<br />
Two main weighting schemes are used: the Boolean model and the TF-IDF model: <br />
'''Boolean model'''<br />
A term is assigned a positive weight <math>w_{ij}</math> if it appears in the document; otherwise, it is assigned a weight of 0. <br />
<br />
'''Term Frequency - inverse document frequency (TF-IDF)'''<br />
The words are weighted using the TF-IDF scheme computed as <br />
<br />
$$<br />
q(w)=f_d(w)*\log{\frac{|D|}{f_D(w)}}<br />
$$<br />
<br />
The frequency of each term is normalized by the inverse document frequency, which boosts the weight of distinctive terms that appear in only a few documents. Each document is represented by a vector of term weights, <math>\omega(d) = (\omega(d, w_1), \omega(d,w_2),...,\omega(d,w_v))</math>. The similarity between two documents <math>d_1, d_2</math> is commonly measured by cosine similarity:<br />
$$
S(d_1,d_2) = \cos(\theta) = \frac{d_1\cdot d_2}{\sqrt{\sum_{i=1}^v w^2_{1i}}\cdot\sqrt{\sum_{i=1}^v w^2_{2i}}}
$$
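The TF-IDF weighting and cosine similarity can be sketched in a few lines of Python (a minimal illustration over tokenized documents; libraries such as scikit-learn provide production implementations):<br />

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights q(w) = f_d(w) * log(|D| / f_D(w))
    for each document in a list of token lists."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # f_D(w): number of docs containing w
    vectors = []
    for doc in docs:
        tf = Counter(doc)          # f_d(w): count of w in this doc
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(v1[w] * v2.get(w, 0.0) for w in v1)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

A term appearing in every document gets weight zero, since log(|D|/f_D(w)) = log(1) = 0, reflecting that it carries no distinguishing information.<br />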
<br />
== Classification ==<br />
Classification in Text Mining aims to assign predefined classes to text documents. For a set <math>\mathcal{D} = \{d_1, d_2, \ldots, d_n\}</math> of documents, each <math>d_i</math> is mapped to a label <math>l_i</math> from the set <math>\mathcal{L} = \{l_1, l_2, \ldots, l_k\}</math>. The goal is to find a classification model <math>f</math> such that: <math>\\</math><br />
$$
f: \mathcal{D} \rightarrow \mathcal{L} \quad \quad \quad f(d) = l
$$<br />
The author illustrates 4 different classifiers that are commonly used in text mining.<br />
<br />
<br />
'''1. Naive Bayes Classifier''' <br />
<br />
Bayes' rule is used to classify new examples by selecting the most probable class. <br />
The Naive Bayes Classifier models the distribution of documents in each class using a probabilistic model, assuming that the distributions<br />
of different terms are independent of each other. The models commonly used with this classifier estimate the posterior probability of a class given the term distribution; they assume that documents are generated by a mixture model parameterized by <math>\theta</math> and compute the likelihood of a document as a sum of probabilities over all mixture components. In addition, the Naive Bayes Classifier can help get around the curse of dimensionality, which may arise with high-dimensional data such as text. <br />
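A compact sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing - an illustrative formulation, not the exact one in the paper:<br />

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit multinomial Naive Bayes: class priors plus per-class
    word counts, over a list of token lists and their labels."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
        vocab.update(doc)
    return class_docs, word_counts, vocab

def classify_nb(doc, model):
    """Pick the class maximizing log P(class) + sum log P(word|class),
    treating word occurrences as independent given the class."""
    class_docs, word_counts, vocab = model
    n = sum(class_docs.values())
    best, best_score = None, float("-inf")
    for label, count in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n)  # class prior
        for w in doc:
            # add-one smoothing avoids log(0) for unseen words
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

Working in log-space avoids numerical underflow when multiplying many small probabilities.<br />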
<br />
'''2. Nearest Neighbour Classifier'''<br />
<br />
The Nearest Neighbour Classifier uses distance-based measures to perform the classification. Documents belonging to the same class are more likely to be "similar", i.e. close to each other under the similarity measure. The classification of a test document is inferred from the class labels of similar documents in the training set. K-Nearest Neighbour classification is well known to suffer from the "curse of dimensionality", as the volume of each <math>d</math>-sphere surrounding a data point, relative to the volume of the sample space, shrinks exponentially in <math>d</math>. <br />
<br />
'''3. Decision Tree Classifier'''<br />
<br />
A decision tree classifier builds a hierarchical tree over the training instances, in which conditions on attribute values are used to divide the data hierarchically. The decision tree recursively partitions the training data set into smaller subdivisions based on a set of tests defined at each node or branch. Each node of the tree is a test of some attribute of the training instance, and each branch descending from the node corresponds to one of the values of this attribute. The conditions on the nodes are commonly defined by the terms in the text documents.<br />
<br />
'''4. Support Vector Machines'''<br />
<br />
SVMs are a form of linear classifier, i.e. models that make a classification decision based on the value of a linear combination of the document's features. The output of a linear predictor is defined as <math> y=\vec{a} \cdot \vec{x} + b</math>, where <math>\vec{x}</math> is the normalized document word-frequency vector, <math>\vec{a}</math> is a vector of coefficients, and <math>b</math> is a scalar. A Support Vector Machine attempts to find a linear separator between the various classes. An advantage of the SVM method is that it is robust to high dimensionality.<br />
<br />
== Clustering ==<br />
Clustering has been extensively studied in the context of the text as it has a wide range of applications such as visualization and document organization.<br />
<br />
Clustering algorithms are used to group similar documents and thus aid information retrieval. Text clustering can operate at different levels of granularity, where the clustered units are documents, paragraphs, sentences, or terms. Since text data has numerous distinctive characteristics that demand the design of text-specific algorithms, simply representing a text document as a binary vector is not enough. Some unique properties of text representation are:<br />
<br />
1. Text representation has a large dimensionality, in which the size of the vocabulary from which the documents are drawn is massive, but a document might only contain a small number of words.<br />
<br />
2. The words in a document are usually correlated with each other, and algorithms need to take this correlation into consideration.<br />
<br />
3. The number of words differs from one document to another, so documents need to be normalized before the clustering process.<br />
<br />
The three most commonly used text clustering algorithms are presented below.<br />
<br />
<br />
'''1. Hierarchical Clustering algorithms''' <br />
<br />
Hierarchical clustering algorithms build a set of clusters that can be depicted as a hierarchy. The hierarchy can be constructed top-down (divisive) or bottom-up (agglomerative). Hierarchical clustering algorithms are distance-based clustering algorithms, i.e., they use a similarity function to measure the closeness between text documents.<br />
<br />
In the top-down approach, the algorithm begins with one cluster that includes all the documents, and we recursively split this cluster into sub-clusters.<br />
The bottom-up approach instead builds the hierarchy from the individual elements by progressively merging clusters. As an example of a hierarchical clustering algorithm, suppose the data are to be clustered by Euclidean distance and we have six elements {a}, {b}, {c}, {d}, {e}, and {f}. The first step determines which elements to merge into a cluster by taking the two closest elements, according to the chosen distance.<br />
<br />
<br />
[[File:418px-Hierarchical clustering simple diagram.svg.png| 300px | center]]<br />
<br />
<br />
<div align="center">Figure 1: Hierarchical Clustering Raw Data</div><br />
<br />
<br />
<br />
[[File:250px-Clusters.svg (1).png| 200px | center]]<br />
<br />
<br />
<div align="center">Figure 2: Hierarchical Clustering Clustered Data</div><br />
<br />
A main advantage of hierarchical clustering is that the algorithm only needs to be run once for any number of clusters (i.e., if an individual wishes to use a different number of clusters than originally intended, they do not need to repeat the algorithm).<br />
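The bottom-up merging loop can be sketched for one-dimensional points with single linkage (a minimal illustration of the idea, not an efficient implementation):<br />

```python
def agglomerate(points, k):
    """Bottom-up (agglomerative) clustering sketch: start with one
    cluster per point and repeatedly merge the two closest clusters
    (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]

    def dist(c1, c2):
        # single linkage: distance between the closest pair of members
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters
```

Recording the order of merges (rather than stopping at k) yields the full dendrogram shown in the figures above.<br />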
<br />
'''2. k-means Clustering'''<br />
<br />
k-means clustering is a partitioning algorithm that partitions n documents in the context of text data into k clusters.<br />
<br />
Input: Documents D, similarity measure S, number of clusters k<br />
Output: Set of k clusters<br />
Select randomly ''k'' datapoints as starting centroids<br />
While ''not converged'' do <br />
Assign documents to the centroids based on the closest similarity<br />
Calculate the cluster centroids for all clusters<br />
return ''k clusters''<br />
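The pseudocode above can be translated almost line-for-line into Python for one-dimensional data (a sketch; real implementations add smarter initialization such as k-means++):<br />

```python
import random

def kmeans(points, k, rng=random):
    """k-means for 1-D data: pick k random points as starting
    centroids, then alternate assignment and centroid updates
    until the centroids stop changing."""
    centroids = rng.sample(points, k)
    while True:
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # recompute centroids as cluster means
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            return clusters
        centroids = new_centroids
```

Each iteration can only decrease the within-cluster distance, which is why the loop is guaranteed to converge.<br />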
<br />
The main disadvantage of k-means clustering is that it is very sensitive to the initial choice of centroids and of the number of clusters k. Also, since the algorithm runs until the clusters converge, k-means clustering tends to take longer than hierarchical clustering. On the other hand, the advantages of k-means clustering are that it is simple to implement, it scales well to large datasets, and its results are easily interpretable.<br />
<br />
<br />
'''3. Probabilistic Clustering and Topic Models'''<br />
<br />
Topic modeling is one of the most popular probabilistic clustering algorithms in recent studies. The main idea is to create a ''probabilistic generative model'' for the corpus of text documents. In topic models, documents are a mixture of topics, where each topic represents a probability distribution over words.<br />
<br />
There are two main topic models:<br />
* Probabilistic Latent Semantic Analysis (pLSA)<br />
* Latent Dirichlet Allocation (LDA)<br />
<br />
The paper covers LDA in more detail. LDA is a state-of-the-art unsupervised algorithm for extracting topics from a collection of documents.<br />
<br />
Let <math>\mathcal{D} = \{d_1, d_2, \cdots, d_{|\mathcal{D}|}\}</math> be the corpus and <math>\mathcal{V} = \{w_1, w_2, \cdots, w_{|\mathcal{V}|}\}</math> be the vocabulary of the corpus. <br />
<br />
A topic <math>z_j, 1 \leq j \leq K</math>, is a multinomial probability distribution over the <math>|\mathcal{V}|</math> words. <br />
<br />
The distribution of words in a given document is:<br />
<br />
<math>p(w_i|d) = \Sigma_{j=1}^K p(w_i|z_j)p(z_j|d)</math><br />
<br />
The LDA assumes the following generative process for the corpus of <math>\mathcal{D}</math><br />
* For each topic <math>k\in \{1,2,\cdots, K\}</math>, sample a word distribution <math>\phi_k \sim Dir(\beta)</math><br />
* For each document <math>d \in \{1,2,\cdots,D\}</math><br />
** Sample a topic distribution <math>\theta_d \sim Dir(\alpha)</math><br />
** For each word <math>w_n, n \in \{1,2,\cdots,N\}</math> in document <math>d</math><br />
*** Sample a topic <math>z_i \sim Mult(\theta_d)</math><br />
*** Sample a word <math>w_n \sim Mult(\phi_{z_i})</math><br />
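The generative story above can be simulated directly. The sketch below samples symmetric Dirichlets via normalized Gamma draws using only the standard library; the hyperparameter defaults are arbitrary illustrations:<br />

```python
import random

def dirichlet(alpha, dim, rng=random):
    """Sample from a symmetric Dirichlet(alpha) via normalized Gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng=random):
    """Draw one index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def lda_generate(n_docs, doc_len, n_topics, vocab_size,
                 alpha=0.1, beta=0.01, rng=random):
    """Run LDA's generative process: sample per-topic word
    distributions phi_k ~ Dir(beta), then for each document a topic
    distribution theta_d ~ Dir(alpha), a topic z for each word
    position, and finally the word w from phi_z."""
    phi = [dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        theta = dirichlet(alpha, n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # topic for this word position
            w = sample_index(phi[z], rng)  # word drawn from that topic
            doc.append(w)
        docs.append(doc)
    return docs
```

Inference (e.g. collapsed Gibbs sampling or variational Bayes) runs this story in reverse, recovering phi and theta from observed documents.<br />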
<br />
In practice, LDA is often used as a module in more complicated models and has already been applied to a wide variety of domains. In addition, many variations of LDA have been created, including supervised LDA (sLDA) and hierarchical LDA (hLDA).<br />
<br />
== Information Extraction ==<br />
Information Extraction (IE) is the process of extracting useful, structured information from unstructured or semi-structured text, automatically pulling out exactly the pieces of information we ask for. <br />
<br />
For example, from the sentence “XYZ company was founded by Peter in the year of 1950”, we can identify the two following relations:<br />
<br />
1. Founderof(Peter, XYZ)<br />
<br />
2. Foundedin(1950, XYZ)<br />
<br />
IE is a crucial step in data mining and has a broad variety of applications, such as web mining and natural language processing. Among all the IE tasks, two have become increasingly important: named entity recognition and relation extraction.<br />
<br />
The authors mention four components that are important for Information Extraction:<br />
<br />
'''1. Named Entity Recognition(NER)'''<br />
<br />
This is the process of identifying real-world entities in free text, such as "Apple Inc.", "Donald Trump", or "PlayStation 5". Moreover, the task includes identifying the category of each entity: for example, "Apple Inc." is a company, "Donald Trump" is a person (a U.S. president), and "PlayStation 5" is an entertainment system. <br />
<br />
'''2. Hidden Markov Model'''<br />
<br />
Since traditional probabilistic classification does not consider the predicted labels of neighbouring words, the Hidden Markov Model is often used for Information Extraction. This model is different because it lets the label of one word depend on the label that came before it. The Hidden Markov Model describes, for a sequence of labels <math> Y= (y_1, y_2, \cdots, y_n) </math> and a sequence of observations <math> X= (x_1, x_2, \cdots, x_n) </math>,<br />
<br />
<center><br />
<math><br />
y_i \sim p(y_i|y_{i-1}) \qquad x_i \sim p(x_i|y_i)
</math><br />
</center><br />
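As an illustrative sketch of this dependence structure, the following samples label and observation sequences from an HMM, where each label depends on the previous label (transition) and each observation depends on the current label (emission). All names and the interface here are illustrative:<br />

```python
import random

def hmm_sample(n, init, trans, emit, rng=random):
    """Sample (Y, X) of length n from an HMM given an initial label
    distribution, a transition matrix p(y_i | y_{i-1}), and an
    emission matrix p(x_i | y_i), all as lists of probabilities."""
    def draw(probs):
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

    ys, xs = [], []
    y = draw(init)
    for _ in range(n):
        ys.append(y)
        xs.append(draw(emit[y]))  # observation depends on current label
        y = draw(trans[y])        # next label depends on current label
    return ys, xs
```

In IE, inference goes the other way: given the observed words X, the Viterbi algorithm recovers the most likely label sequence Y.<br />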
<br />
'''3. Conditional Random Fields'''<br />
<br />
This is a technique widely used in Information Extraction, and its definition comes from graph theory. <br />
Let <math>G = (V, E)</math> be a graph and let <math>Y_v</math> be random variables indexed by the vertices of <math>G</math>. Then <math>(X, Y)</math> is a conditional random field when the random variables <math>Y_v</math>, conditioned on <math>X</math>, obey the Markov property with respect to the graph:<br />
<math>p(Y_v |X, Y_w , w \neq v) = p(Y_v |X, Y_w , w \sim v)</math>, where <math>w \sim v</math> means <math>w</math> and <math>v</math> are neighbours in <math>G</math>.<br />
<br />
'''4. Relation Extraction'''<br />
<br />
This is a task of finding semantic relationships between word entities in text documents - for example, in a sentence such as "Seth Curry is the brother of Stephen Curry", the task is to identify the relationship between these two named entities. There are currently numerous techniques for relation extraction, but the most common is to treat it as a classification problem: given two entities that occur in a sentence, classify their relation into one of a fixed set of relation types.<br />
<br />
== Biomedical Application ==<br />
<br />
Text mining has several applications in the domain of biomedical sciences. The explosion of academic literature in the field has made it quite hard for scientists to keep up with novel research. This is why text mining techniques are ever so important in making the knowledge digestible.<br />
<br />
The text mining techniques are able to extract meaningful information from large data by making use of biomedical ontology, which is a compilation of a common set of terms used in an area of knowledge. The Unified Medical Language System (UMLS) is the most comprehensive such resource, consisting of definitions of biomedical jargon. Several information extraction algorithms rely on the ontology to perform tasks such as Named Entity Recognition (NER) and Relation Extraction.<br />
<br />
NER involves locating and classifying biomedical entities into meaningful categories and assigning semantic representations to those entities. NER methods can be broadly grouped into dictionary-based, rule-based, and statistical approaches. NER tasks are challenging in the biomedical domain for three key reasons: (1) there is a continuously growing volume of semantically related entities due to continuous scientific progress, so NER systems depend on dictionaries of terms which can never be complete; (2) there are often numerous names for the same concept, such as "heart attack" and "myocardial infarction"; and (3) acronyms and abbreviations are frequently used, which makes it complicated to identify the concepts these terms express. Note that purely dictionary-based approaches are therefore insufficient on their own for advanced NER methods. <br />
<br />
Relation extraction, on the other hand, is the process of determining relationships between the entities. This is accomplished mainly by identifying the correlation between entities through analyzing the frequency of terms, as well as rules defined by domain experts. Moreover, modern algorithms are also able to summarize large documents and answer natural language questions posed by humans.<br />
<br />
Summarization is a common biomedical text mining task that largely builds on information extraction. The idea is to automatically identify significant aspects of documents and represent them in a coherent fashion. However, evaluating summarization methods is very difficult, since deciding whether a summary is "good" is often subjective, although there are some automatic evaluation techniques for summaries, such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares automatically generated summaries with those created by humans.<br />
<br />
== Conclusion ==<br />
<br />
This paper gave a holistic overview of the methods and applications of text mining, particularly its relevance in the biomedical domain. It highlights several popular algorithms and summarizes them along with their advantages, limitations and some potential situations where they could be used. Because of ever-growing data, for example, the very high volume of scientific literature being produced every year, the interest in this field is massive and is bound to grow in the future.<br />
<br />
== Critiques==<br />
<br />
This is a very detailed approach to introduce some different algorithms on text mining. Since many algorithms are given, it might be a good idea to compare their performances on text mining by training them on some text data and compare them to the former baselines, to see if there exists any improvement.<br />
<br />
It is a detailed summary of the techniques used in text mining. It would be more helpful if some datasets were included for training and testing. Since the algorithms are grouped by topic, different datasets and measurements would be required.<br />
<br />
It would be better for the paper to include test accuracy on training and testing sets to support the claim that text mining is more efficient and effective than other techniques. Moreover, this paper mentions that the text mining approach can be used to extract high-quality information from videos. Extracting from videos is presumably much more difficult than from images and text; how can good test accuracy be retained for videos?<br />
<br />
Text mining impacts not only organizational processes but also the ability to be competitive. Some common example applications are risk management, cybercrime prevention, customer care service, and contextual advertising.<br />
<br />
Preprocessing is an important step in analyzing text, so it would be better to have more details about it. For example, what types of words are usually removed, and should we record the relative position of each word in the sentence? If closely related sentences were split into two, how can we capture their relations?<br />
<br />
The authors could give more details on the applications of text mining in the healthcare and biomedical domain - for example, how the preprocessing, classification, clustering, and information extraction processes could be applied to this domain. Beyond introducing existing algorithms (e.g. NER), the authors could provide more information about how they perform (with a sample dataset), what their limitations are, and comparisons among different algorithms.<br />
<br />
In the preprocessing section, it seems like the authors incorrectly describe what stemming is - stemming just removes the last few letters of a word (e.g. studying -> study, studies -> studi). What the authors actually describe is lemmatization, which is much more informative than stemming. The downside of lemmatization is that it takes more effort to build a lemmatizer than a stemmer, and even once built, it is slow in comparison with a stemmer.<br />
<br />
One of the challenges of text mining in the biomedical field is that a lot of patient data are still in the form of paper documents. Text mining can speed up the digitization of patient data and allow for the development of disease diagnosis algorithms. It'll be interesting to see how text mining can be integrated with healthcare AI such as the doppelganger algorithm to enhance question answering accuracy. (Cresswell et al, 2018)<br />
<br />
It might be helpful if the authors discussed more of the accuracy-wise performance of some text mining techniques, especially in the healthcare and biomedical domain, given the focus. It would be interesting to know more about the level of accuracy needed to produce reliable and actionable information in such fields. Also, in these domains, a false negative can sometimes be more harmful than a false positive, as in a clinical misdiagnosis. It might be helpful to discuss a bit more how to combat such issues in text mining.<br />
<br />
This is a survey paper that covers many general aspects of text mining without going into any single one in detail. Overall it's interesting. My first feedback is on the information extraction section of the paper. The Hidden Markov Model is mentioned as one of the algorithms used there, yet it makes the strong assumption that, given the current state, the next state is independent of all previous states. This is a very strong assumption to make for text, as words in a sentence usually have strong connections to each other. This limitation should be discussed more extensively in the paper. Also, the overall structure of the paper seems a bit imbalanced: it solely discusses applications in the biomedical sciences, yet these techniques have applications in many different areas and subjects.<br />
<br />
This paper surveys multiple methods and algorithms for text mining - more specifically, information extraction, text classification, and clustering. In the Information Extraction section, four possible methods are mentioned for dealing with different kinds of semantic text. In recent machine learning studies, it is ubiquitous to see multiple methods or algorithms combined to achieve better performance. For a survey paper, it would be more interesting to see some connections between the four methods, and some insights such as how the accuracy of extracting precise information could be boosted by combining two of the four methods.<br />
<br />
It would be better to discuss more applications and state-of-the-art algorithms for each task; the paper gives only one biomedical application, with NER, which is too brief.<br />
<br />
The summary is well-organized and gives enough information to first-time readers about text mining and different algorithms to model the data and predict using different classifiers. However, it would be better to add comparison between each classifier since the performance is important to know.<br />
<br />
This is a great informational summary, I don't have much critiques to give. But, I wanted to point out that many modern techniques ignore so many of these interesting data transformations and preprocessing steps, since the text in its raw form provides the most information for deep models to extract features from. Specifically, we can look at ULM-Fit (https://arxiv.org/abs/1801.06146) and BERT (https://arxiv.org/abs/1810.04805) and observe very little text preprocessing outside of tokenization, and simply allowing the model to learn the necessary features from a huge corpus.<br />
<br />
It might be better to explain more about Knowledge Discovery and Data Mining in the introduction, such as giving definitions and a comparison between them, so that the audience can understand text mining more clearly.<br />
<br />
The paper and corresponding summary seems to be more breadth-focused and extremely high-level. I think this paper could've been taken a step further by including applications of the various algorithms. For example, the task of topic modelling which is highly customizable, and preprocessing which is dependent on the domain, can be accomplished using many approaches: transforming the processed texts to feature vectors using BERT or pre-trained Word2Vec; and then applying unsupervised learning methods such as LDA and clustering. Within the biomedical application mentioned, it might be of interest to look into BioBERT, which is trained on more domain-specific texts (https://koreauniv.pure.elsevier.com/en/publications/biobert-a-pre-trained-biomedical-language-representation-model-fo). <br />
<br />
The paper summary is great as it describes very important topic in today's world and how technology must adapt to the vast increase in data creation. Particularly, it is really cool to see how machine learning is used in a multi-disciplinary manner.<br />
<br />
Text-mining from this paper is described as a very compute-intensive process. Due to "a 50 times growth since 2010" to 2020, it would be nice to have a method of scaling the data in this domain to be more prepared for an even bigger growth of data in the next decade. Thus it would have been nice if the researchers included performance metrics (both computation and classification performances) of the system with different classifiers described in this paper. Lastly, it would be nice to see comparisons of ROUGE metrics for summarization that the researchers were able to achieve using the Text Mining technique they introduced.<br />
<br />
The topic is detailed. There are many factors that could affect the results of text mining since the writing habit in terms of vocabulary and the way a person construct his/her sentence could be quite different. Also, since now is the Internet era, there are always new words invented from the Internet, and some of them might be recorded in the dictionary as a casual vocabulary. A good machine learning model should be able to keep learning new words as human and correctly distinguish text through the literature structures like metaphors,slangs and folklore.<br />
<br />
This is a very detailed survey paper. It describes many classical models related to text mining. However, it would be good to at least mention some newer models as further reading, to give a direction to readers who are new to this area.<br />
<br />
</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining&diff=49614Describtion of Text Mining2020-12-06T23:43:38Z<p>Y2587wan: /* Text Representation and Encoding */</p>
<hr />
<div>== Presented by == <br />
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid<br />
<br />
== Introduction ==<br />
This paper focuses on different text mining techniques and their applications in the healthcare and biomedical domain. The text mining field has become popular as a result of the sheer amount of text data available in different forms: the volume of text data was projected to grow roughly 50-fold between 2010 and 2020. Text is unstructured information, which is easy for humans to construct and understand but difficult for machines, so there is a need to design algorithms that can effectively process this avalanche of text. Text mining approaches fall under two main methods: knowledge delivery and traditional data mining methods. <br />
<br />
The authors note that knowledge delivery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge delivery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As brought out by the authors, text mining approaches focus on the exploration of information from a specific text. The information explored is in the form of structured, semi-structured, and unstructured text. It is important to note that text mining covers different sets of algorithms and topics that include information retrieval. The topics and algorithms are used for analyzing different text forms.<br />
<br />
==Text Representation and Encoding ==<br />
The authors review multiple methods of preprocessing text, including four tasks used to prepare documents and to recognize the influence and frequency of individual words and groups of words in a document. In many text mining algorithms, preprocessing is a key component. It consists of tasks that include tokenization, filtering, lemmatization, and stemming. The first step is tokenization, where a character sequence is broken down into words or phrases. Filtering is then carried out to remove unnecessary words. The various inflected forms of a word are grouped together through lemmatization, and the roots of derived words are obtained through stemming.<br />
<br />
'''1. Tokenization'''<br />
<br />
This process splits text (i.e. a sentence) into a single unit of words, known as tokens while removing unnecessary characters. Tokenization relies on identifying word boundaries, that is ending of a word and beginning of another word, usually separated by space. Characters such as punctuation are removed and the text is split at space characters. An example of this would be converting the string "This is my string" to "This", "is", "my", "string".<br />
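The splitting step above can be sketched in a few lines of Python (a minimal illustration; production systems typically use tokenizers from libraries such as NLTK or spaCy): <br />

```python
import re

def tokenize(text):
    # Split at word boundaries and drop punctuation by keeping only
    # alphanumeric runs (apostrophes kept for contractions).
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("This is my string."))  # → ['This', 'is', 'my', 'string']
```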
<br />
'''2. Filtering'''<br />
<br />
Filtering is a process by which unnecessary words or characters are removed. Often these include punctuation, prepositions, and conjugations. The resulting corpus then contains words with maximal importance in distinguishing between classes.<br />
<br />
'''3. Lemmatization'''<br />
<br />
Lemmatization is a task where the various inflected forms of a word are converted to a single form. However, unlike in stemming (see below), we must specify the part of speech (POS) of each word, i.e. its intended meaning in the given sentence or document, which can be prone to human error. For example, "geese" and "goose" have the same lemma "goose", as they have the same meaning.<br />
<br />
'''4. Stemming'''<br />
<br />
Stemming extracts the roots of words. It is a language-dependent process. The goal of stemming is to reduce inflectional and derivationally related forms of a word to a common base form by stripping suffixes. For example, "connection", "connections", and "connected" all reduce to the stem "connect". Unlike in lemmatization, the resulting stem need not be a dictionary word (e.g. "studies" becomes "studi").<br />
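A toy suffix-stripping stemmer illustrates the idea (real systems use, e.g., the Porter stemmer; the suffix list below is an illustrative assumption, not a complete rule set): <br />

```python
def stem(word):
    # Strip the longest matching suffix, keeping a stem of at least
    # 3 letters. Note the stem need not be a dictionary word.
    for suffix in ("ations", "ation", "ions", "ion", "ings", "ing",
                   "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["connection", "connections", "connected", "connecting"]])
# → ['connect', 'connect', 'connect', 'connect']
```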
<br />
'''Vector Space Model'''<br />
In this section of the paper, the authors explore the different ways in which the text can be represented on a large collection of documents. One common way of representing the documents is in the form of a bag of words. The bag of words considers the occurrences of different terms.<br />
In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. <br />
The authors note that the three basic models used are the vector space, inference network, and probabilistic models. The vector space model is used to represent documents by converting them into vectors. In the model, each term in a document is assigned a weight that indicates the importance of that term in the document. <br />
<br />
Two main weighting models are used: the Boolean model and the TF-IDF model. <br />
'''Boolean model'''<br />
A term is assigned a positive weight <math>w_{ij} > 0</math> if it appears in the document; otherwise, it is assigned a weight of 0. <br />
<br />
'''Term Frequency - inverse document frequency (TF-IDF)'''<br />
The words are weighted using the TF-IDF scheme computed as <br />
<br />
$$<br />
q(w)=f_d(w)*\log{\frac{|D|}{f_D(w)}}<br />
$$<br />
<br />
The frequency of each term is weighted by the inverse document frequency, which helps distinctive words that appear in few documents be recognized as important. Each document is represented by a vector of term weights, <math>\omega(d) = (\omega(d, w_1), \omega(d,w_2),...,\omega(d,w_v))</math>. The similarity between two documents <math>d_1, d_2</math> is commonly measured by cosine similarity:<br />
$$<br />
S(d_1,d_2) = \cos(\theta) = \frac{d_1\cdot d_2}{\sqrt{\sum_{i=1}^vw^2_{1i}}\cdot\sqrt{\sum_{i=1}^vw^2_{2i}}}<br />
$$<br />
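The TF-IDF weighting and cosine similarity above can be implemented directly (a self-contained sketch on toy documents; the corpus is made up for illustration): <br />

```python
import math

def tfidf_vectors(docs):
    # q(w) = f_d(w) * log(|D| / f_D(w)): term count in the document
    # times the log inverse document frequency.
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(w in d for d in docs) for w in vocab}
    return [[d.count(w) * math.log(n / df[w]) for w in vocab] for d in docs]

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["the cat sat".split(), "the dog sat".split(), "a dog barked".split()]
vecs = tfidf_vectors(docs)
```

Documents that share terms (the first two) score higher than documents with no terms in common. <br />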
<br />
== Classification ==<br />
Classification in Text Mining aims to assign predefined classes to text documents. For a set <math>\mathcal{D} = \{d_1, d_2, ... d_n\}</math> of documents, each <math>d_i</math> is mapped to a label <math>l_i</math> from the set <math>\mathcal{L} = \{l_1, l_2, ... l_k\}</math>. The goal is to find a classification model <math>f</math> such that:<br />
$$<br />
f: \mathcal{D} \rightarrow \mathcal{L} \quad \quad \quad f(d) = l<br />
$$<br />
The author illustrates 4 different classifiers that are commonly used in text mining.<br />
<br />
<br />
'''1. Naive Bayes Classifier''' <br />
<br />
Bayes rule is used to classify new examples and select the class that has the generated result that occurs most often. <br />
Naive Bayes Classifier models the distribution of documents in each class using a probabilistic model that assumes the distribution<br />
of different terms is independent of each other. The models commonly used in this classifier try to find the posterior probability of a class based on the term distribution: they assume that documents are generated by a mixture model parameterized by <math>\theta</math>, and compute the likelihood of a document as the sum of probabilities over all mixture components. In addition, the Naive Bayes Classifier helps get around the curse of dimensionality, which may arise with high-dimensional data such as text. <br />
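A minimal multinomial naive Bayes with Laplace smoothing, a standard concrete variant of the mixture-model view described above (the toy corpus and labels are made up for illustration): <br />

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Count words per class and documents per class.
    vocab = {w for d in docs for w in d}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    return vocab, word_counts, class_counts

def predict_nb(model, doc):
    # Pick the class maximizing log P(class) + sum log P(word | class),
    # with add-one (Laplace) smoothing for unseen words.
    vocab, word_counts, class_counts = model
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for y, cy in class_counts.items():
        total = sum(word_counts[y].values())
        lp = math.log(cy / n)
        for w in doc:
            lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

model = train_nb(
    [["ball", "goal", "score"], ["vote", "election"], ["goal", "match"]],
    ["sport", "politics", "sport"],
)
print(predict_nb(model, ["goal", "score"]))  # → sport
```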
<br />
'''2. Nearest Neighbour Classifier'''<br />
<br />
Nearest Neighbour Classifier uses distance-based measures to perform the classification. The documents which belong to the same class are more likely "similar" or close to each other based on the similarity measure. The classification of the test documents is inferred from the class labels of similar documents in the training set. K-Nearest Neighbor classification is well known to suffer from the "curse of dimensionality", as the proportional volume of each <math>d</math>-sphere surrounding each datapoint, relative to the volume of the sample space, shrinks exponentially in <math>d</math>. <br />
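A k-nearest-neighbour text classifier can be sketched as follows, using cosine similarity over already-vectorized documents (the training vectors and labels are made-up toy values): <br />

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    # train: list of (vector, label). Classify by majority label among
    # the k most cosine-similar training documents.
    neighbours = sorted(train, key=lambda t: cosine(t[0], query), reverse=True)[:k]
    labels = [y for _, y in neighbours]
    return max(set(labels), key=labels.count)

train = [([1, 0, 0], "sport"), ([1, 1, 0], "sport"), ([0, 0, 1], "politics")]
print(knn_predict(train, [1, 0.5, 0], k=3))  # → sport
```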
<br />
'''3. Decision Tree Classifier'''<br />
<br />
A hierarchical tree of the training instances, in which a condition on the attribute value is used to divide the data hierarchically. The decision tree recursively partitions the training data set into smaller subdivisions based on a set of tests defined at each node or branch. Each node of the tree is a test of some attribute of the training instance, and each branch descending from the node corresponds to one of the values of this attribute. The conditions on the nodes are commonly defined by the terms in the text documents.<br />
<br />
'''4. Support Vector Machines'''<br />
<br />
SVM is a form of linear classifier, i.e. a model that makes a classification decision based on the value of a linear combination of the document's features. The output of a linear predictor is defined as <math> y=\vec{a} \cdot \vec{x} + b</math>, where <math>\vec{x}</math> is the normalized document word-frequency vector, <math>\vec{a}</math> is a vector of coefficients, and <math>b</math> is a scalar. A Support Vector Machine attempts to find a linear separator between the various classes. An advantage of the SVM method is that it is robust to high dimensionality.<br />
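The linear decision rule <math> y=\vec{a} \cdot \vec{x} + b</math> can be illustrated with the perceptron, a simpler linear classifier than the SVM (an SVM would instead choose the maximum-margin separator, but the decision rule has the same form; the toy data below is made up): <br />

```python
def train_perceptron(data, epochs=20):
    # data: list of (x, label) with label in {-1, +1}. Learns a, b for
    # the rule sign(a . x + b) by correcting each misclassified point.
    dim = len(data[0][0])
    a, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (sum(ai * xi for ai, xi in zip(a, x)) + b) <= 0:
                a = [ai + y * xi for ai, xi in zip(a, x)]
                b += y
    return a, b

def predict(a, b, x):
    return 1 if sum(ai * xi for ai, xi in zip(a, x)) + b > 0 else -1

data = [([2, 1], 1), ([1, 2], 1), ([-1, -1], -1), ([-2, 0], -1)]
a, b = train_perceptron(data)
print(predict(a, b, [3, 1]))  # → 1
```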
<br />
== Clustering ==<br />
Clustering has been extensively studied in the context of the text as it has a wide range of applications such as visualization and document organization.<br />
<br />
Clustering algorithms are used to group similar documents and thus aid in information retrieval. Text clustering can be done at different levels of granularity, where clusters can be documents, paragraphs, sentences, or terms. Text data has numerous distinctive characteristics that demand the design of text-specific algorithms for the task, so simply using a binary vector to represent the text document is not enough. Here are some unique properties of text representation:<br />
<br />
1. Text representation has a large dimensionality, in which the size of the vocabulary from which the documents are drawn is massive, but a document might only contain a small number of words.<br />
<br />
2. The words in the documents are usually correlated with each other, and this correlation needs to be taken into consideration when designing algorithms.<br />
<br />
3. The number of words differs from one document to another, so documents need to be normalized before the clustering process.<br />
<br />
Three most commonly used text clustering algorithms are presented below.<br />
<br />
<br />
'''1. Hierarchical Clustering algorithms''' <br />
<br />
Hierarchical clustering algorithms build a group of clusters that can be depicted as a hierarchy. The hierarchy can be constructed top-down (divisive) or bottom-up (agglomerative). Hierarchical clustering algorithms are distance-based clustering algorithms, i.e., they use a similarity function to measure the closeness between text documents.<br />
<br />
In the top-down approach, the algorithm begins with one cluster that includes all the documents and recursively splits this cluster into sub-clusters.<br />
Here is an example of a bottom-up (agglomerative) algorithm, where the data is clustered by Euclidean distance. This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a}, {b}, {c}, {d}, {e}, and {f}. The first step determines which elements to merge into a cluster by taking the two closest elements, according to the chosen distance.<br />
<br />
<br />
[[File:418px-Hierarchical clustering simple diagram.svg.png| 300px | center]]<br />
<br />
<br />
<div align="center">Figure 1: Hierarchical Clustering Raw Data</div><br />
<br />
<br />
<br />
[[File:250px-Clusters.svg (1).png| 200px | center]]<br />
<br />
<br />
<div align="center">Figure 2: Hierarchical Clustering Clustered Data</div><br />
<br />
A main advantage of hierarchical clustering is that the algorithm only needs to be run once for any number of clusters (i.e., if an individual wishes to use a different number of clusters than originally intended, they do not need to repeat the algorithm).<br />
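The bottom-up merging process can be sketched as follows (single-linkage on 1-D points for brevity; real document clustering would use a similarity function over term vectors, and the point positions are made-up toy values): <br />

```python
def agglomerate(points, target_clusters):
    # Start with singleton clusters and repeatedly merge the two
    # closest clusters (single linkage: minimum pairwise distance).
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# six 1-D "documents" a..f at these positions
print(agglomerate([1.0, 1.2, 5.0, 5.1, 5.3, 9.0], 3))
# → [[1.0, 1.2], [5.0, 5.1, 5.3], [9.0]]
```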
<br />
'''2. k-means Clustering'''<br />
<br />
k-means clustering is a partitioning algorithm that partitions n documents in the context of text data into k clusters.<br />
<br />
Input: Document D, similarity measure S, number k of cluster<br />
Output: Set of k clusters<br />
Select randomly ''k'' datapoints as starting centroids<br />
While ''not converged'' do <br />
Assign documents to the centroids based on the closest similarity<br />
Calculate the cluster centroids for all clusters<br />
return ''k clusters''<br />
<br />
The main disadvantage of k-means clustering is that it is very sensitive to the initial choice of centroids and of the number of clusters k. Also, since the algorithm runs until the clusters converge, k-means clustering tends to take longer to perform than hierarchical clustering. On the other hand, the advantages of k-means clustering are that it is simple to implement, the algorithm scales well to large datasets, and the results are easily interpretable.<br />
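The pseudocode above translates almost directly into Python (a minimal sketch using squared Euclidean distance as the closeness measure on made-up 2-D vectors): <br />

```python
import random

def kmeans(docs, k, iters=100, seed=0):
    # Random initial centroids, then alternate assigning documents to
    # their closest centroid and recomputing centroids, until stable.
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(doc, centroids[j])))
            for doc in docs
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for j in range(k):
            members = [d for d, c in zip(docs, assign) if c == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 4.9]]
assign, _ = kmeans(docs, 2)
```

The two tight groups end up in separate clusters regardless of which documents are drawn as initial centroids. <br />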
<br />
<br />
'''3. Probabilistic Clustering and Topic Models'''<br />
<br />
Topic modeling is one of the most popular probabilistic clustering algorithms in recent studies. The main idea is to create a ''probabilistic generative model'' for the corpus of text documents. In topic models, documents are a mixture of topics, where each topic represents a probability distribution over words.<br />
<br />
There are two main topic models:<br />
* Probabilistic Latent Semantic Analysis (pLSA)<br />
* Latent Dirichlet Allocation (LDA)<br />
<br />
The paper covers LDA in more detail. LDA is a state-of-the-art unsupervised algorithm for extracting topics from a collection of documents.<br />
<br />
Let <math>\mathcal{D} = \{d_1, d_2, \cdots, d_{|\mathcal{D}|}\}</math> be the corpus and <math>\mathcal{V} = \{w_1, w_2, \cdots, w_{|\mathcal{V}|}\}</math> be the vocabulary of the corpus. <br />
<br />
A topic <math>z_j, 1 \leq j \leq K</math>, is a multinomial probability distribution over the <math>|\mathcal{V}|</math> words. <br />
<br />
The distribution of words in a given document is:<br />
<br />
<math>p(w_i|d) = \Sigma_{j=1}^K p(w_i|z_j)p(z_j|d)</math><br />
<br />
The LDA assumes the following generative process for the corpus of <math>\mathcal{D}</math><br />
* For each topic <math>k\in \{1,2,\cdots, K\}</math>, sample a word distribution <math>\phi_k \sim Dir(\beta)</math><br />
* For each document <math>d \in \{1,2,\cdots,D\}</math><br />
** Sample a topic distribution <math>\theta_d \sim Dir(\alpha)</math><br />
** For each word <math>w_n, n \in \{1,2,\cdots,N\}</math> in document <math>d</math><br />
*** Sample a topic <math>z_i \sim Mult(\theta_d)</math><br />
*** Sample a word <math>w_n \sim Mult(\phi_{z_i})</math><br />
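The generative story above can be simulated with the standard library (Dirichlet samples built from normalized Gamma draws; all sizes and hyperparameters below are arbitrary toy values): <br />

```python
import random

def dirichlet(alpha, dim, rng):
    # Sample from a symmetric Dirichlet via normalized Gamma draws.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def sample(dist, rng):
    # Draw one index from a discrete distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

def generate_corpus(num_docs=3, doc_len=8, K=2, V=6, alpha=0.5, beta=0.1, seed=1):
    # LDA's generative process: phi_k ~ Dir(beta) per topic,
    # theta_d ~ Dir(alpha) per document, then for each word position
    # sample a topic from theta_d and a word from that topic's phi.
    rng = random.Random(seed)
    phi = [dirichlet(beta, V, rng) for _ in range(K)]
    docs = []
    for _ in range(num_docs):
        theta = dirichlet(alpha, K, rng)
        docs.append([sample(phi[sample(theta, rng)], rng) for _ in range(doc_len)])
    return docs

corpus = generate_corpus()  # each document is a list of word indices
```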
<br />
In practice, LDA is often used as a module in more complicated models and has already been applied to a wide variety of domains. In addition, many variations of LDA have been created, including supervised LDA (sLDA) and hierarchical LDA (hLDA).<br />
<br />
== Information Extraction ==<br />
Information Extraction (IE) is the process of extracting useful, structured information from unstructured or semi-structured text. It automatically extracts information that matches pre-specified patterns or rules. <br />
<br />
For example, from the sentence “XYZ company was founded by Peter in the year of 1950”, we can identify the two following relations:<br />
<br />
1. Founderof(Peter, XYZ)<br />
<br />
2. Foundedin(1950, XYZ)<br />
<br />
IE is a crucial step in data mining and has a broad variety of applications, such as web mining and natural language processing. Among all the IE tasks, two have become increasingly important: named entity recognition and relation extraction.<br />
<br />
The author mentions four techniques that are important for Information Extraction.<br />
<br />
'''1. Named Entity Recognition(NER)'''<br />
<br />
This is the process of identifying real-world entities in free text, such as "Apple Inc.", "Donald Trump", "PlayStation 5", etc. Moreover, the task is to identify the category of each entity: for example, "Apple Inc." is a company, "Donald Trump" is a person (the US president), and "PlayStation 5" is an entertainment system. <br />
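One of the simplest NER strategies is greedy longest-match lookup against a dictionary (gazetteer) of known entities; statistical taggers are used in practice, and the gazetteer below is a made-up example: <br />

```python
def dictionary_ner(tokens, gazetteer):
    # Greedy longest-match: at each position, try the longest span
    # first; on a hit, emit the entity and skip past it.
    entities, i = [], 0
    max_len = max(len(name.split()) for name in gazetteer)
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            if span in gazetteer:
                entities.append((" ".join(tokens[i:i + n]), gazetteer[span]))
                i += n
                break
        else:
            i += 1
    return entities

gazetteer = {"apple inc.": "COMPANY", "donald trump": "PERSON",
             "playstation 5": "PRODUCT"}
tokens = "Donald Trump discussed Apple Inc. and the PlayStation 5".split()
print(dictionary_ner(tokens, gazetteer))
# → [('Donald Trump', 'PERSON'), ('Apple Inc.', 'COMPANY'), ('PlayStation 5', 'PRODUCT')]
```

The biomedical challenges discussed later (incomplete dictionaries, synonyms, abbreviations) are exactly where this simple strategy breaks down. <br />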
<br />
'''2. Hidden Markov Model'''<br />
<br />
Since traditional probabilistic classification does not consider the predicted labels of neighbouring words, the Hidden Markov Model is used in Information Extraction. This model differs in that it assumes the label of one word depends on the label of the word before it. Given a sequence of labels <math> Y= (y_1, y_2, \cdots, y_n) </math> and a sequence of observations <math> X= (x_1, x_2, \cdots, x_n) </math>, the model assumes<br />
<br />
<center><br />
<math><br />
y_i \sim p(y_i|y_{i-1}) \qquad x_i \sim p(x_i|y_i)<br />
</math><br />
</center><br />
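Decoding the most likely label sequence under an HMM (where, in the standard formulation, each observation <math>x_i</math> is emitted from its hidden label <math>y_i</math>) is done with the Viterbi algorithm. The probabilities below are made-up toy values for an entity/other tagging task: <br />

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][y] = (probability of the best path ending in state y at
    # time t, and that path itself).
    V = [{y: (start_p[y] * emit_p[y][obs[0]], [y]) for y in states}]
    for x in obs[1:]:
        V.append({
            y: max(
                ((p * trans_p[prev][y] * emit_p[y].get(x, 0.0), path + [y])
                 for prev, (p, path) in V[-1].items()),
                key=lambda t: t[0],
            )
            for y in states
        })
    return max(V[-1].values(), key=lambda t: t[0])[1]

# toy tagging: is each token part of an entity (E) or not (O)?
states = ["E", "O"]
start_p = {"E": 0.3, "O": 0.7}
trans_p = {"E": {"E": 0.6, "O": 0.4}, "O": {"E": 0.2, "O": 0.8}}
emit_p = {"E": {"acme": 0.8, "corp": 0.15, "sued": 0.05},
          "O": {"acme": 0.05, "corp": 0.05, "sued": 0.9}}
print(viterbi(["acme", "corp", "sued"], states, start_p, trans_p, emit_p))
# → ['E', 'E', 'O']
```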
<br />
'''3. Conditional Random Fields'''<br />
<br />
This is a technique that is widely used in Information Extraction. The definition of it is related to graph theory. <br />
Let <math>G = (V, E)</math> be a graph and let <math>Y_v</math> denote the random variable attached to vertex <math>v</math>. Then <math>(X, Y)</math> is a conditional random field when the random variables <math>Y_v</math>, conditioned on <math>X</math>, obey the Markov property with respect to the graph:<br />
<math>p(Y_v |X, Y_w , w \neq v) = p(Y_v |X, Y_w , w \sim v)</math>, where <math>w \sim v</math> means <math>w</math> and <math>v</math> are neighbors in <math>G</math>.<br />
<br />
'''4. Relation Extraction'''<br />
<br />
This is the task of finding semantic relationships between word entities in text documents, for example in a sentence such as "Seth Curry is the brother of Stephen Curry". If a document includes these two names, the task is to identify the relationship between these two entities. There are currently numerous techniques to perform relation extraction, but the most common is to treat it as a classification problem: given two entities that occur in a sentence, classify their relation into fixed relation types.<br />
<br />
== Biomedical Application ==<br />
<br />
Text mining has several applications in the domain of biomedical sciences. The explosion of academic literature in the field has made it quite hard for scientists to keep up with novel research. This is why text mining techniques are ever so important in making the knowledge digestible.<br />
<br />
The text mining techniques are able to extract meaningful information from large data by making use of biomedical ontology, which is a compilation of a common set of terms used in an area of knowledge. The Unified Medical Language System (UMLS) is the most comprehensive such resource, consisting of definitions of biomedical jargon. Several information extraction algorithms rely on the ontology to perform tasks such as Named Entity Recognition (NER) and Relation Extraction.<br />
<br />
NER involves locating and classifying biomedical entities into meaningful categories and assigning semantic representation to those entities. The NER methods can be broadly grouped into Dictionary-based, Rule-based, and Statistical approaches. NER tasks are challenging in the biomedical domain due to three key reasons: (1) There is a continuously growing volume of semantically related entities in the biomedical domain due to continuous scientific progress, so NER systems depend on dictionaries of terms which can never be complete; (2) There are often numerous names for the same concept in the biomedical domain, such as "heart attack" and "myocardial infarction"; and (3) Acronyms and abbreviations are frequently used which makes it complicated to identify the concepts these terms express. Because of these challenges, purely dictionary-based approaches are limited, and more advanced NER methods combine dictionaries with rule-based and statistical techniques. <br />
<br />
Relation extraction, on the other hand, is the process of determining relationships between the entities. This is accomplished mainly by identifying the correlation between entities through analyzing the frequency of terms, as well as rules defined by domain experts. Moreover, modern algorithms are also able to summarize large documents and answer natural language questions posed by humans.<br />
<br />
Summarization is a common biomedical text mining task that largely utilizes information extraction tasks. The idea is to automatically identify significant aspects of documents and represent them in a coherent fashion. However, evaluating summarization methods becomes very difficult since deciding whether a summary is "good" is often subjective, although there are some automatic evaluation techniques for summaries such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares automatically generated summaries with those created by humans.<br />
<br />
== Conclusion ==<br />
<br />
This paper gave a holistic overview of the methods and applications of text mining, particularly its relevance in the biomedical domain. It highlights several popular algorithms and summarizes them along with their advantages, limitations and some potential situations where they could be used. Because of ever-growing data, for example, the very high volume of scientific literature being produced every year, the interest in this field is massive and is bound to grow in the future.<br />
<br />
== Critiques==<br />
<br />
This is a very detailed approach to introduce some different algorithms on text mining. Since many algorithms are given, it might be a good idea to compare their performances on text mining by training them on some text data and compare them to the former baselines, to see if there exists any improvement.<br />
<br />
It is a detailed summary of the techniques used in text mining. It would be more helpful if some datasets were included for training and testing. The algorithms are grouped by different topics, so different datasets and measurements would be required.<br />
<br />
It would be better for the paper to include test accuracy on testing and training sets to support the claim that text mining is a more efficient and effective approach compared to other techniques. Moreover, this paper mentions that the text mining approach can be used to extract high-quality information from videos. It is believed that extracting from videos is much more difficult than from images and text. How is it possible to retain test accuracy at a good level for videos?<br />
<br />
Text mining can not only impact organizational processes, but also the ability to be competitive. Some common example applications are risk management, cybercrime prevention, customer care service, and contextual advertising.<br />
<br />
Preprocessing is an important step in analyzing text, so it would be better to have more details about it. For example, what types of words are usually removed, and should we record the relative position of each word in the sentence? If two closely related sentences were split apart, how can we capture their relation?<br />
<br />
The authors could give more details on the applications of text mining in the healthcare and biomedical domain. For example, how could the preprocessing, classification, clustering, and information extraction processes be applied to this domain? Besides introducing existing algorithms (e.g. NER), the authors could provide more information about how they perform (with a sample dataset), what their limitations are, and comparisons among the different algorithms.<br />
<br />
In the preprocessing section, it seems like the authors incorrectly describe what stemming is - stemming just removes the last few letters of a word (e.g. studying -> study, studies -> studi). What the authors actually describe is lemmatization, which is much more informative than stemming. The downside of lemmatization is that it takes more effort to build a lemmatizer than a stemmer, and even once it is built, it is slow in comparison with a stemmer.<br />
<br />
One of the challenges of text mining in the biomedical field is that a lot of patient data are still in the form of paper documents. Text mining can speed up the digitization of patient data and allow for the development of disease diagnosis algorithms. It'll be interesting to see how text mining can be integrated with healthcare AI such as the doppelganger algorithm to enhance question answering accuracy. (Cresswell et al, 2018)<br />
<br />
It might be helpful if the authors discussed more about the accuracy-wise performance of some text mining techniques, especially in the healthcare and biomedical domain, given the focus. It would be interesting if more information were provided about the level of accuracy needed in order to produce reliable and actionable information in such fields. Also, in these domains, sometimes a false negative could be more harmful than a false positive, such as a clinical misdiagnosis. It might be helpful to discuss a bit more about how to combat such issues in text mining.<br />
<br />
This is a survey paper that talks about many general aspects of text mining, without going into any specific one in detail. Overall it's interesting. My first feedback is on the "Information Retrieval" section of the paper. The hidden Markov model is mentioned as one of the algorithms used for IR. Yet, the hidden Markov model makes the strong assumption that, given the current state, the next state is independent of all previous states. This is a very strong assumption to make in IR, as words in a sentence usually have a very strong connection to each other. This limitation should be discussed more extensively in the paper. Also, the overall structure of the paper seems to be a bit imbalanced. It solely talks about IR's application in biomedical sciences. Yet, IR has applications in many different areas and subjects.<br />
<br />
This paper surveys multiple methods and algorithms for text mining, more specifically, information extraction, text classification, and clustering. In the Information Extraction section, four possible methods are mentioned to deal with different examples of semantic texts. In the latest studies of machine learning, it is ubiquitous to see multiple methods or algorithms combined together to achieve better performance. For a survey paper, it would be more interesting to see some connections between the four methods, and some insights such as how we could boost the accuracy of extracting precise information by combining two of the four methods together.<br />
<br />
It would be better to discuss more applications and state-of-the-art algorithms for each task. The paper just gives one application in the biomedical domain with NER, which is too simple.<br />
<br />
The summary is well-organized and gives first-time readers enough information about text mining and the different algorithms used to model the data and predict with different classifiers. However, it would be better to add a comparison between the classifiers, since their relative performance is important to know.<br />
<br />
This is a great informational summary, I don't have much critiques to give. But, I wanted to point out that many modern techniques ignore so many of these interesting data transformations and preprocessing steps, since the text in its raw form provides the most information for deep models to extract features from. Specifically, we can look at ULM-Fit (https://arxiv.org/abs/1801.06146) and BERT (https://arxiv.org/abs/1810.04805) and observe very little text preprocessing outside of tokenization, and simply allowing the model to learn the necessary features from a huge corpus.<br />
<br />
It might be better to explain more about Knowledge Discovery and Data Mining in the Introduction, such as giving their definitions and a comparison between them, so that the audience can understand text mining more clearly.<br />
<br />
The paper and corresponding summary seems to be more breadth-focused and extremely high-level. I think this paper could've been taken a step further by including applications of the various algorithms. For example, the task of topic modelling which is highly customizable, and preprocessing which is dependent on the domain, can be accomplished using many approaches: transforming the processed texts to feature vectors using BERT or pre-trained Word2Vec; and then applying unsupervised learning methods such as LDA and clustering. Within the biomedical application mentioned, it might be of interest to look into BioBERT, which is trained on more domain-specific texts (https://koreauniv.pure.elsevier.com/en/publications/biobert-a-pre-trained-biomedical-language-representation-model-fo). <br />
<br />
The paper summary is great as it describes a very important topic in today's world and how technology must adapt to the vast increase in data creation. Particularly, it is really cool to see how machine learning is used in a multi-disciplinary manner.<br />
<br />
Text mining is described in this paper as a very compute-intensive process. Given "a 50 times growth since 2010" to 2020, it would be nice to have a method of scaling the data in this domain to be prepared for even bigger growth over the next decade. It would therefore have been nice if the researchers had included performance metrics (both computational and classification performance) of the system with the different classifiers described in this paper. Lastly, it would be nice to see comparisons of the ROUGE metrics for summarization that the researchers were able to achieve using the text mining technique they introduced.<br />
<br />
The topic is detailed. Many factors can affect the results of text mining, since writing habits, in terms of vocabulary and the way a person constructs sentences, can differ greatly. Also, in the Internet era new words are constantly invented online, and some of them are eventually recorded in dictionaries as casual vocabulary. A good machine learning model should be able to keep learning new words as humans do, and to correctly interpret text through literary structures like metaphors, slang, and folklore.<br />
<br />
== References ==<br />
<br />
[1] Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering, and extraction techniques. arXiv preprint arXiv:1707.02919.<br />
<br />
[2] Cresswell, Kathrin & Cunningham-Burley, Sarah & Sheikh, Aziz. (2018). Healthcare robotics - a qualitative exploration of key challenges and future directions (Preprint). Journal of Medical Internet Research. 20. 10.2196/10410.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_System&diff=49612Research Papers Classification System2020-12-06T23:41:22Z<p>Y2587wan: /* Critique */</p>
<hr />
<div>= Presented by =<br />
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li<br />
<br />
= Introduction =<br />
With the rapid advance of computer science and information technology, an overwhelming number of papers have been published, and this sheer volume has made it incredibly hard to find and categorize papers. This paper introduces a paper classification system that utilizes Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system uses to process big data is the Hadoop Distributed File System (HDFS). The system can handle quantitatively complex research paper classification problems efficiently and accurately.<br />
<br />
===General Framework===<br />
<br />
The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts. <br />
<br />
[[ File:Systemflow.png |right|image on right| 400px]]<br />
<ol><li>Paper Crawling <br />
<p>Collects abstracts from research papers published during a given period</p></li><br />
<li>Preprocessing<br />
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li><br />
<li>generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol><br />
</p></li> <br />
<li>Topic Modelling<br />
<p> Use the LDA to group the keywords into topics</p><br />
</li><br />
<li>Paper Length Calculation<br />
<p> Calculates the total number of occurrences of words, using the map-reduce algorithm, to prevent unbalanced TF values caused by the varying lengths of abstracts</p><br />
</li><br />
<li>Word Frequency Calculation<br />
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p><br />
</li><br />
<li>Document Frequency Calculation<br />
<p> Calculates the Document Frequency (DF) values, which represent the frequency of keywords across a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p><br />
</li><br />
<li>TF-IDF calculation<br />
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p><br />
</li><br />
<li>Paper Classification<br />
<p> Classify papers by topics using the K-means clustering algorithm.</p><br />
</li><br />
</ol><br />
<br />
<br />
===Technologies===<br />
<br />
The massive paper data is processed on HDFS with a Hadoop cluster composed of one master node, one sub-node, and four data nodes. The TF-IDF calculation is performed with Hadoop 2.6.5 in Java, the LDA with Spark MLlib, and the K-means clustering with the Scikit-learn library.<br />
<br />
===HDFS===<br />
<br />
The Hadoop Distributed File System (HDFS) was used to process big data in this system. HDFS has been shown to process big data rapidly and stably with high scalability, which makes it a good choice for this problem. Hadoop breaks a large collection of data into partitions and passes each partition to an individual processor; each processor only has information about the partition of data it has received.<br />
<br />
'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''<br />
<br />
=Data Preprocessing=<br />
===Crawling of Abstract Data===<br />
<br />
Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.<br />
<br />
An abstract is crawled and its stop words are removed. Stop words are words that are usually ignored by search engines, such as “the” and “a”. Afterward, nouns are extracted as a more condensed representation for efficient analysis.<br />
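To make this step concrete, here is a minimal Python sketch of the stop-word filter. The stop list is a tiny illustrative assumption, not the authors' dictionary, and real noun extraction would additionally require a part-of-speech tagger:<br />

```python
# Illustrative sketch only: STOP_WORDS is a tiny hand-made list, not the
# stop-word dictionary the authors used.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for"}

def remove_stop_words(abstract):
    # keep purely alphabetic tokens, lowercased, that are not stop words
    tokens = [t.lower() for t in abstract.split() if t.isalpha()]
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The system classifies papers in a distributed cluster"))
# ['system', 'classifies', 'papers', 'distributed', 'cluster']
```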
<br />
This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce, an easy-to-use programming model and implementation for processing and generating large data sets. The user must specify (i) a map procedure, that filters and sorts the input data to produce a set of intermediate key/value pairs and (ii) a reduce function, which performs a summary operation on the intermediate values with the same key and returns a smaller set of output key/value pairs. The MapReduce interface enables this process by grouping the intermediate values with the same key and passing them as input to the reduce function. For example, one could count the number of times various words appear in a large number of documents by setting your map procedure to count the number of occurrences of each word in a single document, and your reduce function to sum all counts of a given word [[https://dl.acm.org/doi/pdf/10.1145/1327452.1327492?casa_token=_Zg_DWxQzKEAAAAA:EHII0CaP36_ojGMT8huqTGLNMSEc-CKzZAoXBxSXe6pr2WB0DCQvEKa30CFQW0NSbB2-CVo8GcBcJAg 1]].<br />
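The word-count example above can be simulated in a few lines of plain Python. The map and reduce functions below run in-process rather than on a Hadoop cluster, and the document contents are made up for illustration:<br />

```python
# Toy in-process simulation of the MapReduce word-count example.
from collections import defaultdict

def map_phase(doc_id, text):
    # map: emit an intermediate (word, 1) pair per occurrence
    return [(word, 1) for word in text.lower().split()]

def reduce_phase(word, counts):
    # reduce: sum all counts emitted for the same key
    return word, sum(counts)

docs = {1: "big data big cluster", 2: "big data"}

# the framework groups intermediate values by key before reducing
intermediate = defaultdict(list)
for doc_id, text in docs.items():
    for word, one in map_phase(doc_id, text):
        intermediate[word].append(one)

totals = dict(reduce_phase(w, c) for w, c in intermediate.items())
print(totals)  # {'big': 3, 'data': 2, 'cluster': 1}
```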
<br />
===Managing Paper Data===<br />
<br />
To construct an effective keyword dictionary using the abstract data and keywords in all of the crawled papers, the authors categorized keywords with similar meanings under a single representative keyword. This approach is called stemming, a common data-cleaning step in which words are reduced to their word stem; for example, "running" and "ran" would both be reduced to "run". 1394 keyword categories were extracted, which is still too many to compute, so only the top 30 keyword categories are used.<br />
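A toy illustration of suffix-based stemming is given below. The suffix rules are an over-simplified assumption: a real system would use a proper stemmer such as Porter's, and irregular forms like "ran" additionally require lemmatization, which simple suffix stripping cannot handle:<br />

```python
# Crude suffix-stripping sketch of stemming; not a real stemmer.
def crude_stem(word):
    # try longer suffixes first; keep a minimum stem length of 3 characters
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("running"), crude_stem("classifiers"))  # run classifier
```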
<br />
<div align="center">[[File:table_1_kswf.JPG|700px]]</div><br />
<br />
=Topic Modeling Using LDA=<br />
<br />
Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.<br />
<br />
LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> (the probability of word "t" given topic "z") and the document-topic distribution <math>P\left(z | d\right)</math> (the probability of topic "z" within a given document "d") using Dirichlet priors for the distributions, with a fixed number of topics. For each document, obtain a feature vector:<br />
<br />
\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]<br />
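A compact sketch of obtaining these feature vectors is shown below. Note that the paper performs LDA in Spark MLlib; scikit-learn and the four toy "abstracts" here are substitutions made for the sake of a small self-contained example:<br />

```python
# Sketch of computing per-document topic mixtures P(z|d) with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "cloud computing resource scheduling",
    "cloud resource virtual machine scheduling",
    "sensor network energy routing",
    "wireless sensor energy network",
]
counts = CountVectorizer().fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# each row is the feature vector F = (P(z_1|d), P(z_2|d))
doc_topic = lda.fit_transform(counts)

print(doc_topic.shape)     # (4, 2): one topic mixture per document
print(doc_topic[0].sum())  # each row sums to 1: a distribution over topics
```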
<br />
In the paper, the authors extract topics from the preprocessed papers to generate three topic sets, with 10, 20, and 30 topics respectively. The following table shows the 10-topic set with its highest-frequency keywords.<br />
<br />
<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div><br />
<br />
<br />
===LDA Intuition===<br />
<br />
LDA places Dirichlet priors on the topic and word distributions, which allows the algorithm to model a probability distribution ''over prior probability distributions of words and topics''. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles. <br />
<br />
<div align="center">[[File:dirichlet_dist.png|700px]]</div><br />
<br />
A simplex is a generalization of the notion of a triangle to k−1 dimensions, where k is the number of classes. For example, to classify essays into three groups, English, History, and Math, the simplex would be a 2-dimensional triangle; adding Philosophy as a fourth potential class would require a tetrahedron in 3 dimensions. In a Dirichlet distribution, each parameter is represented by a corner of the simplex, so adding parameters increases the dimension of the simplex. As illustrated, when the alphas are smaller than 1 the distribution is dense at the corners, and when the alphas are greater than 1 the distribution is dense at the centre.<br />
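This corner-versus-centre behaviour can be checked empirically by sampling from a 2-simplex Dirichlet with NumPy; the alpha values and sample size below are arbitrary choices for illustration:<br />

```python
# Sampling from a 3-class (2-simplex) Dirichlet shows the effect of alpha:
# alphas below 1 concentrate mass at the corners, alphas above 1 at the centre.
import numpy as np

rng = np.random.default_rng(0)
sparse = rng.dirichlet([0.1, 0.1, 0.1], size=1000)    # corner-heavy samples
dense = rng.dirichlet([10.0, 10.0, 10.0], size=1000)  # centre-heavy samples

# with small alpha, one coordinate usually dominates each sample
print((sparse.max(axis=1) > 0.9).mean())  # large fraction near a corner
print((dense.max(axis=1) > 0.9).mean())   # essentially zero
```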
<br />
The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.<br />
<br />
<div align="center">[[File:LDA_example.png|800px]]</div><br />
<br />
In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.<br />
<br />
=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=<br />
<br />
TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is<br />
* It evaluates the importance of a word within a document<br />
* It evaluates the importance of the word among the collection of all documents<br />
<br />
The inverse of the document frequency accounts for the fact that term frequency will naturally increase as document frequency increases. Thus IDF is needed to counteract a word's TF to give an accurate representation of a word's importance.<br />
<br />
The TF-IDF formula has the following form:<br />
<br />
\[\text{TF-IDF}_{i,j} = TF_{i,j} \times IDF_{i}\]<br />
<br />
where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.<br />
<br />
===Term Frequency (TF)===<br />
<br />
TF evaluates the proportion of a given word in a document. Thus, the TF value indicates the importance of a word, with higher TF corresponding to higher importance.<br />
<br />
In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.<br />
<br />
The formula for TF has the following form:<br />
<br />
\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]<br />
<br />
where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, <math>n_{i,j}</math> stands for the number of times word <math>t_i</math> appears in document <math>d_j</math>, and <math>\sum_k n_{k,j}</math> stands for the total number of occurrences of words in document <math>d_j</math>.<br />
<br />
Note that the denominator is the total number of words remaining in document j after crawling.<br />
<br />
===Document Frequency (DF)===<br />
<br />
DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is.<br />
<br />
<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:<br />
<br />
\[DF_{i} = \frac{|\{d_k \in D: n_{i,k} > 0\}|}{|D|}\]<br />
<br />
where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.<br />
<br />
Since DF and the importance of the word have an inverse relation, we use inverse document frequency (IDF) instead of DF.<br />
<br />
===Inverse Document Frequency (IDF)===<br />
<br />
In this paper, IDF is calculated on a log scale, since the collection contains a large number of documents, i.e., |D| is large.<br />
<br />
The formula for IDF has the following form:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]<br />
<br />
As mentioned before, the computation runs on HDFS, where a partition may contain no documents with a given word; adding one to the numerator and denominator avoids division by zero. The actual formula applied is:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]<br />
<br />
The inverse document frequency gives a measure of how rare a certain term is in a given document corpus.<br />
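The three formulas above can be implemented directly; the toy documents below are invented for illustration:<br />

```python
# Direct implementation of the TF, smoothed IDF, and TF-IDF formulas above.
import math
from collections import Counter

docs = [
    ["cloud", "data", "cloud"],
    ["data", "privacy"],
    ["cloud", "privacy", "privacy"],
]
D = len(docs)

def tf(word, doc):
    return Counter(doc)[word] / len(doc)    # n_{i,j} / sum_k n_{k,j}

def idf(word):
    df = sum(1 for d in docs if word in d)  # |{d_k in D : n_{i,k} > 0}|
    return math.log((D + 1) / (df + 1))     # smoothed against division by zero

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

print(tf("cloud", docs[0]))  # 2/3: "cloud" is 2 of the 3 words
print(idf("cloud"))          # log(4/3): "cloud" appears in 2 of 3 documents
```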
<br />
=Paper Classification Using K-means Clustering=<br />
<br />
K-means clustering is an unsupervised learning algorithm that groups similar data into the same cluster. It is an efficient and simple method that can be applied to different types of data attributes, and it is also flexible enough to handle various kinds of noise and outliers.<br />
<br><br />
<br />
Given a <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm assigns each <math>\mathbf{x}_j</math> to one of <math>k</math> clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.<br />
<br><br />
<br />
Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimize the sum of square error:<br />
<br />
\begin{align*}<br />
\min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2<br />
\end{align*}<br />
<br />
where<br />
<ul><br />
<li><math>k</math>: the number of clusters</li><br />
<li><math>C_i</math>: the <math>i^{th}</math> cluster</li><br />
<li><math>x_j</math>: the <math>j^{th}</math> data point in <math>C_i</math></li><br />
<li><math>\mu_i</math>: the centroid of <math>C_i</math></li><br />
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li><br />
</ul><br />
<br><br />
<br />
The K-means clustering algorithm is chosen for its ability to deal with different types of attributes, to run with minimal domain knowledge, to handle noise and outliers, and to produce clusters of similar items. <br />
<br />
<br />
Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.<br />
<br><br />
<br />
However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math> that involves plotting the average of the squared distances from the cluster centers of the respective clusters (distortion) as a function of <math>k</math> and choosing a <math>k</math> at which point the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The Silhouette scheme is a measure of how well the objects lie within each cluster. Silhouette scores lie from -1 to 1. A positive score indicates that the object is well-matched with its own cluster, while a negative score indicates the opposite (Kaufman & Rousseeuw, 2005). The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.<br />
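A sketch of this clustering-and-validation loop using scikit-learn (which the paper also uses for K-means) is given below; the synthetic feature vectors stand in for real TF-IDF values:<br />

```python
# Fit K-means for several k, recording the distortion (for the Elbow scheme)
# and the mean Silhouette value (clustering validated when > 0.5).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated synthetic groups of "papers" in a 5-dimensional space
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(30, 5)) for c in (0.0, 2.0, 4.0)])

for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the assigned centroids
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 2))
```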
<br />
=System Testing Results=<br />
<br />
In this paper, the dataset consists of 3264 research papers from the Future Generation Computer Systems (FGCS) journal, published between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors introduce the three methods shown below:<br />
<br />
<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div><br />
<br />
<br />
Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:<br />
<br />
<div align="center">[[File:table_4_nocobes.JPG|700px]]</div><br />
<br />
According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages of both method 1 and method 2 and thus requires the fewest clusters in total. On the other hand, wrong keywords might be present in papers, so method 1 might fail to group papers with similar subjects correctly, which is why it needs the largest number of clusters in total.<br />
<br />
<br />
Next, the Silhouette scheme was used to measure the clustering performance. The average Silhouette values for each method with different numbers of keywords are shown below:<br />
<br />
<div align="center">[[File:table_5_asv.JPG|700px]]</div><br />
<br />
Since the clustering is validated when the Silhouette value is greater than 0.5, the K-means clustering algorithm produces good results for the methods with 10 and 30 keywords.<br />
<br />
<br />
To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. They ran the experiment 5 times, using 500 randomly selected research papers in each trial. The following histogram shows the average F-Score for the three methods and different numbers of keywords:<br />
<br />
<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div><br />
<br />
Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The numbers 10, 20, and 30 after each method indicate the number of keywords the method has used.<br />
According to the histogram above, method 3 has higher F-Score values than the other two methods across different numbers of keywords. Therefore, the classification system is most accurate when using method 3, as it combines the advantages of both method 1 and method 2.<br />
<br />
=Conclusion=<br />
<br />
This paper introduces a classification system that classifies research papers into different topics using the TF-IDF and LDA schemes with the K-means clustering algorithm. The experimental results showed that the proposed system can group papers with similar subjects according to the keywords extracted from the abstracts. The authors emphasized that the system can be implemented efficiently on high-performance computing infrastructure using industry-standard technologies. This system allows users to find the papers they want quickly and productively.<br />
<br />
Furthermore, this classification system might also be applied to other types of texts (e.g. documents, tweets, etc.) instead of only research papers.<br />
<br />
=Critique=<br />
<br />
In this paper, DF values are calculated within each partition. As a result, the DF value of a given word varies across partitions, and different partitioning schemes may produce inconsistent results. As mentioned above, a divide-by-zero problem can arise when a partition has no documents containing a given word; the authors solve this by introducing a dummy document. A better way to solve both the inconsistency and the divide-by-zero problem would be to have all partitions exchange their DF counts and then pass the merged DF value back to every partition for the final IDF and TF-IDF calculation. Merging DF counts guarantees a consistent DF value across all partitions and avoids division by zero, since every word in the keyword dictionary must appear in some document in the whole collection.<br />
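Merging partition-level DF counts into a single global DF can be sketched in a few lines; the partition counts below are invented for illustration:<br />

```python
# Each partition computes local document frequencies; the counts are then
# merged into one global DF before computing IDF.
from collections import Counter

partition_dfs = [
    Counter({"cloud": 5, "data": 3}),     # DF counts from partition 1
    Counter({"cloud": 2, "privacy": 4}),  # DF counts from partition 2
]

global_df = Counter()
for local in partition_dfs:
    global_df.update(local)  # sum counts word by word

print(dict(global_df))  # {'cloud': 7, 'data': 3, 'privacy': 4}
```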
<br />
This paper treated the words in different parts of a document equivalently; it might perform better if it gave different weights to the same word in different parts. For example, a word that appears in the title of a document usually signals a main topic, so more weight could be put on it during categorization.<br />
<br />
When discussing the potential processing advantages of this classification system for other types of text samples, was the effect of processing mixed samples (text and image, or text and video) taken into consideration? If not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?<br />
<br />
The preprocessing should also include <math>n</math>-gram tokenization for topic modelling, because some topics are inherently two words; for example, "machine learning" seen as two separate words implies different topics.<br />
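For illustration, a minimal bigram tokenizer makes this point concrete:<br />

```python
# Bigram tokenization keeps a phrase like "machine learning" together as
# one unit instead of two unrelated words.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["machine", "learning", "paper", "classification"]
print(ngrams(tokens, 2))
# ['machine learning', 'learning paper', 'paper classification']
```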
<br />
This system is very compute-intensive due to the large dictionaries that can be generated when processing large volumes of data. It would be nice to see how much data HDFS had to process and, similarly, how much time was saved by using Hadoop for data processing as opposed to a centralized approach.<br />
<br />
This system could be improved further in terms of computation time by using the MapReduce framework, which can also use HDFS, to parallelize the K-means clustering computation across multiple nodes, as discussed in Jin et al. [5].<br />
<br />
It's not exactly clear what method 3 (TFIDF-LDA) is doing: how is it performing TF-IDF on the topics? It also seems the preprocessing step keeps only the top 10/20/30 words. This is an extremely low number, especially in comparison with the LDA, which has 10/20/30 topics; what is the reason for so strongly limiting the number of words? It would also be interesting to see whether both keywords and topics are necessary; an ablation study showing the significance of each would be interesting.<br />
<br />
It would be better if the paper included an example with some topics on some research papers. It would also be better if the distance between each research paper and the topic names could be visualized.<br />
<br />
I am interested in the first step of the general framework, the Paper Crawling step. Many conferences actually require the authors to indicate several keywords that best describe a paper. For example, a database paper may have keywords such as "large-scale database management", "information retrieval", and "relational table mining". So in addition to crawling text from the abstract, it may be more effective to crawl these keywords directly. Not only would this require less time, but these keywords may also lead to better performance than the nouns extracted from the abstract section. I am also slightly concerned about the claim made in the paper that "Our methodologies can be applied to text outside of research papers". Research papers are usually carefully revised and well-structured; extending the algorithm described in the paper to arbitrary free text could be difficult in practice.<br />
<br />
The paper has very meaningful motivation, since associating research topics and finding all the relevant previous work is indeed a challenging task at the initial stage of research. It is easy to miss a relevant paper published years ago that might be crucial to your own work. However, the classification task the authors tested in this work is of limited use, as the classification is too high-level. Classifying papers into categories like "cloud bigdata" or "IoT privacy" is too general to be meaningful: it simply classifies the primary field of computer science into its direct subfields, while most researchers work on a niche much narrower than a subfield. Most online paper databases, including arXiv, take care of subfield and even sub-subfield classification at the submission stage, which leaves the authors' system with limited applicability. What we truly need is an algorithm able to classify and cluster papers based on detailed research topics and methodology. <br />
<br />
It would be better if the authors provided some applications or examples of the algorithm in the real world; this would help readers understand the algorithm.<br />
<br />
The summary clearly goes through the model framework, from data preprocessing through prediction to testing. It could be enhanced by applying this model to other similar use cases and reporting how well the predictions go.<br />
<br />
It would be better if there were a comparison of the BM25 algorithm vs. TF-IDF, which are commonly compared in IR papers.<br />
<br />
The paper misses details on the subjects of the research papers used to perform the classification. If the majority of research papers were about one subject, the system could potentially produce biased results.<br />
<br />
The paper omits the reason why Method 3 for constructing the keyword dictionaries requires the smallest number of k clusters even though it is a combination of methods 1 and 2. It would be of interest to investigate why Method 3 uses so few clusters (in comparison), as it seemed to be the most accurate of the 3 methods. (Also, the graph comparing the results could be improved by using a greater variety of hues, as it is difficult to distinguish some scores such as TFIDF_30 and TFIDF-LDA_30.)<br />
<br />
TF-IDF is interesting as it provides a normalized method to extract the most frequent terms in a paper, but the method still has room for improvement. For example, in some machine learning papers where special operations are performed on datasets, the name of a dataset may appear many times throughout the paper, while the main theme, a novel machine learning algorithm, may be mentioned only once. In that case mis-predictions may occur. A possible improvement is to weight keywords by the section they appear in, i.e. the most frequent word in the Abstract would carry more weight than the most frequent word in the Introduction.<br />
<br />
In my opinion, the paper glosses over a few technicalities. First, how does the proposed algorithm deal with subgroups and nested groups? The paper assumes only one level of sorting, which may work for a sufficiently distinct set of papers, but since the problem is meant to be generalized, many papers will require multi-level sorts. For example, the category 'machine learning' can be further divided into 'supervised' and 'unsupervised'. Is the algorithm able to handle this, or would it create 2 groups (i.e. ML-supervised and ML-unsupervised)? Second, a popular LDA model is available through the gensim package, which utilizes relevancy and saliency metrics; how does that factor into the quality of the topics? Third, what is the motivation for using TF-IDF scores for clustering? In my experience, Word2Vec and BERT have been the industry standard for obtaining vectors to perform clustering on text.<br />
<br />
When working with a larger data set, Spark might be more efficient than Hadoop. When working with natural language, preprocessing is very important and can significantly determine the accuracy of the results, so different preprocessing techniques should be compared. Lastly, PCA or t-SNE might be a good way to visualize the resulting data.<br />
<br />
=References=<br />
<br />
[1] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.<br />
<br />
[2] Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7<br />
<br />
[3] Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879<br />
<br />
[4] Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY<br />
<br />
[5] Jin, Cui, Yu. (2016). A New Parallelization Method for K-means. https://arxiv.org/ftp/arxiv/papers/1608/1608.06347.pdf<br />
<br />
[6] Kaufman, L., & Rousseeuw, P. J. (2005). Graphical Output Concerning Each Clustering. In Finding groups in data : An introduction to cluster analysis (pp. 84-85). Hoboken, New Jersey: John Wiley & Sons. doi:10.1002/9780470316801</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice&diff=49606Speech2Face: Learning the Face Behind a Voice2020-12-06T23:20:52Z<p>Y2587wan: /* Discussion and Critiques */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face, which utilizes millions of Internet/YouTube videos of people speaking to learn the correlation between a voice and the corresponding face. By learning the correlations between faces and voices through a self-supervised procedure, the model produces facial reconstructions that capture specific physical attributes such as a person's age, gender, or ethnicity. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. It explores what types of facial information can be extracted from speech without the constraints of predefined facial characterizations. Without any prior information or accurate classifiers, the reconstructions revealed correlations between craniofacial features and voice, in addition to correlations between dominant features (gender, age, ethnicity, etc.) and voice. The evaluation numerically quantifies how closely the reconstructions produced by the Speech2Face model resemble the true face images of the respective speakers.<br />
<br />
== Ethical Considerations ==<br />
<br />
The authors note that due to the potential sensitivity of facial information, they have chosen to explicitly state some ethical considerations. The first of which is privacy. The paper states that the method cannot recover the true identity of the face or produce faces of specific individuals, but rather will show average-looking faces. The paper also addresses that there are potential dataset biases that exist for the voice-face correlations, thus the faces may not accurately represent the intended population. Finally, it acknowledges that the model uses demographic categories that are defined by a commercial face attribute classifier.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged an existing database of videos to learn a generic audio representation for classifying whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of settings, including cross-modal retrieval, sound source localization, and sound source separation. This also paved the path for specifically studying the association between faces and voices in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been used in binary or multi-task classification tasks, with some promising results. Studies have been able to identify active speakers of a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the speakers based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well-established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.<br />
<br />
Recently, various methods have been suggested to reconstruct visual information from audio signals, where the reconstruction is subject to a priori constraints. Notably, Duarte et al. [2] were able to synthesize the exact face images and expressions of a speaker from speech using a GAN model. A generative adversarial network (GAN) is a model that uses a generator to produce seemingly plausible data for training and a discriminator that identifies whether the training data is fabricated by the generator or real [7]. This paper instead hopes to recover the dominant, generic facial structure from speech.<br />
<br />
== Motivation ==<br />
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we have seen what they look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors are more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg|center]]<br />
<br />
<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div><br />
<br />
<br />
<br />
The Speech2Face model used to achieve the desired result consists of two parts: a voice encoder, which takes a spectrogram of speech as input and outputs low-dimensional face features, and a face decoder, which takes face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The variability in facial expressions, head positions, and lighting conditions of the face images creates a challenge for both the design and training of the Speech2Face model: the model must factor out many irrelevant variations in the data and implicitly extract meaningful internal representations of faces. To avoid this problem, the model is trained to first regress to a low-dimensional intermediate representation of the face. <br />
<br />
'''Face Decoder''' <br />
The face decoder is taken from the previous work of Cole et al. [3]: the VGG-Face model (a face recognition model pretrained on a large-scale face database [5]) is used to extract a 4096-D face feature from the penultimate layer of the network. The decoder will not be explored in great detail here, but in essence the face recognition features are passed through a single multilayer perceptron layer, the result of which is fed to a convolutional neural network that determines the texture of the image and to a multilayer perceptron that determines the landmark locations. The face decoder kept the VGG-Face model's dimensions and weights. The decoder's weights were trained separately and remained fixed during the voice encoder training. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG|center]]<br />
<br />
<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div><br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, and batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and to be used for input speech of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
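The temporal-only pooling described above can be illustrated with a small sketch (pure Python on a toy grid; the function name and shapes are illustrative, not the paper's implementation, which operates on spectrogram tensors):<br />

```python
def max_pool_temporal(spectrogram, stride=2):
    """Max-pool a [time][frequency] grid along the time axis only,
    keeping the frequency dimension intact."""
    pooled = []
    for t in range(0, len(spectrogram) - stride + 1, stride):
        window = spectrogram[t:t + stride]
        # element-wise max across the temporal window, per frequency bin
        pooled.append([max(col) for col in zip(*window)])
    return pooled

# 4 time steps x 3 frequency bins -> 2 time steps x 3 frequency bins
spec = [[1, 5, 2],
        [3, 4, 6],
        [7, 0, 1],
        [2, 8, 3]]
print(max_pool_temporal(spec))  # [[3, 5, 6], [7, 8, 3]]
```

Note that the number of frequency bins (3 here) is unchanged by the pooling; only the temporal dimension shrinks.<br />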
<br />
'''Training'''<br />
<br />
The AVSpeech dataset, a large-scale audio-visual dataset, is used for training. It comprises millions of video segments from YouTube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which clearly affects the model. Also note that facial features that are irrelevant to speech, like hair color, may still be predicted by the model. From each video, a 224x224-pixel image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, training and test sets of 1.7 and 0.15 million entries, respectively, were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in practice it leads to unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the VGG-Face model <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math> and the first layer of the face decoder <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes both the normalized Euclidean distance between the two facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to cross-entropy. Following the recommendation of Cole et al. [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that the magnitudes of the gradients of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
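As a rough illustration of the knowledge distillation term, the following sketch computes <math>L_{distill}</math> for two small activation vectors with <math>T = 2</math> (the vector values and function names are made up for the example; real activations are 2622-dimensional):<br />

```python
import math

def softmax_T(logits, T=2.0):
    """Temperature-scaled softmax; T = 2 smooths the distribution."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(a, b, T=2.0):
    """Cross-entropy between temperature-softened distributions:
    -sum_i p_i(a) * log p_i(b)."""
    pa, pb = softmax_T(a, T), softmax_T(b, T)
    return -sum(x * math.log(y) for x, y in zip(pa, pb))

# The loss is smallest when the two activation vectors agree.
print(distill_loss([1.0, 2.0], [1.0, 2.0]) < distill_loss([1.0, 2.0], [2.0, 1.0]))  # True
```

The temperature divides the logits before the softmax, so larger <math>T</math> spreads probability mass more evenly and yields smoother gradients than plain cross-entropy on hard targets.<br />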
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div><br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div><br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++, which classifies faces by distinct attributes (such as gender and ethnicity), was used. Figure 3 gives confusion matrices based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, this can be attributed to the vastly unbalanced data, where 50% of the data represented white faces and 80% represented white or Asian faces. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 2. '''Feature similarity''' </div><br />
<br />
<br />
<br />
Another evaluation of the results is the similarity of the features predicted by the Speech2Face model. The cosine, L1, and L2 distances between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed and are presented above in Table 2. Facial similarity was also compared based on the length of the audio input. From the table, it is evident that the 6-second audio produced lower cosine, L1, and L2 distances, resulting in a facial feature vector that is closer to the ground truth. <br />
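The three distance measures in Table 2 can be computed as follows (a minimal sketch on toy 3-D vectors; the actual feature vectors are 4096-D, and the values here are invented for illustration):<br />

```python
import math

def l1(u, v):
    """L1 (Manhattan) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2(u, v):
    """L2 (Euclidean) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

v_f = [1.0, 0.0, 2.0]   # illustrative "true" face feature
v_s = [0.9, 0.1, 2.1]   # illustrative predicted feature
print(l1(v_f, v_s), l2(v_f, v_s), cosine_distance(v_f, v_s))
```

Lower values on all three measures indicate a predicted feature vector closer to the ground truth, which is how Table 2 compares the 3-second and 6-second inputs.<br />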
<br />
'''S2F -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div><br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K closest images to the model output includes the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
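A sketch of the R@K computation (the gallery distances and function name are illustrative; in the paper the gallery is a database of real face images ranked by feature distance):<br />

```python
def recall_at_k(distances_to_gallery, true_index, k):
    """R@K: 1 if the true face is among the K gallery images
    closest to the predicted feature, else 0."""
    ranked = sorted(range(len(distances_to_gallery)),
                    key=lambda i: distances_to_gallery[i])
    return 1 if true_index in ranked[:k] else 0

# Toy gallery of 5 faces; the true face (index 2) is the 2nd closest.
dists = [0.9, 0.4, 0.5, 0.8, 0.7]
print(recall_at_k(dists, true_index=2, k=1))  # 0
print(recall_at_k(dists, true_index=2, k=2))  # 1
```

Averaging this indicator over many test queries gives the probabilities reported in Table 3; random chance corresponds to K divided by the gallery size.<br />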
<br />
'''Additional Observations''' <br />
<br />
Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.<br />
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity: over half of the samples were white, and we also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: the same man, heard speaking in either English or Chinese, was predicted to have a "white" appearance or an "Asian" appearance respectively.<br />
<br />
2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, English being the most popular language on YouTube (the training set). Testing a more controlled sample where all speech recordings are in the same language may help address this concern and determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the results is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may have a compounding effect on the misclassification rate.<br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show statistics about the speakers' ethnicity or the language of the videos. If the model were trained on a more comprehensive dataset that includes enough Asian/Indian English speakers and native speakers of other languages, would this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the YouTube videos, is that resolution varies a lot since the videos are randomly selected. That may be why the proposed model performs poorly on certain features. For example, it is hard to tell a person's age when the resolution is poor because the wrinkles on the face are lost.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt this model will be practical for real-world problems, because many factors affect a person's sound in a real-world environment, such as a phone alarm, TV, or car horn. These sounds will decrease the accuracy of the model's predictions.<br />
<br />
7. A lot of information can be obtained from someone's voice, which can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recordings are quite common, and images of potential suspects could be reconstructed based on their voices. For this to be achieved, the model has to be thoroughly trained and tested to avoid false positives, as they could have a highly destructive outcome for a falsely convicted suspect.<br />
<br />
8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses YouTube videos for training, one problem is that most YouTubers are adults, and many additional reasons make this dataset highly unbalanced. What is more, some people have a baby voice, which could also affect the performance of the model. Overall, this is a meaningful topic; it might help police locate suspects, so it might be interesting to apply it to police work.<br />
<br />
9. In addition, it seems very unlikely that any results coming from this model would ever be considered even remotely admissible in court to identify a person of interest until the results are improved and the model is shown to work in real-world applications. Otherwise, there seems to be very little use for such technology, and it could have negative impacts on people if they were depicted in an unflattering way by the model based on their voice.<br />
<br />
10. Using voice as a factor for constructing the face is a good idea, but it seems the data they have will have lots of noise and bias. The voice in a video might not come from the person shown in it. Many YouTubers adjust their voices before uploading their videos, and it's really hard to know whether they do. Also, most YouTubers are adults, so the model does not have enough training samples of teenagers and kids.<br />
<br />
11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and also with different face models (encoders/decoders), to see if better performance can be achieved. Also, given that the dataset used was unbalanced, was the dataset used to train the face model the same dataset, or was a different one used (the model was pretrained)? This could affect the performance of the model as well.<br />
<br />
12. The audio input is transformed into a spectrogram before being used for training. They use STFT with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see whether there is a better way to perform the STFT, possibly with different hyperparameters (e.g. different windowing or a different number of bands), or whether another type of transform (e.g. a wavelet transform) would yield better results.<br />
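For reference, the stated STFT settings imply spectrogram dimensions along these lines (a sketch assuming a 16 kHz sampling rate, which is an assumption not stated in this summary):<br />

```python
# Spectrogram dimensions implied by the stated STFT settings,
# assuming a 16 kHz sampling rate (assumption).
sr = 16_000                 # samples per second
win = int(0.025 * sr)       # 25 ms Hann window -> 400 samples
hop = int(0.010 * sr)       # 10 ms hop        -> 160 samples
n_fft = 512                 # -> n_fft // 2 + 1 frequency bins

def spectrogram_shape(duration_s):
    """(time frames, frequency bins) for a clip of the given duration."""
    n = int(duration_s * sr)
    frames = 1 + (n - win) // hop
    return frames, n_fft // 2 + 1

print(spectrogram_shape(6.0))  # frames x frequency bins for 6 s of audio
```

Changing the window length, hop, or number of FFT bands trades temporal resolution against frequency resolution, which is exactly the hyperparameter choice questioned above.<br />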
<br />
13. An easy way to get somewhat balanced data is to duplicate (oversample) the examples from underrepresented classes.<br />
<br />
14. This problem is interesting but hard to generalize. The algorithm doesn't account for other genders or mixed-race individuals. In addition, the face recognition software Face++ introduces bias, which can carry forward to the Speech2Face algorithm. Face recognition algorithms are known to have higher error rates when classifying darker-skinned individuals. Thus, it will be tough to apply it to real-life scenarios like identifying suspects.<br />
<br />
15. This experiment raises a lot of ethical complications when it comes to possible applications in the real world. Even if this model were highly accurate, the implications of being able to discern a person's ethnicity, skin tone, etc. based solely on their voice could play into inherent biases of the application's user, and this may end up being an issue that needs to be combated in future research in this area. Another possible issue is that many people change their intonation or vocal features based on the context (I'll likely have a different voice pattern in a job interview in terms of projection, intonation, etc. than if I were casually chatting/mumbling with a friend while playing video games, for example).<br />
<br />
16. Overall a very interesting topic. I want to talk about the technical challenges raised by using the AVSpeech dataset for training. The paper acknowledges that AVSpeech is unbalanced, with 80% of the data being white and Asian, and the results section says the model does not perform well on other races due to the imbalance in the data. There does not seem to be any effort made to balance the data. I think there are definitely data processing techniques that could be used (filtering, data augmentation, etc.) to address the class imbalance problem; not seeing any of these in the paper is a bit disappointing. Another issue I have noticed is that, due to ethical considerations, the model aims to predict an average-looking face for a certain gender/racial group from voice input. If we cannot reveal the identity of a person, why don't we predict the gender and race directly? Giving an average-looking face does not seem to be the most helpful.<br />
<br />
17. A very interesting research paper, with an interesting main objective. This research leads to open questions that can be applied to other applications, such as predicting a person's face from their voice in more advanced ways. The main risk is how the data is obtained from YouTube, where the data is not consistent.<br />
<br />
18. The paper uses millions of natural videos of people speaking to find the correlation between face and voice. Since face and voice are commonly used as a person's identity, there are many possible research opportunities and applications, such as improving voice and face unlock.<br />
<br />
19. It would be better to have a future work section discussing the current shortcomings and exploring possible improvements and applications.<br />
<br />
20. While the idea behind Speech2Face is interesting, ethnic profiling is a huge concern, and it can further lead to racial discrimination, racism, etc. Developers must put more care and thought into applying Speech2Face before deploying products based on it.<br />
<br />
21. It would be helpful if the authors could explore different real-life applications of this project. Speech2Face could be helpful in criminal investigations, essentially in scenarios where someone's picture is missing and only their voice is available. It would also be helpful if the authors could state the importance of and need for such a project in society.<br />
<br />
22. The authors mention that they use the AVSpeech dataset for both training and testing but do not talk about how they split the data. It is possible that the same speakers were used in the training and testing data and so the model is able to recreate a face simply by matching the observed face to the observed audio. This would explain the striking example images shown in the paper.<br />
<br />
23. Another interesting application of this research is automated speech or facial animation at scale or in multiple languages. The cutting-edge automated facial animation solution provided by JALI Research Inc is applied in Cyberpunk 2077.<br />
<br />
24. It would be interesting to know whether the model can predict a similar face when a person speaks different languages. A person who speaks multiple languages can have different tones and accents depending on the language they speak.<br />
<br />
25. The results are actually amazing for the introduction of Speech2Face. As others have mentioned, the researchers might have used a biased dataset of YouTube videos favoring certain ethnicities and their accents and dialects, so it would be nice to also see the data distribution. Additionally, it would be nice to see how the model reacts to people who speak multiple languages, and how well Speech2Face generalizes across one person's pronunciations in different languages.<br />
<br />
26. The paper introduces Speech2Face, which is definitely one of the major areas of research for the future. In the paper, the confusion matrix indicates that the model tends to misclassify based on the age of the speaking person, specifically between ages 40 and 70. It would be interesting to see whether the model could improve on this bottleneck by training on more speech from the 40-70 age group.<br />
<br />
27. An interesting topic, and, as others have mentioned, one with many ethical considerations and implications. Particularly in regions where call recording is permitted, there is dangerous potential for the technology to be misused to identify and target individuals. It would also be interesting to see a more in-depth exploration of how the spoken language and accents introduce bias. For example, if a person speaks with a strong British accent, are they classified as white? Spanish speakers in particular vary greatly in their skin colour and features; how well does the algorithm work on these individuals? A last nit-pick is the labelling used (i.e. Asian, White, Indian, Black), as this is not accurate, since Indians, and South Asians more broadly, fall under Asian as well.<br />
<br />
28. This topic is quite interesting, and it could contribute greatly to fighting crime, but for that the accuracy is essential. There is still room for much improvement, since telling a person's face from their voice is hard: many factors, such as oral structure, language environment, and even personality, play a role. Great bias could result from these unpredictable factors.<br />
<br />
29. This is an interesting topic that could be of great use for finding criminals or other people from recordings of their voices. However, the recordings might be noisy, and some might include the voices of multiple people. Ways to eliminate such factors that might affect the accuracy of the face generation could be considered.<br />
<br />
30. Most of the content described in the paper is very useful. However, YouTube might not be a good enough data source, since there are fewer labels to classify. Perhaps, after training the model, transfer learning could be done on Facebook's videos to address the imbalance problem.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision (ICCV), 2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.<br />
<br />
[6] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NIPS), 2016.<br />
<br />
[7] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 24-May-2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNet&diff=49604Evaluating Machine Accuracy on ImageNet2020-12-06T23:12:35Z<p>Y2587wan: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is the most influential dataset in machine learning, with images and corresponding labels across 1,000 classes. This paper intends to explore the causes of performance differences between human experts and machine learning models, specifically CNNs, on ImageNet. <br />
<br />
Firstly, some images can belong to multiple classes. As a result, it is possible to underestimate performance if we assign each image only one label, which is what the top-1 metric does. The top-5 metric, on the other hand, looks at the top five predictions by the model for an image and checks whether the target label is within those five predictions (Krizhevsky, Sutskever, & Hinton). Therefore, both top-1 and top-5 metrics are adopted; the performances of models, unlike those of human labelers, are linearly correlated between the two metrics.<br />
<br />
Secondly, in contrast to the uniform performance of models across classes, humans tend to achieve better performance on inanimate objects. Human labelers achieve overall accuracies similar to the models, which indicates room for machines to improve on specific classes.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the multi-class prediction accuracy of models drops when the test set is drawn from a different distribution than the training set, as with ImageNetV2, but this shift in distribution does not cause a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers A, B, and C provided multi-label annotations for a subset from the ImageNet validation set, and all images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset. <br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching them the distinctions between very similar classes in the training set. For example, there are 118 classes of dog within ImageNet, and typical human participants will not have working knowledge of the name of each breed even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To do this, labelers were trained on class-specific tasks for groups like dogs, insects, monkeys, beavers, and others. They were also given immediate feedback on whether they were correct and were asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling. This allowed the labelers to really understand the finer details of each class.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples, which contain 1,000 images from the 20,000 annotated images are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
Top-1 accuracy is the standard accuracy measure used in classification studies: it measures the proportion of examples for which the predicted label matches the single target label. However, many images contain more than one object; for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more, and Figure 3b shows a prominent, centered subject yet is labeled otherwise (people vs. picket fence). A single target label is therefore too stringent for this task: it punishes predictions that correctly identify a main object in the image but happen not to match the one chosen target label.<br />
===Top-5 accuracy===<br />
Top-5 accuracy considers a classification correct if the target label appears among the top 5 predicted labels. Although this partially resolves the problem with Top-1, it is still not ideal since it can trivialize class distinctions. For instance, the dataset contains five turtle classes that are difficult to distinguish; a model can predict all five and be counted correct without actually making the distinction.<br />
<br />
===Multi-label accuracy===<br />
The paper then proposes that every image have a set of target labels, and that a prediction be considered correct if it matches any one of those labels. Given the limitations of the Top-1 and Top-5 metrics discussed above, the paper argues that this metric is necessary for rigorous accuracy evaluation on the dataset. <br />
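The three metrics differ only in what counts as a correct prediction. A minimal sketch, using the desk image of Figure 3a with hypothetical ranked predictions and a hypothetical multi-label set:<br />

```python
def top1_accuracy(preds, targets):
    # preds: one ranked label list per image; targets: single gold label per image.
    return sum(p[0] == t for p, t in zip(preds, targets)) / len(targets)

def top5_accuracy(preds, targets):
    # Correct if the single target label is anywhere in the top 5 predictions.
    return sum(t in p[:5] for p, t in zip(preds, targets)) / len(targets)

def multilabel_accuracy(top_preds, label_sets):
    # label_sets: for each image, all labels marked correct by expert reviewers.
    return sum(p in s for p, s in zip(top_preds, label_sets)) / len(label_sets)

# Toy example modeled on the "desk" image of Figure 3a.
preds = [["laptop", "desk", "keyboard", "mouse", "screen"]]
targets = ["desk"]                              # single ImageNet target label
label_sets = [{"desk", "laptop", "keyboard"}]   # multi-label annotation

top1_accuracy(preds, targets)                   # 0.0: "laptop" != "desk"
top5_accuracy(preds, targets)                   # 1.0: "desk" is in the top 5
multilabel_accuracy([p[0] for p in preds], label_sets)  # 1.0: "laptop" is a correct label
```

The multi-label metric still scores only the model's top prediction, but against the full set of reviewer-approved labels rather than one arbitrary target.<br />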
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For images containing more than one object or organism that corresponds to an ImageNet class, the paper proposes adding an additional target label for each entity in the image. In Figure 3b, discussed above, the classes groom, bow tie, suit, gown, and hoopskirt are all present in the foreground and are consequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper treats them as belonging to one broader class: a classification is considered correct if the produced label matches any label in the group. For instance, warthog, African elephant, and Indian elephant all have prominent tusks, so they are treated as subclasses of tusker; Figure 3c shows labels modified to include tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases, such as Figure 3d, it is genuinely difficult to determine whether a label is correct due to ambiguities in the class hierarchy.<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on ImageNet and ImageNetV2 and categorized every unique prediction into correct and incorrect labels, so that each image could carry multiple correct labels as the method above requires.<br />
===The multi-label accuracy metric===<br />
A prediction is correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, a second annotation stage is conducted after the human labelers finish labeling. Figure 4 compares Top-1, Top-5, and multi-label accuracies: higher Top-1 and Top-5 accuracy corresponds to higher multi-label accuracy, as expected. Multi-label accuracy is consistently higher than Top-1 yet lower than Top-5, and the three metrics are highly correlated; the paper concludes that the multi-label metric measures a semantically more meaningful notion of accuracy than its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Because three participants took part in the initial round of annotation, they did not look at the data for six months before the evaluation, and two additional annotators who had never seen the data were introduced in the final evaluation phase to ensure the fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Three training regimens address these problems, respectively: first, labelers are assigned extra training tasks with immediate feedback on similar classes; second, labelers are given access to search for specific classes during labeling; finally, the training set contains a reasonable number of images for each class.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1,000 images, are drawn from ImageNet and ImageNetV2, respectively. They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images, spending a median of 26 seconds per image. After labeling was completed, an additional multi-label annotation session was conducted in which the human predictions for all images were manually reviewed. Compared to the initial round of labeling, 37% of the labels changed due to the participants' greater familiarity with the classes.<br />
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on ImageNet===<br />
From Figure 1, we can see that the difference in accuracies between the two datasets is within 1% for all human participants. As hypothesized, the human testers indeed performed better than the automated models on both datasets. It is worth noting that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Statistical Significance of the Accuracy Differences===<br />
Based on the results shown in Figure 1, the confidence intervals of the best four human participants and the four best models overlap. However, McNemar's paired test yields a p-value of 0.037, rejecting the hypothesis that the FixResNeXt model and human labeler E have the same accuracy on the ImageNet validation set. Figure 1 also shows that the confidence intervals of the labeling accuracies of human labelers C, D, and E do not overlap with that of the best model on ImageNetV2, and with McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, the hypothesis that humans and machine models are equally robust to distribution shift ought to be rejected.<br />
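McNemar's paired test compares two labelers using only the images on which exactly one of them is correct. A sketch of the exact form of the test; the discordant-pair counts below are illustrative, not the paper's:<br />

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar's test on discordant pairs.

    b: images labeler 1 got right and labeler 2 got wrong.
    c: images labeler 2 got right and labeler 1 got wrong.
    Returns the two-sided p-value under H0: equal accuracy.
    """
    n = b + c
    k = min(b, c)
    # Under H0, b ~ Binomial(n, 0.5); sum the smaller tail and double it.
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: labeler 1 wins 15 discordant images, labeler 2 wins 5.
p_value = mcnemar_exact(15, 5)  # two-sided p ≈ 0.041
```

Because both labelers see the same images, pairing on discordant predictions gives more power than comparing the two overall accuracies directly.<br />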
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. Ten images were misclassified by all of the human labelers: one image of a monkey and nine of dogs. In addition, 27 images, 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in the experiment. Only two images, both containing dogs, were labeled wrong by all human labelers and all models. The researchers also noted that the images that are difficult for models are mostly images of objects, while those difficult for human labelers are exclusively images of animals.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand the extent to which models outperform human labelers, the researchers computed the accuracies again while excluding all dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images for the best model and a 1.1% increase on the ImageNetV2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNetV2 images, respectively. The researchers also ran a simulation to show that the increase in human labeling accuracy on non-dog images is significant: they used the bootstrap to estimate the change in accuracy when restricting to non-dog classes, and the simulated increases were smaller than those observed in the experiment. <br />
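The bootstrap check can be sketched as follows; the per-image correctness data here is made up for illustration and is not the paper's:<br />

```python
import random

def bootstrap_accuracy_gain(results, is_dog, n_boot=1000, seed=0):
    """Estimate how much accuracy rises when dog images are excluded.

    results: list of 0/1 correctness per image; is_dog: parallel flags.
    Returns the mean bootstrap estimate of (non-dog accuracy - overall accuracy).
    """
    rng = random.Random(seed)
    n = len(results)
    gains = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        overall = sum(results[i] for i in sample) / n
        non_dog = [i for i in sample if not is_dog[i]]
        if non_dog:  # skip the rare resample containing no non-dog images
            gains.append(sum(results[i] for i in non_dog) / len(non_dog) - overall)
    return sum(gains) / len(gains)

# Toy data: 4 dog images (1 labeled correctly) and 6 non-dog images (all correct).
results = [1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
is_dog = [True] * 4 + [False] * 6
gain = bootstrap_accuracy_gain(results, is_dog)  # close to the true gain of 0.3
```

Resampling the per-image results gives a distribution of the accuracy gain, against which the observed gain can be compared.<br />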
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
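The "fast images" subset described above is simply a filter on the median human labeling time per image; a small sketch with hypothetical timings:<br />

```python
from statistics import median

def fast_images(times_per_image, threshold=60.0):
    """Keep images whose median labeling time across humans is <= threshold seconds.

    times_per_image: dict mapping image id -> list of per-labeler times (seconds).
    """
    return [img for img, times in times_per_image.items()
            if median(times) <= threshold]

# Hypothetical per-labeler times for two images.
times = {
    "img_easy": [10, 12, 8, 15, 11],      # median 11 s: fast
    "img_hard": [300, 45, 620, 90, 200],  # median 200 s: slow
}
fast_images(times)  # ["img_easy"]
```

Using the median rather than the mean keeps a single very slow labeler from pushing an otherwise easy image out of the fast subset.<br />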
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics humans use to recognize objects (Geirhos et al., 2018). Other examples include adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and work illustrating the effects of foreground/background objects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label accuracy by up to 13.2%. The authors suggest that releasing the labelled data to the public will allow for more robust models in the future.<br />
<br />
=== ImageNet inconsistencies and label error ===<br />
Researchers have found and recorded some incorrectly labeled images in ImageNet and ImageNetV2 during this study. Earlier studies (Van Horn et al., 2015) also showed that at least 4% of the birds in ImageNet are misclassified. That work also noted that the inconsistent taxonomic structure of the bird classes can lead to weak class boundaries, and the researchers observed similar taxonomic issues in the majority of the fine-grained organism classes.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of study in this area. One focus is distributionally robust optimization (DRO), which seeks the model with the smallest worst-case expected error over a set of probability distributions. Another focus is finding the model with the lowest error rate on adversarial examples. Work in both areas has been productive, but none of it has been shown to resolve the drop in accuracy between ImageNet and ImageNetV2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words, whether the output of probabilistic deep learning models should or should not be trusted.<br />
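As a toy illustration of the DRO objective, the worst-case expected error of a fixed model over a small, explicit family of candidate distributions can be computed directly; the per-group error rates and mixture weights below are hypothetical:<br />

```python
def worst_case_error(group_errors, distributions):
    """Worst-case expected error of a fixed model over candidate distributions.

    group_errors: per-group error rates of the model.
    distributions: list of probability vectors over the same groups.
    """
    expected = [sum(w * e for w, e in zip(dist, group_errors))
                for dist in distributions]
    return max(expected)

# Model that errs 10% on objects and 30% on animals.
errors = [0.10, 0.30]
# Candidate test distributions: object-heavy, balanced, animal-heavy.
dists = [[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]]
worst_case_error(errors, dists)  # 0.26, attained by the animal-heavy distribution
```

DRO would train the model to minimize this maximum, rather than the average error under the single training distribution.<br />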
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
The researchers note that in order to achieve truly reliable machine learning, a deeper understanding is needed of the range of parameters over which a model remains robust; techniques from combinatorics and sensitivity analysis in particular might yield fruitful results. This study provides valuable insight into the desired robustness properties by comparing model performance to human performance. This is especially evident given the experimental results, which show humans drastically outperforming machine learning models in many cases and pose the question of how much accuracy one is willing to give up in exchange for efficiency. The results show that current performance benchmarks do not address robustness to small, natural distribution shifts, which humans handle easily.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating whether less-trained human labelers can achieve a similar level of robustness to distribution shifts. In addition, researchers can study robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Convolutional neural networks also remain a candidate for improving image classification accuracy.<br />
<br />
== Critiques ==<br />
# The method of using humans to classify ImageNet is fully circular, since the labels of ImageNet were themselves originally annotated by human beings; indeed, the classification scheme itself is intrinsically a human construction. It is not logical to test human performance against human performance, and this circular construction violates scientific principles.<br />
# Table 1 simply shows a difference in ImageNet multi-label accuracy but does not give an explicit reason why the difference is present. Although the paper suggests the distribution shift caused the difference, it does not rule out other factors to concretely establish that the distribution shift was the cause.<br />
# Among its recommendations for future machine evaluations, the paper proposes to "Report performances on dogs, other animals, and inanimate objects separately." Despite its good intentions, this recommendation is narrowly specific and requires further generalization to be convincing. <br />
# For the choice of human subjects, no information was given on how they were chosen, nor was any background information provided. Since this is a classification problem involving many classes as specific as species, a biology student would give far more accurate results than a computer science or math student. <br />
# In explaining the importance of the multi-label metric by comparison to the Top-5 metric, the turtle example falls within the overall similarity (synonymy) case of the multi-label evaluation metric; as such, if a Top-5 evaluation accepts any one of the turtle species being predicted, the algorithm is considered correct, which is the intention. The example therefore does not convey the necessity of moving from the Top-5 metric to the proposed one. <br />
# Given the paper's definition of the multi-label metric, it is hard to see how expanding the label set differs from a traditional Top-5 metric, or why it is necessary; hence the claim that the proposed metric is needed for rigorous accuracy evaluation on ImageNet does not follow.<br />
# When discussing the main results, the paper considers the hypothesis that distribution shift has no effect on human and machine accuracies; the presentation is poor at best, with no clear focus on what the authors are trying to convey or on how, in detail, they arrived at their claims.<br />
# In the experiment-setup part of the presentation, many key terms are used without a detailed description. For example: human labeler training used a subset of the remaining 30,000 unannotated images in the ImageNet validation set, and labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. The authors should clarify each key term in the presentation; otherwise readers will find it hard to follow.<br />
# It is not clear how the human samplers were determined; simply picking several people introduces high bias because the sample is too small and the participants' different backgrounds will certainly affect the results. It would also be better if there were more comparisons between the introduced model and other models.<br />
# Given the small number of human participants, it is hard to take the results seriously (there is too much variance). It is also not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy than its counterparts. For example, one of the issues they raise with top-5 accuracy is the five turtle classes that are difficult to distinguish, but it is not clear how multi-label accuracy would be better in this instance.<br />
# It is unclear how well the human labelers can perform labeling after training, so the final result is not that trustworthy.<br />
# In this experiment setup, the label annotators are the same as the participants of the experiment. Even with a break between annotating and the human labeler evaluation, the impact of the break in reducing bias is not clear. One potential source of human labeling data is Google's "I'm not a robot" verification test: one variation asks users to select all of the 9 images that are related to a certain keyword. This would allow a more accurate measurement of human performance versus ImageNet performance and reduce the bias from the small number of experiment participants.<br />
# Following Table 2, the authors appear to claim that the model is better than the human labelers simply because the model saw a larger increase in accuracy after removing the dog photos than the human labelers did; however, a quick look at the table shows that most human labelers still performed better than the best model. The authors should instead claim that the models are better at labeling dogs than the human labelers, who remain better overall after the dog images are removed.<br />
# The reason the human labelers outperform the CNNs could be that the humans had much more training. It would be more convincing if the paper provided a metric to measure the size of the human labelers' training data.<br />
# Actually, in the multi-label case, it is ambiguous whether the machine learning model or the human labelers gave the correct label. The structure of the dataset is essential to training a network, and data with uncertain labels (even when determined by humans) should be avoided.<br />
# The authors mention that untrained labelers will likely have lower accuracy; they could give a standard or definition of a well-trained labeler.<br />
# I believe the authors needed to include more information about how they determined the samples, such as the human samplers, and more details on how unclear images are defined.<br />
# It would be more convincing if the authors provided the criteria for choosing human samplers and for identifying unclear images, as well as the accuracy of the human labelers.<br />
# The summary only explains some model components but does not thoroughly go through the big picture of the model: data preprocessing, training, and prediction procedures. It would be nice to know those details as well.<br />
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top-1 and top-5 if ImageNet contained discernible classes with good labels. Of course, this is very expensive, and ImageNet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way the top-5 or top-1 metric does, since in the common case there is one right answer to a classification problem.<br />
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNNs, it could provide details of the model and explain why a CNN is a good candidate.<br />
# The accuracy of the model is directly correlated with how the images are labelled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels", where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image and then label the identified objects in order of importance according to some criterion (e.g., closer objects are labelled first). The authors also never specify what criteria the model uses to pick out which object it will label in the image.<br />
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.<br />
# The paper meaningfully and correctly points out that the current evaluation of ML algorithms, using only accuracy on ImageNet as the benchmark, is simplistic and problematic. However, the idea of comparing human performance with ML models is itself problematic, since it is hard or even impossible to control the variables that can drastically change human performance: training time, domain knowledge, cognitive function, workload, and various environmental factors. To compare different experimental methods, the most important step is to carefully control the confounding variables so as to reach any meaningful conclusion.<br />
# An underlying question is how ImageNet was created in the first place: the images could only have been labeled by human labelers, so there may be mistakes and missing details. A combination of human and machine evaluation might help resolve this problem.<br />
<br />
== Reference ==<br />
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.<br />
<br />
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</div>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is the most influential dataset in machine learning with images and corresponding labels over 1000 classes. This paper intends to explore the causes for performance differences between human experts and machine learning models, more specifically, CNN, on ImageNet. <br />
<br />
Firstly, some images could belong to multiple classes. As a result, it is possible to underestimate the performance if we assign each image with only one label, which is what is being done in the top-1 metric. On the other hand, the top-5 metric looks at the top five predictions by the model for an image and checks if the target label is within those five predictions (Krizhevsky, Sutskever, & Hinton). Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.<br />
<br />
Secondly, in contrast to the uniform performance of models in classes, humans tend to achieve better performances on inanimate objects. Human labelers achieve similar overall accuracies as the models, which indicates spaces of improvements on specific classes for machines.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the accuracy of multi-class prediction from models drops when the testing set is drawn from a different distribution than the training set, ImageNetV2. But this shift in distribution does not cause a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers A, B, and C provided multi-label annotations for a subset from the ImageNet validation set, and all images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset. <br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching the humans the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet and typical human participants will not have working knowledge of the names of each breed of dog seen even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To do this labelers were trained on class-specific tasks for groups like dogs, insects, monkeys beaver and others. They were also given immediate feedback on whether they were correct and then were asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling. This allowed the labelers to really understand the finer details of each class.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples, which contain 1,000 images from the 20,000 annotated images are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
With Top-1 accuracy being the standard accuracy measure used in classification studies, it measures the proportions of examples for which the predicted label matches the single target label. As many images often contain more than one object for classification, for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. With Figure 3b showing a centered prominent figure yet labeled otherwise (people vs picket fence), it can be seen how a single target label is inaccurate for such a task since identifying the main objects in the image does not suffice due to its overly stringent and punishes predictions that are the main image yet does not match its label.<br />
===Top-5 accuracy===<br />
With Top-5 considers a classification correct if the object label is in the top 5 predicted labels. Although it partially resolves the problem with Top-1 labeling, it is still not ideal since it can trivialize class distinctions. For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations.<br />
<br />
===Multi-label accuracy===<br />
The paper then proposes that for every image, the image shall have a set of target labels and a prediction; if such prediction matches one of the labels, it will be considered as correct labeling. Due to the above-discussed limitations of Top-1 and Top-5 metrics, the paper claims it is necessary for rigorous accuracy evaluation on the dataset. <br />
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy.<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.<br />
===The multi-label accuracy metric===<br />
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Because three of the participants took part in the initial round of annotation, they did not look at the data for six months afterward, and two additional annotators were introduced in the final evaluation phase to ensure the fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Three training regimens address these problems, respectively: first, labelers are assigned extra training tasks with immediate feedback on similar classes; second, labelers are given access to search for specific classes during labeling; finally, the training set contains a reasonable number of images for each class.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1,000 images, were drawn from ImageNet and ImageNetV2, respectively. They were sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images, spending a median of 26 seconds per image. After labeling was completed, an additional multi-label annotation session was conducted in which the human predictions for all images were manually reviewed. Compared to the initial round of labeling, 37% of the labels changed, owing to the participants' greater familiarity with the classes.<br />
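The class-balanced sampling described above can be sketched as follows; the function and the class pools are illustrative assumptions, not the authors' code:<br />

```python
import random

def class_balanced_sample(images_by_class, n_total, seed=0):
    """Draw the same number of images from every class, then shuffle."""
    rng = random.Random(seed)
    per_class = n_total // len(images_by_class)
    sample = []
    for images in images_by_class.values():
        # sample without replacement within each class
        sample.extend(rng.sample(images, min(per_class, len(images))))
    rng.shuffle(sample)
    return sample
```

In the experiment, one such sample would be drawn from each of the ImageNet validation set and ImageNetV2, and the two samples shuffled together before labeling.<br />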
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on ImageNet===<br />
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noting that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Statistical Significance of the Accuracy Differences===<br />
Based on the results shown in Figure 1, the confidence intervals of the four best human participants and the four best models overlap. However, McNemar's paired test yields a p-value of 0.037, rejecting the hypothesis that the FixResNeXt model and labeler E have the same accuracy on the ImageNet validation set. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, and E do not overlap with that of the best model on ImageNet-V2. With McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, the hypothesis that humans and machine models are equally robust to this distribution shift ought to be rejected.<br />
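McNemar's paired test compares two classifiers evaluated on the same images using only the discordant pairs. A minimal exact version, with hypothetical counts, might look like this:<br />

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar test on discordant pairs.

    b: images only classifier 1 labeled correctly
    c: images only classifier 2 labeled correctly
    Under H0 (equal accuracy), b ~ Binomial(b + c, 1/2).
    """
    n = b + c
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one,
    # doubled for a two-sided test and capped at 1.
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(2 * one_tail, 1.0)
```

For example, if one classifier alone got 8 images right and the other alone got 2, the two-sided p-value is about 0.11; the paper's much smaller p-values come from its larger evaluation sets.<br />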
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models; both contained dogs. Researchers also noted that the images that are difficult for models are mostly images of objects, while the images that are difficult for human labelers are exclusively images of animals.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent the models outperform human labelers, researchers computed the accuracies again after excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images for the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images, respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. The simulation used bootstrapping to estimate the change in accuracy when only the non-dog classes are used, and the simulated increases are smaller than those observed in the experiment. <br />
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
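The bootstrap check described above can be sketched as follows. The per-image correctness flags and dog indicators are hypothetical stand-ins for the paper's data, and the paper's exact resampling scheme may differ:<br />

```python
import random

def bootstrap_accuracy_gain(correct, is_dog, n_boot=1000, seed=0):
    """Bootstrap the gain in accuracy from dropping dog images.

    correct: per-image 0/1 flags for a single labeler or model
    is_dog:  per-image 0/1 flags marking the dog classes
    Returns one (non-dog accuracy - overall accuracy) value per resample.
    """
    rng = random.Random(seed)
    n = len(correct)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        overall = sum(correct[i] for i in idx) / n
        non_dog = [i for i in idx if not is_dog[i]]
        if non_dog:
            gains.append(sum(correct[i] for i in non_dog) / len(non_dog) - overall)
    return gains
```

Comparing the observed accuracy gain against the spread of these bootstrapped gains indicates whether the gain is larger than resampling noise alone would produce.<br />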
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there are 3.3% and 3.4% increases in mean accuracy for human labelers on the ImageNet and ImageNet V2 images, respectively. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
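Selecting the "fast images" subset amounts to a median filter over per-image labeling times. A sketch, assuming a simple mapping from each image to the seconds each labeler spent on it (the data structure is an assumption, not the paper's format):<br />

```python
from statistics import median

def fast_images(labeling_times, threshold=60.0):
    """Return the images whose median labeling time is at most `threshold` seconds.

    labeling_times: {image_id: [seconds spent by each labeler]}
    """
    return [img for img, times in labeling_times.items()
            if median(times) <= threshold]
```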
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond this by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although some of their findings differ, others are consistent with the results of (Russakovsky et al., 2015). For example, both experiments indicate that it takes human labelers around one minute to label an image, and the time distribution has a long tail due to the difficult images mentioned earlier.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many recent studies of humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying which characteristics humans use to recognize objects (Geirhos et al., 2018). Other examples include adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and work illustrating the effects of foreground/background objects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label accuracy by up to 13.2%. The authors suggest that releasing these labeled data to the public will allow for more robust models in the future.<br />
<br />
=== ImageNet inconsistencies and label error ===<br />
Researchers found and recorded some incorrectly labeled images in ImageNet and ImageNet V2 during this study. Earlier work (Van Horn et al., 2015) also showed that at least 4% of the birds in ImageNet are misclassified. That work further noted that the inconsistent taxonomic structure of the bird classes could lead to weak class boundaries, and the researchers observed similar taxonomic issues in the majority of the fine-grained organism classes.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of study in this area. One focus is distributionally robust optimization (DRO), which finds the model with the smallest worst-case expected error over a set of probability distributions. Another focus is finding the model with the lowest error rate on adversarial examples. Work in both areas has been productive, but none of it has been shown to resolve the drop in accuracy between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under distribution shift, in other words, whether the output of probabilistic deep learning models should or should not be trusted.<br />
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
Researchers noted that achieving truly reliable machine learning requires a deeper understanding of the range of parameters over which a model remains robust; techniques from combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study provides valuable insight into the desired robustness properties by comparing model performance to human performance. This is especially evident in the experimental results, which show humans drastically outperforming machine learning models in many cases and which raise the question of how much accuracy one is willing to give up in exchange for efficiency. The results show that current performance benchmarks do not address robustness to small, natural distribution shifts, which humans handle easily.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating whether less-trained human labelers can achieve a similar level of robustness to distribution shifts. In addition, researchers can study robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Further work on convolutional neural network architectures could also improve image classification accuracy.<br />
<br />
== Critiques ==<br />
# The method of using humans to classify ImageNet is fully circular, since the labels of ImageNet were themselves originally annotated by human beings; indeed, the classification scheme itself is intrinsically a human construction. It is not logical to test human performance against human performance, and this circular construction violates scientific principles.<br />
# Table 1 simply shows a difference in ImageNet multi-label accuracy but gives no explicit reason why such a difference is present. Although the paper suggests the distribution shift caused the difference, it does not rule out other factors or concretely explain why the distribution shift was the cause.<br />
# In its recommendations for future machine evaluations, the paper proposes to "report performances on dogs, other animals, and inanimate objects separately." Despite its intentions, this is narrowly specific and requires further generalization to be convincing. <br />
# In choosing human subjects as samplers, no information was given as to how they were chosen, nor was any background information provided. As this is a classification problem involving many classes as specific as species, a biology student would give far more accurate results than a computer science or math student. <br />
# In explaining the importance of the multi-label metric by comparison to the Top-5 metric, the turtle example falls within the overall-similarity treatment of the multi-label evaluation metric; as such, if any one of the turtle species appears in the Top-5 predictions, the algorithm is considered to produce a correct prediction, which is the intention. The example therefore does not convey the necessity of changing from the Top-5 metric to the proposed one. <br />
# Given the paper's definition of the multi-label metric, it is hard to see how expanding the label set differs from a traditional Top-5 metric, or why it is necessary; the definition therefore does not support the claim that the proposed metric is needed for rigorous accuracy evaluation on ImageNet.<br />
# When discussing the main results, the paper considers the hypothesis that distribution shift has no effect on human and machine model accuracies; the presentation is poor at best, with no clear focus on what the authors are trying to convey or on how, in detail, they arrived at their claims.<br />
# In the experiment setup portion of the presentation, many key terms lack detailed descriptions. For example: human labeler training using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, and labelers A, B, C, D, and E undergoing extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. The authors should clarify each key term in the presentation; otherwise readers are hard-pressed to follow.<br />
# It is not clear how the human samplers were determined; simply picking several people introduces very high bias, because the sample is too small and the participants have different backgrounds, which will certainly affect the results. It would also be better if there were more comparisons between the introduced model and other models.<br />
# Given the small number of human participants, it is hard to take the results seriously (there is too much variance). It is also not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy than its counterparts. For example, one issue with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it is not clear how multi-label accuracy would be better in this instance.<br />
# It is unclear how well the human labelers can perform after training, so the final result is not that trustworthy.<br />
# In this experiment setup, the label annotators are the same as the participants of the experiment. Even with a break between annotating and evaluating, the impact of the break in reducing bias is not clear. One potential source of human labeling data is Google's "I'm not a robot" verification test: one variation asks users to select, from nine images, all photos related to a certain keyword. This would allow a more accurate measurement of human versus ImageNet performance and would reduce the bias from the small number of experiment participants.<br />
# Following Table 2, the authors appear to claim that the model is better than the human labelers simply because the model saw a larger accuracy increase after the removal of dog photos than the human labelers did. However, a quick look at the table shows that most human labelers still performed better than the best model. The authors should instead claim that the model is better at labeling dogs than the human labelers, who remain better overall after the dog classes are removed.<br />
# The reason the human labelers outperform the CNNs could be that the humans had much more training. It would be more convincing if the paper provided a metric to measure the size of the human labelers' training data set.<br />
# In the multi-label case, it is actually vague to determine whether the machine learning model or the human labellers gave the correct label. The structure of the dataset is essential in training a network, and data with uncertain labels (even when determined by humans) should be avoided.<br />
# The authors mentioned that untrained labelers will likely have lower accuracy; they could give a standard or definition of a well-trained labeler.<br />
# I believe the authors needed to include more information about how they determined the samples, such as the human samplers, and also more details on how unclear images are defined.<br />
# It would be more convincing if the authors provided the criteria for selecting human samplers and for identifying unclear images, as well as the accuracy of the human labelers.<br />
# The summary only explains some model components and does not thoroughly go through the big picture of the model: data preprocessing, training, and prediction procedures. It would be nice to know these details as well.<br />
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top-1 and top-5 if ImageNet contained discernible classes with good labels. Of course, this is very expensive, and ImageNet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way that the top-5 or top-1 metric does, as in the common case there is one right answer for a classification problem.<br />
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNN, it could provide details of the model and why CNN is a good candidate.<br />
# The accuracy of the model is directly correlated with how the images are labeled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels", where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image and then, using some form of criteria, label the identified objects in order of importance (i.e., objects that are closer are labeled first). The authors also never specified what criteria the model uses to "pick out" which object it will label in the image.<br />
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.<br />
# The paper meaningfully and correctly points out that evaluating ML algorithms only by accuracy on ImageNet as the benchmark is simplistic and problematic. However, the idea of comparing human performance with ML models is itself problematic, since it is hard or even impossible to control the variables that can drastically change human performance: training time, domain knowledge, cognitive function, workload, and various environmental factors. To compare different experimental methods, the most important step is to carefully control the confounding variables in order to reach any meaningful conclusion.<br />
<br />
== Reference ==<br />
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.<br />
<br />
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</div>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is the most influential dataset in machine learning, containing images and corresponding labels over 1,000 classes. This paper explores the causes of the performance differences between human experts and machine learning models, specifically CNNs, on ImageNet. <br />
<br />
Firstly, some images could belong to multiple classes. As a result, it is possible to underestimate performance if we assign each image only one label, which is what the top-1 metric does. The top-5 metric, on the other hand, looks at the top five predictions made by the model for an image and checks whether the target label is among them (Krizhevsky, Sutskever, & Hinton). The study therefore adopts both top-1 and top-5 metrics; the performances of models, unlike those of human labelers, are linearly correlated under the two metrics.<br />
<br />
Secondly, in contrast to the roughly uniform performance of models across classes, humans tend to achieve better performance on inanimate objects. Human labelers achieve overall accuracies similar to the models, which indicates room for improvement for machines on specific classes.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the accuracy of multi-class prediction by models drops when the test set is drawn from a different distribution than the training set, such as ImageNetV2, but this shift in distribution does not pose a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers, A, B, and C, provided multi-label annotations for a subset of the ImageNet validation set and for all images in the ImageNetV2 test sets. This gave A, B, and C extensive experience with the ImageNet dataset. <br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching them the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet, and a typical human participant will not have working knowledge of the names of each breed of dog, even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To do this, labelers were trained on class-specific tasks for groups such as dogs, insects, monkeys, and beavers. They were also given immediate feedback on whether they were correct and were asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling, which allowed them to really understand the finer details of each class.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples, each containing 1,000 images, were drawn from the 20,000 annotated images of the ImageNet validation set and ImageNetV2. The five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
Top-1 accuracy, the standard accuracy measure used in classification studies, measures the proportion of examples for which the predicted label matches the single target label. Many images contain more than one object: for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more, while Figure 3b shows a prominent centered figure yet is labeled otherwise (people vs. picket fence). A single target label is therefore inadequate for this task: it is overly stringent and punishes predictions that identify a main object in the image but do not match the one chosen label.<br />
===Top-5 accuracy===<br />
Top-5 accuracy considers a classification correct if the target label is among the five predicted labels. Although this partially resolves the problem with Top-1, it is still not ideal, since it can trivialize class distinctions: for instance, the dataset contains five turtle classes that are difficult to distinguish under such an evaluation.<br />
<br />
===Multi-label accuracy===<br />
The paper then proposes that every image have a set of target labels, with a prediction considered correct if it matches any one of those labels. Given the limitations of the Top-1 and Top-5 metrics discussed above, the paper claims this is necessary for rigorous accuracy evaluation on the dataset. <br />
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy.<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.<br />
===The multi-label accuracy metric===<br />
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Since three participants participated in the initial round of annotation, they did not look at the data for six months, and two additional annotators are introduced in the final evaluation phase to ensure fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Thus, three training regimens are provided to address the problems listed above, respectively. First, labelers will be assigned extra training tasks with immediate feedbacks on similar classes. Second, labelers will be provided access to search for specific classes during labeling. Finally, the training set will contain a reasonable amount of images for each class.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1000 images, are sampled from ImageNet and ImageNetV2, respectively, They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images. They spent a median of 26 seconds per image. After labeling is completed, an additional multi-label annotation session was conducted, in which human predictions for all images are manually reviewed. Comparing to the initial round of labeling, 37% of the labels changes due to participants' greater familiarity with the classes.<br />
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
Based on the results shown in Figure 1, we can see that the confidence interval of the best 4 human participants and 4 best model overlap; however, with a p-value of 0.037 using the McNemar's paired test, it rejects the hypothesis that the FixResNeXt model and Human E labeler have the same accuracy with respect to the ImageNet validation dataset. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, E do not overlap with the confidence interval of the best model with respect to ImageNet-V2 and with the McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, it is clear that the hypothesis human and machined models have same robustness to model distribution shifts ought to be rejected.<br />
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand the extent to which models outperform human labelers, researchers computed the accuracies again after excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation used bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and the simulated increases were smaller than those observed in the experiment. <br />
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
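The subset accuracies above (dropping the dog classes, or keeping only the 590 object classes) amount to restricting the evaluation to images whose true class lies in a chosen set. A hypothetical helper, assuming boolean per-image correctness and integer class ids:<br />

```python
import numpy as np

def subset_accuracy(correct, class_ids, keep_classes):
    """Accuracy restricted to images whose true class is in keep_classes.

    correct:      boolean array, True where the prediction was accepted.
    class_ids:    integer array of true class indices, same length.
    keep_classes: set of class indices defining the subset
                  (e.g. all non-dog classes, or the object classes).
    """
    correct = np.asarray(correct, dtype=bool)
    class_ids = np.asarray(class_ids)
    mask = np.isin(class_ids, list(keep_classes))
    return correct[mask].mean()
```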
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took the subset of images whose median human labeling time was at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
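Selecting the "fast images" described above reduces to a median threshold over per-labeler times. A small sketch, assuming an `(n_images, n_labelers)` layout (my assumption, not from the paper):<br />

```python
import numpy as np

def fast_image_mask(times, threshold=60.0):
    """Select 'fast images': median human labeling time at most `threshold` s.

    times: array of shape (n_images, n_labelers), seconds spent per image.
    Returns a boolean mask over images.
    """
    return np.median(np.asarray(times, dtype=float), axis=1) <= threshold
```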
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics humans use to recognize objects (Geirhos et al., 2018). Other examples include adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and work illustrating foreground/background objects' effects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label accuracy by up to 13.2%. The authors suggest that releasing these labeled data to the public will allow for more robust models in the future.<br />
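The multi-label metric discussed above counts a top-1 prediction as correct whenever it falls in an image's reviewed set of acceptable labels. A minimal sketch (names are illustrative, not from the paper):<br />

```python
def multilabel_accuracy(predictions, correct_label_sets):
    """Multi-label accuracy: a top-1 prediction counts as correct if it
    falls anywhere in the image's reviewed set of acceptable labels.

    predictions:        list of predicted class ids, one per image.
    correct_label_sets: list of sets of acceptable class ids per image.
    """
    hits = sum(p in s for p, s in zip(predictions, correct_label_sets))
    return hits / len(predictions)
```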
<br />
=== ImageNet inconsistencies and label error ===<br />
Researchers found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. Earlier studies (Van Horn et al., 2015) also showed that at least 4% of the birds in ImageNet are misclassified. That work also noted that the inconsistent taxonomic structure in the bird classes could lead to weak class boundaries. Researchers further noted that the majority of the fine-grained organism classes had similar taxonomic issues.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.<br />
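When the distribution set in DRO is taken to be all mixtures of a fixed collection of groups, the worst-case expected error reduces to the largest per-group error. A toy sketch of that special case (this simplification is mine, not the general DRO formulation):<br />

```python
import numpy as np

def worst_group_error(correct, group_ids):
    """Worst-case expected error over all mixtures of a fixed set of groups.

    For this family of distributions, the adversary puts all its weight on
    the group with the highest error, so the worst case is the max
    per-group error rate.
    """
    correct = np.asarray(correct, dtype=float)
    group_ids = np.asarray(group_ids)
    errors = [1.0 - correct[group_ids == g].mean()
              for g in np.unique(group_ids)]
    return max(errors)
```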
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
Researchers noted that in order to achieve truly reliable machine learning, a deeper understanding is needed of the range of conditions under which models remain robust. Techniques from combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the experimental results, which show humans drastically outperforming machine learning in many cases and raise the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks do not address robustness to small, natural distribution shifts, which are easily handled by humans.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating whether less-trained human labelers can achieve a similar level of robustness to distribution shifts. In addition, researchers can study robustness to temporal changes, which are another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Also, convolutional neural networks remain a candidate for improving image classification accuracy.<br />
<br />
== Critiques ==<br />
# The method of using humans to classify ImageNet is fully circular, since the labels of ImageNet were themselves originally annotated by human beings. In fact, the classification scheme itself is intrinsically a human construction. It is not logical to test human performance against human performance. This circular construction violates scientific principles.<br />
# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.<br />
# As a recommendation for future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately." Despite its intentions, this recommendation is narrowly specific and requires further generalization to be convincing. <br />
# In choosing human subjects, no information was given as to how they were selected, nor was any background information provided. As this is a classification problem involving many species-level classes, a biology student would likely give far more accurate results than a computer science or math student. <br />
# In explaining the importance of the multi-label metric by comparison to the top-5 metric, the turtle example falls within the overall-similarity classification of the multi-label evaluation metric; as such, if any one of the turtle species is selected under top-5 evaluation, the algorithm is considered to produce a correct prediction, which is the intention. The example therefore does not convey the necessity of changing from the top-5 metric to the proposed metric. <br />
# Given the paper's definition of the multi-label metric, it is hard to see why expanding the label set is different from a traditional top-5 metric, or why it is necessary; the paper therefore does not support the claim that the proposed metric is needed for rigorous accuracy evaluation on ImageNet.<br />
# When discussing the main results, the paper examines the hypothesis that distribution shift has no effect on human and machine model accuracies; the presentation is poor at best, with no clear focus on what the authors are trying to convey or on how, in detail, they arrived at such claims.<br />
# In the experiment setup of the presentation, there are many key terms without detailed descriptions. For example, in the human labeler training that used a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. The authors should clarify each key term in the presentation; otherwise, readers find it hard to follow.<br />
# It is not clear how the human samplers were determined, and simply picking several people introduces high bias because the sample is too small and the participants' different backgrounds will strongly affect the results. Also, it would be better if there were more comparisons between the introduced model and other models.<br />
# Given the low amount of human participants, it is hard to take the results seriously (there is too much variance). Also it's not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy compared to its counterparts. For example, one of the issues with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it's not clear how multi-label accuracy would be better in this instance.<br />
# It is unclear how well the human labelers can perform labeling after training, so the final result is not that trustworthy.<br />
# In this experiment setup, the label annotators are the same as the participants of the experiments. Even if there is a break between annotating and evaluating, the impact of the break in reducing bias is not clear. One potential source of human labeling data is Google's "I'm not a robot" verification test. One variation of the verification test asks users to select, from 9 images, all the photos related to a certain keyword. This would allow a more accurate measurement of human performance versus ImageNet performance. In addition, it would reduce the biases from the small number of experiment participants.<br />
# Following Table 2, the authors appear to claim that the model is better than the human labelers simply because the model saw a larger accuracy increase after removing the dog photos than the human labelers did; however, a quick look at the table shows that most human labelers still performed better than the best model. The authors should instead claim that the model is better at labeling dogs than the human labelers, who remain better overall after removing the dog classes.<br />
# The reason the human labelers outperform the CNNs could be that the humans had much more training. It would be more convincing if the paper provided a metric to measure the size of the human labelers' training set.<br />
# Actually, in the multi-label case, it is vague whether the machine learning model or the human labellers gave the correct label. The structure of the dataset is essential in training a network, and data with uncertain labels (even when determined by humans) should be avoided.<br />
# The authors mentioned that untrained labelers will likely have lower accuracy; they should give a standard or definition of a well-trained labeler.<br />
# I believe the authors needed to include more information about how they determined the samples, such as the human samplers, and more details on how unclear images are defined.<br />
# It would be more convincing if the authors provided the criteria for selecting human samplers and unclear images, and the accuracy of the human labelers.<br />
# The summary only explains some model components but does not thoroughly go through the big picture of the model: the data preprocessing, training, and prediction procedures. It would be nice to know these details as well.<br />
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top 1 and top 5 if Imagenet contained discernable classes with good labels. Of course, this is very expensive, and imagenet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way that the top 5 or top 1 metric does, as in the common case, there is one right answer for a classification problem.<br />
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNN, it could provide details of the model and why CNN is a good candidate.<br />
# The accuracy of the model is directly correlated with how the images are labelled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels" where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image first and then by using some form of criteria, it labels those identified objects in order of importance (i.e. objects that are closer are labelled first). The authors also never specified what criteria the model uses to "pick out" which object it will label in the image.<br />
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.<br />
# The paper meaningfully and correctly points out that the current evaluation of ML algorithms, using only accuracy on ImageNet as the benchmark, is simplistic and problematic. However, the idea of comparing human performance with ML models is itself problematic, since it is hard or even impossible to control the various variables that can drastically change human performance: training time, domain knowledge, cognitive function, amount of workload, and various environmental factors. In order to compare different experimental methods, the most important step is to carefully control the confounding variables to reach any meaningful conclusion.<br />
<br />
== Reference ==<br />
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.<br />
<br />
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</div>
<hr />
<div>'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to improved prediction ability.<br />
<br />
== Previous Work ==<br />
Previous autonomous vehicle trajectory approaches used motion models such as Constant Velocity and Constant Acceleration. These models are linear and can only handle straight motions. There are also curvilinear models, such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration, which handle rotations and more complex motions. Together with these models, a Kalman filter is used to predict the vehicle trajectory. Kalman filtering is a common technique in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the Kalman filter performs poorly on multi-step prediction problems, where recurrent neural networks perform significantly better. <br />
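A constant-velocity model with Kalman prediction can be sketched as follows; the multi-step loop shows how the covariance, and hence the prediction uncertainty, grows with the horizon. The process-noise form here is a common textbook simplification, not taken from the paper.<br />

```python
import numpy as np

def cv_kalman_predict(x, P, dt, q=1.0, steps=1):
    """Multi-step prediction with a constant-velocity (CV) motion model.

    State x = [px, py, vx, vy]; P is its 4x4 covariance. Each step applies
    the linear CV transition and inflates the covariance with process
    noise, so uncertainty grows with the prediction horizon -- one reason
    multi-step CV prediction degrades at intersections.
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    # Simple diagonal process noise (a simplification of the usual
    # white-noise-acceleration model, which also has cross terms).
    Q = q * np.diag([dt**3 / 3, dt**3 / 3, dt, dt])
    for _ in range(steps):
        x = F @ x
        P = F @ P @ F.T + Q
    return x, P
```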
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers' intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver's intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories of motion prediction models: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, considering only the states of predicted vehicles kinematically. The advantage is that they have the lowest computational burden among the three types; however, they cannot consider interactions between vehicles. Maneuver-based models consider and classify the driver's intentions. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve prediction accuracy. However, maneuver-based models still serve mainly to complement physics-based models. <br />
<br />
A recurrent neural network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of detected vehicles simultaneously as a scene, but their prediction algorithms are more computationally complex and are therefore often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create, as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Existing research results indicate that little work has been dedicated to predicting trajectories at intersections. Moreover, public data sets for analyzing driver behaviour at intersections are scarce, and such data are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, so it is necessary to design a motion predictor that can be used in real-time traffic.<br />
<br />
<center><br />
[[ File:intersection.png |300px]]<br />
</center><br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The figure below depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm, relying on six sensors, to estimate the states of surrounding vehicles. The output predicts the states of the surrounding vehicles and is used to determine the desired longitudinal acceleration in actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Sensor Outputs ===<br />
<br />
The inputs to the target perception come from the sensor outputs. The data collected in this paper use six different sensors with feature fusion to detect traffic in a range of up to 100 m: 1) a LiDAR system outputs relative position, heading, velocity, and box size in local coordinates; 2) an Around-View Monitoring (AVM) system and 3) GPS acquire lanes, road markers, and global position; 4) a gateway engine outputs precise global position in the urban road environment; and 5) a Micro-Autobox II and 6) an MDPS are used to control and actuate the subject vehicle. All data are stored on an industrial PC.<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The dataset was collected using a human-driven autonomous vehicle (AV) equipped with sensors to track the motion of the vehicle's surroundings. In addition to the motion sensors, a front camera, an Around-View Monitor, and GPS were used to acquire the lanes, road markers, and global position. The data were collected on the urban roads of Gwanak-gu, Seoul, South Korea. The training model is generated from 484 tracks collected when driving through intersections in real traffic, from which the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training samples and 4,998 evaluation samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement, which is the sequential previous motion. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network that is suitable for use with sequential data because it has recurrent connections on its hidden nodes and thus, can retain its state or memory while processing the next input or sequence of inputs. For this reason, RNNs can be used to analyze time-series data where the pattern of the data depends on the time flow. This is an impossible task for traditional artificial neural networks, which assume the inputs are independent of one another. RNNs can also contain feedback loops that allow activations to flow alternately in the loop. <br />
<br />
In line with traditional neural networks, RNNs still suffer from the problem of vanishing gradients. An LSTM avoids this by letting errors flow backward without a limit on the number of virtual layers. This property prevents errors from exploding or vanishing over time, either of which would make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, consisting of a combination of four input sets and 18 network configurations.<br />
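The gating that gives the LSTM this property can be written out directly. Below is a minimal NumPy forward step; the stacked weight layout and names are illustrative, and the paper's actual layer sizes come from its architecture search, not from this sketch.<br />

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step: a minimal sketch of the gating that lets error
    gradients flow over long horizons.

    x: input (n_in,); h: hidden state (n_h,); c: cell state (n_h,).
    W: (4*n_h, n_in), U: (4*n_h, n_h), b: (4*n_h,) hold the stacked
    input/forget/cell/output gate parameters.
    """
    n_h = h.shape[0]
    z = W @ x + U @ h + b
    i = 1 / (1 + np.exp(-z[0 * n_h:1 * n_h]))   # input gate
    f = 1 / (1 + np.exp(-z[1 * n_h:2 * n_h]))   # forget gate
    g = np.tanh(z[2 * n_h:3 * n_h])             # candidate cell update
    o = 1 / (1 + np.exp(-z[3 * n_h:4 * n_h]))   # output gate
    c_new = f * c + i * g                       # additive cell update
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

The additive update of the cell state <math>c</math> is what keeps gradients from vanishing or exploding across many time steps.<br />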
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
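The encoder/decoder pair described above is ordinary z-score normalization with shared statistics. A small sketch (the class and method names are my own):<br />

```python
import numpy as np

class ZScoreCodec:
    """Encoder/decoder sketch: z-score normalize inputs, then denormalize
    outputs with the same statistics, as described for the data
    encoder/decoder."""

    def fit(self, X):                       # X: (n_samples, n_features)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def encode(self, X):                    # rescale to mean 0, std 1
        return (X - self.mean_) / self.std_

    def decode(self, Z):                    # scale back to actual units
        return Z * self.std_ + self.mean_
```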
==== Sequence length ==== <br />
The sequence length of the RNN input and output is another important factor for improving prediction performance. In this study, sequence lengths of 5, 10, 15, 20, 25, and 30 steps at a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results, even among the candidates with very short observation times.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of the travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of the reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for the states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
where <math>u_{min}</math>, <math>u_{max}</math>, and S are the minimum control input, the maximum control input, and the maximum slew rate of the input, respectively.<br />
<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
where <math>v_{x, limit}</math> is the speed limit applied to the target vehicle.<br />
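Given candidate state and input sequences, the cost <math>J</math> defined above can be evaluated directly. A NumPy sketch (the array shapes and the scalar-input simplification are my assumptions, not from the paper):<br />

```python
import numpy as np

def mpc_cost(x, x_ref, u, Q, R, R_du):
    """Evaluate the MPC cost J for candidate state/input sequences.

    x, x_ref: (Np+1, n_x) state and reference trajectories (k = 1..Np used);
    u:        (Np,) longitudinal acceleration commands (k = 0..Np-1);
    Q:        (n_x, n_x) state weight; R, R_du: scalar input weights.
    """
    e = x[1:] - x_ref[1:]                          # tracking error, k = 1..Np
    state_cost = np.einsum('ki,ij,kj->', e, Q, e)  # sum of e^T Q e terms
    input_cost = R * np.sum(u**2)                  # penalize large inputs
    slew_cost = R_du * np.sum(np.diff(u)**2)       # penalize input jumps
    return state_cost + input_cost + slew_cost
```

The slew term is what produces the smooth, human-like acceleration commands the planner is tuned for.<br />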
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare these algorithms using four types of errors: the <math>x</math> position error <math>e_{x,T_p}</math>, the <math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes the prediction time. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
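The mean, standard deviation, and RMSE reported for these errors can be computed from an error sequence in one place; a small helper (illustrative, not from the paper):<br />

```python
import numpy as np

def error_stats(pred, actual):
    """Mean, standard deviation, and RMSE of a prediction error sequence,
    the three summary statistics used to compare the predictors."""
    e = np.asarray(actual, dtype=float) - np.asarray(pred, dtype=float)
    return e.mean(), e.std(), np.sqrt(np.mean(e**2))
```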
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly smaller prediction errors than the base algorithms in terms of mean, standard deviation (STD), and root mean square error (RMSE). Meanwhile, the proposed model exhibits a bell-shaped error distribution with a mean close to zero, which indicates that the proposed algorithm's predictions of human drivers' intentions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within reasonable levels. For instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data are analyzed from two perspectives: the time to recognize the in-lane target and the similarity to human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base algorithm does. Moreover, the proposed algorithm recognized targets later than the base algorithm only when the surrounding target vehicles first appeared beyond the sensors' region-of-interest boundaries. This means that these cases took place sufficiently beyond the safety distance and had little influence on determining the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarity between the results from the proposed algorithm and human driving decisions, <br />
this article introduced another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human driver’s acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed results more similar to human drivers’ decisions than the base <br />
algorithm: <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
has only a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
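The similarity check described above can be sketched as follows; the function and the sample arrays are illustrative assumptions, not the paper's data or code:<br />

```python
import numpy as np

def acceleration_similarity(a_human, a_cmd, band=1.0):
    """Fraction of acceleration errors a_human - a_cmd lying within +/- band (m/s^2)."""
    err = np.asarray(a_human) - np.asarray(a_cmd)
    return np.mean(np.abs(err) <= band)

# Illustrative values only (not the paper's measurements):
a_human = np.array([0.2, -0.5, 1.4, 0.0, -1.8])
a_cmd   = np.array([0.1, -0.2, 0.9, 0.3, -0.4])
print(acceleration_similarity(a_human, a_cmd))  # fraction within +/- 1 m/s^2
```

With real logs, `a_human` and `a_cmd` would be the recorded driver acceleration history and the planner's commands sampled on the same time grid.<br />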
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed and its application in an autonomous vehicle was evaluated within a model predictive control (MPC) framework. The model was trained using data captured on urban roads in Seoul. The evaluation results showed precise prediction accuracy, so the algorithm is safe to apply on an autonomous vehicle. Also, the comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unstructured environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
This paper has identified several avenues for future research, which include:<br />
<br />
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2. Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3. Extending the target road of the trajectory predictor to, for example, roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4. Learning the behavior of surrounding vehicles in real time while automated vehicles drive in real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTMs, RNNs, and studies on different types of roads. Why the LSTM-RNN is used, and the background of the method, are not stated clearly. Key concepts are missing, so it is difficult to distinguish between the LSTM-RNN-based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. It is a major topic for some famous vehicle companies such as Tesla, which already has a well-known service called Autopilot for self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers an overall view of what the model looks like. Since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be great. It would also be interesting to discuss more applications of this method, such as airplanes and boats.<br />
<br />
Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance against other traditional motion planning algorithms, like the Kalman filter (KF).<br />
<br />
Some papers have discussed the accuracy of different models for vehicle prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf]. There, the LSTM did not show good performance. The authors increased the accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of the LSTM-RNN, discuss predictions on the training and test sets, and compare it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is closely tied to autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary were provided to aid understanding, such as the track of the car's movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data, which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to evaluate the successes of this paper against state-of-the-art reinforcement learning techniques.<br />
<br />
It would be better if the authors compared their method against other SOTA methods. Also, one of the reasons motion planning is done using interpretable methods rather than black boxes (such as this model) is that it is hard to see where things go wrong and fix problems with the black box when they occur - this is something the authors should have also discussed.<br />
<br />
A future area of study is to combine other sources of information, such as signals from Lidar or car side cameras, to make a better prediction model.<br />
<br />
It might be interesting and helpful to conduct some training and testing under different weather/environmental conditions, as it could provide more generalization to real-life driving scenarios. For example, foggy weather and evening (low light) conditions might affect the performance of sensors, and rainy weather might require a longer braking distance.<br />
<br />
This paper proposes an interesting, novel motion prediction algorithm using an LSTM-RNN. However, since motion prediction in autonomous driving has great real-life impacts, I do believe that the evaluations of the algorithm should be more thorough. For example, more traditional motion planning algorithms, such as multi-modal estimation and Kalman filters, should be used as benchmarks. Moreover, the experimental results are based on Korean driving conditions only. Eastern and Western drivers can have very different driving patterns, so that should be addressed in the discussion section of the paper as well.<br />
<br />
The paper mentions that in the future, this research plans to learn the real-life behaviour of surrounding vehicles while automated vehicles drive. Seeing a possible improvement in road safety due to this research will be very interesting.<br />
<br />
This predictor could also be applied in traffic control systems.<br />
<br />
This prediction model should consider the various conditions that can occur at an intersection. Normal prediction may not work when there is a traffic jam or during crowded periods like rush hour.<br />
<br />
It would be better if the authors provided more comparisons between the LSTM-RNN algorithm and other traditional algorithms, such as a plain RNN or a plain LSTM.<br />
<br />
The paper has really good results for what it aimed to achieve. However, for future work it would also be nice to have various climates/weathers included in the Seoul dataset. This is important to consider, as different climates/weather (such as snowy roads or rain) would introduce noisier data (from the camera's image processing), and human drivers' behaviour would change as well to adapt to the new environment.<br />
<br />
It would be good to have a future work section that discusses the shortcomings of current algorithms and possible improvements.<br />
<br />
The summary explains the whole process well but is missing small details among the steps. It would be better to explain concepts such as RNNs and the modelling procedure for first-time readers.<br />
<br />
This paper presents a nice method, but it does not seem particularly well developed. I would have liked to see some more ablations on this particular choice of RNN, as there are more efficient variants, such as the GRU, which show similar performance on other tasks while being more amenable to real-time inference. Furthermore, the multi-modal aspect seems slightly ad hoc; it would have been nice to see a more rigorous formulation similar to that seen in recent work by Zeng et al. from Uber ATG: https://arxiv.org/pdf/2008.06041.pdf.<br />
<br />
The data used for this paper contains driver information exclusively from the urban roads of Gwanak-gu, Seoul, so the data may contain an inherent bias, as drivers around the rest of the country, let alone the rest of the world, will have different habits based on different environments. It would be interesting to see whether this model can be applied to other cities around the world and exhibit similar results, or whether it would need to be tuned based on geographic location.<br />
<br />
Since the data is based on urban roads, it would be better to include details on the performance of the model in high-traffic versus low-traffic urban areas. It would also be interesting to see the performance of the model with many pedestrians.<br />
<br />
While it would be nice to read more on why the authors chose LSTM-RNN, the paper exhibits a potential way to improve autonomous vehicle performance. It would be interesting to see how an army of robots would behave when this paper's method is applied in robotics, since robots' motions also follow a trajectory.<br />
<br />
An interesting topic, but the paper and accompanying summary are missing some details that would improve understandability. With respect to the background of the topic, a more detailed explanation of trajectories in the case of driving would help to better motivate the research. The addition of benchmarks and comparisons to current industry standards (if they are published publicly) would help to contextualize the results of the LSTM-RNN. An area of further study is applying these techniques to different weather situations and driving patterns in different countries. How does this model perform in regions where drivers very loosely follow the laws of the road? Further, could the research be generalized for controlled and uncontrolled turn lanes, especially on roads with higher speed limits?<br />
<br />
The LSTM has long been a very popular model in natural language processing, so it is interesting to see how it is used for motion prediction. Although the results are not good enough yet, it is still a good start. In the future, more language models, such as the BiLSTM, Transformer, and BERT, could be compared to see how well they perform.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6] Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=49589Surround Vehicle Motion Prediction2020-12-06T22:55:51Z<p>Y2587wan: /* Case study of a multi-lane left turn scenario */</p>
<hr />
<div>'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage the complex vehicle motions at multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to improved prediction ability.<br />
<br />
== Previous Work ==<br />
Previous approaches to autonomous vehicle trajectory prediction used motion models like Constant Velocity (CV) and Constant Acceleration (CA). These models are linear and can only handle straight motions. There are also curvilinear models, such as Constant Turn Rate and Velocity (CTRV) and Constant Turn Rate and Acceleration (CTRA), which handle rotations and more complex motions. Together with these models, the Kalman filter is used to predict the vehicle trajectory. Kalman filtering is a common technique used in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the Kalman filter performs poorly on multi-step prediction problems; a Recurrent Neural Network performs significantly better. <br />
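As a rough sketch of the simplest baseline idea above (not the paper's implementation), a constant-velocity model simply extrapolates the current state over the prediction horizon; a curvilinear model like CTRV would additionally rotate the velocity by the turn rate at each step:<br />

```python
import numpy as np

def predict_cv(x, y, vx, vy, dt=0.1, steps=15):
    """Constant-velocity extrapolation: returns predicted (x, y) positions per step."""
    t = dt * np.arange(1, steps + 1)          # 0.1 s, 0.2 s, ..., 1.5 s
    return np.stack([x + vx * t, y + vy * t], axis=1)

# Illustrative state: vehicle at the origin moving 10 m/s along +x
traj = predict_cv(x=0.0, y=0.0, vx=10.0, vy=0.0)
print(traj[-1])  # position after 1.5 s
```

Such a model has no notion of lane geometry or driver intent, which is why it degrades quickly at turn intersections.<br />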
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories of motion prediction models: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, considering only the kinematic states of the predicted vehicles. Their advantage is that they have the smallest computational burden of the three types; however, they cannot consider interactions between vehicles. Maneuver-based models consider and classify drivers’ intentions. By predicting the driver's maneuver, the future trajectory can be predicted. Identifying similar driving behaviors makes it possible to infer different drivers' intentions, which is stated to improve prediction accuracy; however, such models still mainly serve to assist physics-based models. <br />
<br />
A Recurrent Neural Network (RNN) is the type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of detected vehicles simultaneously as a scene. However, such prediction algorithms are computationally complex and are therefore often restricted to offline simulations. As Schulz et al. indicate, interaction models are very difficult to create, as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that little work has been dedicated to predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behaviour at intersections are insufficient, and such data sets are not easy to collect. A model is needed to predict the various movements of targets around a multi-lane turn intersection, and it is necessary to design a motion predictor that can be used in real-time traffic.<br />
<br />
<center><br />
[[ File:intersection.png |300px]]<br />
</center><br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six sensors. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in actual traffic at the intersection. The following image depicts the architecture of the surrounding target trajectory predictor.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Sensor Outputs ===<br />
<br />
The input to the target perception comes from the outputs of the sensors. The data collected in this article uses 6 different sensors with feature fusion to detect traffic in a range of up to 100 m: 1) LiDAR system outputs: relative position, heading, velocity, and box size in local coordinates; 2) Around-View Monitoring (AVM) and 3) GPS outputs: lanes, road markers, and global position; 4) gateway engine outputs: precise global position in the urban road environment; 5) a Micro-Autobox II and 6) an MDPS are used to control and actuate the subject vehicle. All data are stored on an industrial PC.<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The dataset was collected using a human-driven Autonomous Vehicle (AV) that was equipped with sensors to track the motion of its surroundings. In addition, a front camera, an Around-View Monitor, and GPS were used to acquire the lanes, road markers, and global position. The data was collected on the urban roads of Gwanak-gu, Seoul, South Korea. The training model is generated from 484 tracks collected while driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, including 11,662 training samples and 4,998 evaluation samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their sequential previous motion. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on the autonomous vehicle, as shown in the figure below. A contribution of the network architecture in this study is that the future states of the target vehicle within the field of view are predicted from these input features. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
An RNN is an artificial neural network that is suitable for use with sequential data because it has recurrent connections on its hidden nodes and thus can retain its state or memory while processing the next input or sequence of inputs. For this reason, RNNs can be used to analyze time-series data where the pattern of the data depends on the time flow. This is an impossible task for traditional artificial neural networks, which assume the inputs are independent of one another. RNNs can also contain feedback loops that allow activations to flow alternately in the loop. <br />
<br />
Like traditional neural networks, RNNs suffer from the problem of vanishing gradients. An LSTM avoids this by letting errors flow backward without a limit on the number of virtual layers; this property prevents errors from exploding or vanishing over time, which would make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, consisting of combinations of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
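To illustrate the gating mechanism that gives the LSTM its bounded, additive cell update, here is a minimal single-cell forward pass in NumPy; the weights are random and the sizes are arbitrary, so this is a generic sketch, not the trained network from the paper:<br />

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step: input (i), forget (f), output (o) gates and candidate (g)."""
    n = h.size
    z = W @ x + U @ h + b                      # stacked pre-activations, shape (4n,)
    i = 1 / (1 + np.exp(-z[0*n:1*n]))          # input gate
    f = 1 / (1 + np.exp(-z[1*n:2*n]))          # forget gate
    o = 1 / (1 + np.exp(-z[2*n:3*n]))          # output gate
    g = np.tanh(z[3*n:4*n])                    # candidate cell state
    c_new = f * c + i * g                      # additive update: the "gradient highway"
    h_new = o * np.tanh(c_new)                 # hidden state, bounded in (-1, 1)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 5, 8                             # e.g. 5 input features, 8 hidden units (arbitrary)
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(15, n_in)):          # process a 15-step input sequence
    h, c = lstm_cell(x, h, c, W, U, b)
print(h.shape)  # (8,)
```

Because the cell state is updated additively through the forget gate rather than repeatedly squashed, gradients can persist across many timesteps.<br />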
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to a moving AV, the speed of the data-collection vehicle is added to the input sequence. The input sequence consists of the relative X/Y positions, relative heading angle, and speed of the surrounding target vehicles, plus the speed of the data-collection vehicle. The output sequence contains the same features: relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data, rescaling it to mean 0 and standard deviation 1, while the decoder denormalizes the output data using the same parameters as the encoder to scale it back to actual units. <br />
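A minimal sketch of such an encoder/decoder pair; the two-feature example data is hypothetical:<br />

```python
import numpy as np

class Standardizer:
    """Encoder/decoder pair: z-score normalize inputs, denormalize outputs."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0)
        return self

    def encode(self, X):                 # rescale each feature to mean 0, std 1
        return (X - self.mean) / self.std

    def decode(self, Z):                 # map network output back to physical units
        return Z * self.std + self.mean

# Hypothetical two-feature data, e.g. [relative x (m), speed (m/s)]
X = np.array([[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]])
enc = Standardizer().fit(X)
Z = enc.encode(X)
print(Z.mean(axis=0), Z.std(axis=0))     # ~[0, 0] and ~[1, 1]
print(enc.decode(Z))                     # recovers the original X
```

The key detail from the text is that the decoder must reuse the encoder's fitted mean and standard deviation, not refit them on the network output.<br />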
==== Sequence length ==== <br />
The sequence length of the RNN input and output is another important factor in prediction performance. In this study, sequence lengths of 5, 10, 15, 20, 25, and 30 steps at a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results even though the observation time among the candidates is short.<br />
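Slicing tracks into 15-step input/output windows at a 100-ms sampling time can be sketched as follows; the five-feature layout is assumed from the input-feature description above:<br />

```python
import numpy as np

def make_windows(track, seq_len=15):
    """Slice a track of per-100-ms feature rows into overlapping input/target windows."""
    X, Y = [], []
    for t in range(len(track) - 2 * seq_len + 1):
        X.append(track[t : t + seq_len])                 # past 1.5 s as input
        Y.append(track[t + seq_len : t + 2 * seq_len])   # next 1.5 s as target
    return np.array(X), np.array(Y)

# 60 timesteps (6 s) of 5 assumed features: [rel_x, rel_y, rel_heading, v_target, v_ego]
track = np.zeros((60, 5))
X, Y = make_windows(track)
print(X.shape, Y.shape)  # (31, 15, 5) (31, 15, 5)
```

Each recorded track therefore yields many overlapping training samples, which is consistent with 484 tracks producing 16,660 samples.<br />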
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t))^T Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of the travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of the reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for the states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles. <br />
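Evaluating the cost <math>J</math> for a candidate input sequence is a direct transcription of the three terms above; the weights and sequences here are illustrative, not the paper's tuned values:<br />

```python
import numpy as np

def mpc_cost(x, x_ref, u, Q, R, R_du):
    """Quadratic MPC cost: state tracking + input effort + input-rate penalty."""
    e = x - x_ref                                         # (N_p, n_x) state errors
    state_term = np.sum(np.einsum('ki,ij,kj->k', e, Q, e))  # sum of e_k^T Q e_k
    input_term = R * np.sum(u ** 2)                       # R * sum of u_k^2
    rate_term = R_du * np.sum(np.diff(u) ** 2)            # penalizes input changes
    return state_term + input_term + rate_term

Np = 15
x = np.zeros((Np, 2))                # states: [travel distance, longitudinal velocity]
x_ref = np.ones((Np, 2))             # illustrative reference
u = np.full(Np, 0.5)                 # constant 0.5 m/s^2 acceleration commands
J = mpc_cost(x, x_ref, u, Q=np.eye(2), R=0.1, R_du=1.0)
print(J)
```

A real MPC solver would minimize this cost over `u` subject to the input and state constraints given next, rather than just evaluating it.<br />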
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&u_{min} \leq u(k|t) \leq u_{max} \\<br />
&||u(k+1|t) - u(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
where <math>u_{min}</math>, <math>u_{max}</math>, and <math>S</math> are the minimum/maximum control input and the maximum slew rate of the input, respectively.<br />
<br />
The position and speed boundaries are determined based on the predicted states:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
where <math>v_{x, limit}</math> is the speed limit of the road and <math>c_{des}(k|t)</math> is the desired clearance to the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms: a path-following model with <br />
constant velocity, a path-following model with traffic flow, and a CTRV model.<br />
<br />
The algorithms are compared according to four types of errors: the <math>x</math> position error <math>e_{x,T_p}</math>, <br />
the <math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes the prediction time. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
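These error definitions can be computed directly; wrapping the heading difference into <math>(-\pi, \pi]</math> avoids spurious large errors across the angle boundary. The state layout below is assumed for illustration:<br />

```python
import numpy as np

def trajectory_errors(true, pred):
    """Per-horizon errors e_x, e_y, e_theta, e_v between true and predicted states.

    Rows of `true`/`pred` are assumed to be [x, y, heading (rad), speed]."""
    e = np.asarray(true) - np.asarray(pred)
    # wrap heading error so that e.g. 0.1 rad vs 6.2 rad gives a small error
    e[:, 2] = (e[:, 2] + np.pi) % (2 * np.pi) - np.pi
    return e

# Illustrative single-horizon example
true = np.array([[10.0, 3.5, 0.1, 12.0]])
pred = np.array([[ 9.5, 3.0, 6.2, 11.0]])
e = trajectory_errors(true, pred)
print(e)  # heading error wrapped to ~0.183 rad instead of -6.1
```

Summarizing such errors over the evaluation set (mean, STD, RMSE) gives the statistics reported in the comparison figure.<br />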
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
<br />
This paper proposes an interesting, novel model prediction algorithm, using LSTM_RNN. However, since motion prediction in autonomous driving has great real-life impacts, I do believe that the evaluations of the algorithm should be more thorough. For example, more traditional motion planning algorithms such as multi-modal estimation and Kalman filters should be used as benchmarks. Moreover, the experiment results are based on Korean driving conditions only. Eastern and Western drivers can have very different driving patterns, so that should be addressed in the discussion section of the paper as well.<br />
<br />
The paper mentions that in the future, this research plans to learn the real life behaviour of automated vehicles. Seeing a possible improvement in road safety due to this research will be very interesting.<br />
<br />
This predictor is also possible to be applied in the traffic control system.<br />
<br />
This prediction model should consider various conditions that could happen in an intersection. However, normal prediction may not work when there is a traffic jam or in some crowded time periods like rush hours.<br />
<br />
It would be better that the author could provide more comparison between the LSTN-RNN algorithm and other traditional algorithm such as RNN or just LSTM.<br />
<br />
The paper has really good results for what they aimed to achieve. However for the future work it would also be nice to have various climates/weathers to be included in the Seoul dataset. I think it's also important to consider it as different climates/weather (such as snowy roads, or rain) would introduce more noisier data (camera's image processing) and the human drivers behaviour would change as well to adapt to the new environment.<br />
<br />
It would be good to have a future work section to discusses shortage of current algorithms and the possible improvement.<br />
<br />
The summary explains the whole process well, but is missing the small details among the steps. It would be better to explain concepts such as RNN, modelling procedure for first time users.<br />
<br />
This paper presents a nice method, but does not seem particularly well developed. I would have liked to see some more ablations on this particular choice of RNN, as there are more efficient variants such as GRU which show similar performance in other tasks while being more amenable to real-time inference. Furthermore, the multi-model aspect seems slightly ad-hoc, it would have been nice to see a more rigorous formulation similar to seen in some recent work by Zeng et al. from Uber ATG: https://arxiv.org/pdf/2008.06041.pdf.<br />
<br />
The data used for this paper contains driver information exclusively to the urban roads of Gwanak-gu Seoul, hence the data may contain an inherited bias as drivers around the rest of the country, let alone the rest of the world, will have different habits based on different environments. It would be interesting to see if this model can be applied to other cities around the world and exhibit similar results or would there be a need to tune it based off geographic location.<br />
<br />
Since the data is based on urban roads, It would be better to include the details on performance of the model on high traffic area vs low traffic urban area. It would also be interesting to see the performance of the model with many pedestrians.<br />
<br />
While it would be nice to read more on why the authors chose LSTM-RNN, the paper exhibits a potential way to improve autonomous vehicle performance. It would be interesting to see how an army of robots would behave when this paper's method is applied in robotics, since robots' motions also follow a trajectory.<br />
<br />
An interesting topic, but the paper and accompanying summary are missing some details that would improve understandability. With respect to the background of the topic, a more detailed explanation of trajectories in the case of driving would help to better motivate the research. The addition of benchmarks and comparisons to current industry standards (if they are published publicly) would help to contextualize the results of the LSTM-RNN. An area of further study is applying these techniques to different weather situations and driving patterns in different countries. How does this model perform in regions where driver's very loosely follow the laws of the road? Further, could the research be generalized for controlled and uncontrolled turn lanes, especially on roads with higher speed limits? <br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction&diff=49586Surround Vehicle Motion Prediction2020-12-06T22:53:30Z<p>Y2587wan: /* Data */</p>
<hr />
<div>'''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections based on a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN). More specifically, it focuses on improving in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing a learning-based target motion predictor and a prediction-based motion planner. A data-driven approach for predicting the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections is described. The LSTM architecture, a kind of RNN capable of learning long-term dependencies, is designed to manage the complex vehicle motions that occur at multi-lane turn intersections. The results show that the predictor improves the recognition time of the leading vehicle and contributes to improved prediction ability.<br />
<br />
== Previous Work ==<br />
Previous approaches to vehicle trajectory prediction used motion models such as Constant Velocity (CV) and Constant Acceleration (CA). These models are linear and can only handle straight motions. Curvilinear models such as Constant Turn Rate and Velocity (CTRV) and Constant Turn Rate and Acceleration (CTRA) handle rotations and more complex motions. Together with these models, a Kalman Filter is used to predict the vehicle trajectory. Kalman filtering is a common technique used in sensor fusion for state estimation: it allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the Kalman Filter performs poorly on multi-step prediction problems, where a Recurrent Neural Network performs significantly better. <br />
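The limitation of these physics-based baselines is easy to see in a small sketch. Below is a minimal, hypothetical NumPy example (not the paper's code) of multi-step prediction with a constant-velocity model: because the model is applied recursively, a turning vehicle is extrapolated as if it kept going straight.<br />

```python
import numpy as np

def predict_cv(state, dt, n_steps):
    """Multi-step prediction with a constant-velocity (CV) motion model.

    state: [x, y, vx, vy]; the CV model simply propagates position
    with a fixed velocity, so any turn is extrapolated as a straight line.
    """
    # State-transition matrix of the linear CV model.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    trajectory = []
    s = np.asarray(state, dtype=float)
    for _ in range(n_steps):
        s = F @ s          # repeated application compounds model error
        trajectory.append(s.copy())
    return np.array(trajectory)

# A vehicle entering a left turn: the CV model keeps it going straight.
traj = predict_cv([0.0, 0.0, 10.0, 0.0], dt=0.1, n_steps=15)
print(traj[-1])  # 1.5 s ahead: x = 15 m, y = 0 m, no turning captured
```

A curvilinear model such as CTRV adds a turn rate to the state, but the same compounding of model error over multiple steps is what motivates learned predictors such as the LSTM-RNN.<br />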
<br />
There are three main challenges to achieving fully autonomous driving on urban roads: scene awareness, inferring other drivers' intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver's intuition to improve safety when autonomous vehicles and human drivers drive together. Models for predicting driver behaviour on urban roads fall into three categories: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, considering only the kinematic states of the predicted vehicles. Their advantage is the smallest computational burden of the three types; however, they cannot account for interactions between vehicles. Maneuver-based models classify and predict the driver's intention; once the maneuver is predicted, the future trajectory can be inferred. Identifying similar driving behaviours makes it possible to infer different drivers' intentions, which is claimed to improve prediction accuracy; still, maneuver-based models mainly serve to complement physics-based models. <br />
<br />
A Recurrent Neural Network (RNN) is the type of approach proposed in this paper to infer driver intention. Interaction-aware models can reflect interactions between surrounding vehicles and predict the future motions of all detected vehicles simultaneously as a scene. However, their prediction algorithms are computationally complex and are therefore often restricted to offline simulations. As Schulz et al. indicate, interaction models are very difficult to create since "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Little research has been dedicated to predicting vehicle trajectories at intersections. Moreover, public data sets for analyzing driver behaviour at intersections are scarce and difficult to collect. A model is therefore needed to predict the various movements of targets around a multi-lane turn intersection, and it is essential that such a motion predictor can run in real-time traffic.<br />
<br />
<center><br />
[[ File:intersection.png |300px]]<br />
</center><br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder. The proposed architecture uses a perception algorithm, relying on six sensing devices, to estimate the states of surrounding vehicles. The output predicts the states of the surrounding vehicles and is used to determine the desired longitudinal acceleration in actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Sensor Outputs ===<br />
<br />
The inputs for target perception come from the sensor outputs. The data collected in this work uses six different devices, with feature fusion, to detect traffic in a range of up to 100 m: (1) a LiDAR system, which outputs relative position, heading, velocity, and box size in local coordinates; (2) an Around-View Monitoring (AVM) system and (3) GPS, which acquire lanes, road markers, and global position; (4) a gateway engine, which outputs a precise global position in the urban road environment; and (5) a Micro-Autobox II and (6) an MDPS, which are used to control and actuate the subject vehicle. All data are stored on an industrial PC.<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The dataset was collected using a human-driven autonomous vehicle (AV) equipped with sensors to track the motion of surrounding vehicles. In addition, a front camera, the Around-View Monitor, and GPS were used to acquire the lanes, road markers, and global position. The data was collected on the urban roads of Gwanak-gu, Seoul, South Korea. The training model is generated from 484 tracks collected while driving through intersections in real traffic, from which the previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing the collected data, a total of 16,660 data samples were generated, comprising 11,662 training samples and 4,998 evaluation samples.<br />
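To make the sample-extraction step concrete, here is a hypothetical sketch (the window lengths and feature count are placeholder assumptions, not the paper's values) of how one track can be split into past/future state pairs:<br />

```python
import numpy as np

def make_windows(track, n_in=15, n_out=15):
    """Split one vehicle track into (past, future) training pairs.

    track: array of shape (T, F) -- T time steps, F features per step
    (e.g. relative x/y, heading, speed).  Each sample pairs n_in past
    steps with the n_out steps that follow them.
    """
    X, Y = [], []
    for t in range(len(track) - n_in - n_out + 1):
        X.append(track[t : t + n_in])
        Y.append(track[t + n_in : t + n_in + n_out])
    return np.array(X), np.array(Y)

# A toy track of 100 steps with 5 features.
track = np.random.rand(100, 5)
X, Y = make_windows(track)
print(X.shape, Y.shape)  # (71, 15, 5) (71, 15, 5)
```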
<br />
=== Motion predictor ===<br />
This paper proposes a data-driven method to predict the future movement of surrounding vehicles based on their sequential previous motion. The motion predictor based on the LSTM-RNN architecture in this work uses only information collected from sensors on the autonomous vehicle, as shown in the figure below. A contribution of this network architecture is that the future states of a target vehicle are predicted using only features observable from the subject vehicle's field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
An RNN is an artificial neural network suitable for sequential data because it has recurrent connections on its hidden nodes, and can thus retain its state, or memory, while processing the next input or sequence of inputs. For this reason, RNNs can be used to analyze time-series data where the pattern of the data depends on the flow of time. This is an impossible task for traditional artificial neural networks, which assume the inputs are independent of one another. RNNs can also contain feedback loops that allow activations to flow around the loop. <br />
<br />
Like traditional neural networks, RNNs suffer from the problem of vanishing gradients. An LSTM avoids this by letting errors flow backward without limit through its virtual layers, which prevents them from vanishing or exploding over time and making the network train improperly. The figure below shows the layers of the LSTM-RNN and the number of units in each layer. This structure was determined by comparing the accuracy of 72 RNNs, consisting of combinations of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
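To illustrate the gating mechanism described above, the following is a generic single LSTM cell implemented in NumPy with random placeholder weights (the layer sizes here are toy values, not the configuration in the figure). The additive update of the cell state <math>c</math> is what allows error signals to propagate over long horizons without vanishing.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b, n_hidden):
    """One forward step of a standard LSTM cell.

    The gates (input i, forget f, output o) and candidate g control how
    much of the cell state c is kept, updated, and exposed -- this gated,
    additive update of c is what lets gradients flow over long horizons.
    """
    z = W @ x + U @ h + b                      # all four gates at once
    i = sigmoid(z[0 * n_hidden : 1 * n_hidden])
    f = sigmoid(z[1 * n_hidden : 2 * n_hidden])
    o = sigmoid(z[2 * n_hidden : 3 * n_hidden])
    g = np.tanh(z[3 * n_hidden : 4 * n_hidden])
    c_new = f * c + i * g                      # additive cell-state update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 8                          # toy sizes, not the paper's
W = rng.normal(size=(4 * n_hidden, n_in))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = c = np.zeros(n_hidden)
for x in rng.normal(size=(15, n_in)):          # a 15-step input sequence
    h, c = lstm_step(x, h, c, W, U, b, n_hidden)
print(h.shape)  # (8,)
```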
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to a moving AV, the speed of the data collection vehicle is added to the input sequence. The input sequence thus consists of the relative X/Y positions, relative heading angle, and speed of the surrounding target vehicles, plus the speed of the data collection vehicle. The output sequence has the same form: relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and a decoder that process the input from the sensors and the output from the RNN, respectively. The encoder normalizes each component of the input data, rescaling it to mean 0 and standard deviation 1, while the decoder denormalizes the output data using the same parameters as the encoder to scale it back to the actual units. <br />
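As a concrete (hypothetical) sketch of this encoder/decoder pair, z-score normalization and its inverse can be written as:<br />

```python
import numpy as np

class Standardizer:
    """Encoder/decoder pair: z-score normalization and its inverse.

    fit() stores per-feature mean and standard deviation; encode()
    rescales data to mean 0 / std 1 for the RNN, and decode() maps
    network outputs back to physical units with the same parameters.
    """
    def fit(self, data):                 # data: (n_samples, n_features)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def encode(self, data):
        return (data - self.mean) / self.std

    def decode(self, data):
        return data * self.std + self.mean

raw = np.array([[10.0, 0.5], [20.0, 1.5], [30.0, 2.5]])  # e.g. [speed, heading]
enc = Standardizer().fit(raw)
z = enc.encode(raw)
print(z.mean(axis=0))      # ~[0, 0]
print(enc.decode(z))       # recovers the original values
```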
==== Sequence length ==== <br />
The sequence length of the RNN input and output is another important factor for prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps with a 100-millisecond sampling time were compared, and 15 steps showed relatively accurate results even though its observation time is short among the candidates.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers predict possible risks based on observations of surrounding vehicles, and ensure safety by changing their behaviour before the risks occur. To achieve a human-like motion plan, a prediction-based motion planner for autonomous vehicles is designed based on the model predictive control (MPC) method, taking into account the future behaviour of surrounding drivers. The cost function of the motion planner is defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of the travel distance <math>p_x</math> and longitudinal velocity <math>v_x</math>; <math>x_{ref} (k|t)</math> consists of the reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math>; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and <math>Q</math>, <math>R</math>, and <math>R_{\Delta \mu}</math> are the weight matrices for the states, input, and input derivative, respectively. These weight matrices were tuned so that the control inputs from the proposed controller were as similar as possible to those of human-driven vehicles. <br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&u_{min} \leq u(k|t) \leq u_{max} \\<br />
&||u(k+1|t) - u(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
where <math>u_{min}</math>, <math>u_{max}</math>, and <math>S</math> are the minimum control input, maximum control input, and maximum slew rate of the input, respectively.<br />
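The cost function above can be evaluated numerically for a candidate input sequence. The sketch below (with placeholder weights; the paper's tuned values are not given) computes the three terms: state-tracking error, control effort, and input-change penalty.<br />

```python
import numpy as np

def mpc_cost(x, x_ref, u, Q, R, R_du):
    """Evaluate the quadratic MPC cost for one candidate input sequence.

    x, x_ref: (Np, 2) predicted and reference states [travel distance,
    longitudinal velocity]; u: (Np,) acceleration commands.  The three
    terms penalize tracking error, control effort, and input changes.
    """
    Np = len(u)
    track = sum((x[k] - x_ref[k]) @ Q @ (x[k] - x_ref[k]) for k in range(Np))
    effort = R * np.sum(u ** 2)
    smooth = R_du * np.sum(np.diff(u) ** 2)
    return track + effort + smooth

Np = 5
Q = np.diag([1.0, 0.5])          # placeholder weights, not the paper's tuning
x_ref = np.column_stack([np.arange(1, Np + 1, dtype=float),
                         np.full(Np, 10.0)])
x = x_ref + 0.1                  # small tracking error everywhere
u = np.zeros(Np)
print(mpc_cost(x, x_ref, u, Q, R=0.1, R_du=0.5))  # 0.075
```

An MPC solver would minimize this cost over <math>u</math> subject to the input and boundary constraints.<br />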
<br />
The position and speed boundaries are determined based on the predicted states:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
where <math>v_{x,limit}</math> is the speed limit of the target road.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with three base algorithms: a path-following model with constant velocity, a path-following model with traffic flow, and a CTRV model.<br />
<br />
The algorithms are compared according to four kinds of errors: the <math>x</math> position error <math>e_{x,T_p}</math>, the <math>y</math> position error <math>e_{y,T_p}</math>, the heading error <math>e_{\theta,T_p}</math>, and the velocity error <math>e_{v,T_p}</math>, where <math>T_p</math> denotes the prediction time. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
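The summary statistics reported for these error signals can be reproduced with a short (hypothetical) sketch; the data here is synthetic, not the paper's:<br />

```python
import numpy as np

def error_stats(true, pred):
    """Mean, standard deviation, and RMSE of a prediction-error signal."""
    e = true - pred
    return e.mean(), e.std(), np.sqrt(np.mean(e ** 2))

# Hypothetical y-position errors (metres) at the prediction horizon.
rng = np.random.default_rng(1)
e_true = rng.normal(loc=0.0, scale=0.3, size=1000)
mean, std, rmse = error_stats(e_true, np.zeros_like(e_true))
print(f"mean={mean:.3f}  std={std:.3f}  rmse={rmse:.3f}")

# A bell-shaped error with near-zero mean: the three-sigma band
# (~ +/- 0.9 m here) staying within a lane width (~3.5 m) is the kind of
# check used to argue the predictions are safe.
print(3 * std < 3.5)
```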
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly smaller prediction errors compared to the base algorithms in terms of mean, standard deviation (STD), and root mean square error (RMSE). Meanwhile, the error distribution of the proposed model exhibits a bell-shaped curve with a mean close to zero, which indicates that the algorithm's predictions of human drivers' intentions are relatively precise. Moreover, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, and <math>e_{v,T_p}</math> are bounded within reasonable levels; for instance, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, the proposed algorithm can be precise while maintaining safety.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method better mimics a human driver by simulating a human driver's decision-making process. In a multi-lane left-turn scenario, the proposed algorithm correctly predicted the trajectory of a target vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives: the time to recognize the in-lane target, and the similarity to human driver commands. In most cases, the proposed algorithm detects the in-lane target no later than the base algorithm does. Moreover, the proposed algorithm recognized targets later than the base algorithm only when the surrounding target vehicles first appeared beyond the boundaries of the sensors' region of interest. This means that these cases took place sufficiently beyond the safety distance and had little influence on determining the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
To compare the similarity between the results of the proposed algorithm and human driving decisions, this article introduces another type of error, the acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>, where <math>a_{x, human}</math> and <math>a_{x, cmd}</math> are the human driver's acceleration history and the command from the proposed algorithm, respectively. The proposed algorithm produced results more similar to human drivers' decisions than the base algorithm: <math>91.97\%</math> of the acceleration errors lie within <math>\pm 1 m/s^2</math>. Moreover, the base algorithm has only a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed model is both efficient and safe.<br />
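The 91.97% figure is a simple band statistic; with synthetic stand-in data (the paper's acceleration histories are not available here), it could be computed as:<br />

```python
import numpy as np

def fraction_within(err, band=1.0):
    """Share of acceleration errors a_x,human - a_x,cmd inside +/- band."""
    err = np.asarray(err)
    return np.mean(np.abs(err) <= band)

# Hypothetical error samples; the paper reports ~92% within +/- 1 m/s^2.
rng = np.random.default_rng(2)
errors = rng.normal(0.0, 0.55, size=5000)
print(f"{100 * fraction_within(errors):.1f}% within +/- 1 m/s^2")
```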
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN for multi-lane turn intersections was developed, and its application in an autonomous vehicle was evaluated. The model was trained using data captured on urban roads in Seoul and applied within an MPC-based motion planner. The evaluation results showed precise prediction accuracy, so the algorithm is safe to apply to an autonomous vehicle; the comparison with the three base algorithms (CV/Path, V_flow/Path, and CTRV) also revealed the superiority of the proposed algorithm. In addition, the time to recognize in-lane targets within the intersection improved significantly over the base algorithms. The proposed algorithm was compared with human driving data and showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unstructured environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
This paper identifies several avenues for future research:<br />
<br />
1. Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2. Applying the machine learning-based approach to infer lane-change intention on motorways and the main roads of urban environments.<br />
<br />
3. Extending the target roads of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4. Learning the behaviour of surrounding vehicles in real time while automated vehicles drive in real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTMs, RNNs, and studies on different types of roads. Why the LSTM-RNN is used, and the background of the method, are not stated clearly. Some concepts are missing, making it difficult to distinguish between the LSTM-RNN-based motion predictor and the motion planner.<br />
<br />
This is an interesting topic to discuss, and a major one for famous vehicle companies such as Tesla, which already offers a service called Autopilot providing self-driving and motion prediction. This summary could include more diagrams of the model architecture to give readers an overall view of what the model looks like; since it uses an LSTM-RNN, including some pictures of the LSTM-RNN would be great. It would also be interesting to discuss more applications of this method, such as airplanes or boats.<br />
<br />
Autonomous driving is a very hot topic, and training the model with an LSTM-RNN is also a meaningful topic to discuss. It would also be interesting to compare the performance against other traditional motion planning algorithms such as the Kalman Filter (KF).<br />
<br />
Some papers have discussed the accuracy of different models for vehicle prediction, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions [https://arxiv.org/pdf/1908.00219.pdf]. There, the LSTM alone did not show good performance; the authors increased accuracy by combining the LSTM with an unconstrained model (UM), adding an additional LSTM layer of size 128 that recursively outputs positions instead of outputting positions for all horizons simultaneously.<br />
<br />
It would be better to provide experimental results supporting the efficiency of the LSTM-RNN, to discuss the predictions on the training and test sets, and to compare it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary could be provided to aid understanding, such as the track of the car's movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data, which is known to be extremely noisy and unpredictable when it comes to motion planning. It would be crucial to evaluate the successes of this paper against state-of-the-art reinforcement learning techniques.<br />
<br />
It would be better if the authors compared their method against other state-of-the-art (SOTA) methods. Also, one of the reasons motion planning is done using interpretable methods rather than black boxes (such as this model) is that it is hard to see where things go wrong and fix problems with a black box when they occur - this is something the authors should also have discussed.<br />
<br />
A future area of study is to combine other sources of information, such as signals from LiDAR or car side cameras, to build a better prediction model.<br />
<br />
It might be interesting and helpful to conduct some training and testing under different weather/environmental conditions, as it could provide more generalization to real-life driving scenarios. For example, foggy weather and evening (low light) conditions might affect the performance of sensors, and rainy weather might require a longer braking distance.<br />
<br />
This paper proposes an interesting, novel motion prediction algorithm using an LSTM-RNN. However, since motion prediction in autonomous driving has great real-life impact, I believe the evaluations of the algorithm should be more thorough. For example, more traditional motion planning algorithms, such as multi-modal estimation and Kalman filters, should be used as benchmarks. Moreover, the experimental results are based on Korean driving conditions only; Eastern and Western drivers can have very different driving patterns, so that should be addressed in the discussion section of the paper as well.<br />
<br />
The paper mentions that in the future, this research plans to learn the real life behaviour of automated vehicles. Seeing a possible improvement in road safety due to this research will be very interesting.<br />
<br />
This predictor could also be applied in traffic control systems.<br />
<br />
This prediction model should consider the various conditions that could occur at an intersection. However, normal prediction may not work when there is a traffic jam or during crowded periods such as rush hour.<br />
<br />
It would be better if the authors provided more comparisons between the LSTM-RNN algorithm and other traditional algorithms, such as a plain RNN or a plain LSTM.<br />
<br />
The paper has really good results for what it aimed to achieve. However, for future work, it would also be nice to have various climates/weathers included in the Seoul dataset. This is important to consider, as different climate/weather conditions (such as snowy roads or rain) would introduce noisier data (from the camera's image processing), and human drivers' behaviour would change as well to adapt to the new environment.<br />
<br />
It would be good to have a future work section that discusses the shortcomings of current algorithms and possible improvements.<br />
<br />
The summary explains the whole process well but is missing small details among the steps. It would be better to explain concepts such as RNNs and the modelling procedure for first-time readers.<br />
<br />
This paper presents a nice method but does not seem particularly well developed. I would have liked to see some more ablations on this particular choice of RNN, as there are more efficient variants, such as the GRU, which show similar performance on other tasks while being more amenable to real-time inference. Furthermore, the multi-modal aspect seems slightly ad hoc; it would have been nice to see a more rigorous formulation, similar to that seen in some recent work by Zeng et al. from Uber ATG: https://arxiv.org/pdf/2008.06041.pdf.<br />
<br />
The data used in this paper contains driver information exclusively from the urban roads of Gwanak-gu, Seoul, so it may carry an inherent bias: drivers across the rest of the country, let alone the rest of the world, will have different habits shaped by different environments. It would be interesting to see whether this model can be applied to other cities around the world with similar results, or whether it would need to be tuned for each geographic location.<br />
<br />
Since the data is based on urban roads, it would be better to include details on the model's performance in high-traffic versus low-traffic urban areas. It would also be interesting to see the model's performance when many pedestrians are present.<br />
<br />
While it would be nice to read more on why the authors chose the LSTM-RNN, the paper exhibits a potential way to improve autonomous vehicle performance. It would be interesting to see how a fleet of robots would behave if this paper's method were applied in robotics, since robots' motions also follow trajectories.<br />
<br />
An interesting topic, but the paper and accompanying summary are missing some details that would improve understandability. With respect to the background of the topic, a more detailed explanation of trajectories in the case of driving would help to better motivate the research. The addition of benchmarks and comparisons to current industry standards (if they are published publicly) would help to contextualize the results of the LSTM-RNN. An area of further study is applying these techniques to different weather situations and driving patterns in different countries. How does this model perform in regions where drivers only loosely follow the rules of the road? Further, could the research be generalized to controlled and uncontrolled turn lanes, especially on roads with higher speed limits? <br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] H. Cui, T. Nguyen, F.-C. Chou, T.-H. Lin, J. Schneider, D. Bradley, and N. Djuric, “Deep kinematic models for kinematically feasible vehicle trajectory predictions,” 2019, [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6] J. Schulz, C. Hubmann, N. Morin, J. Löchner, and D. Burschka, “Learning interaction-aware probabilistic driver behavior models from urban scenarios,” in Proc. IEEE Intell. Veh. Symp. (IV), Paris, France, 2019, doi: 10.1109/IVS.2019.8814080.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNN&diff=48544Mask RCNN2020-11-30T22:29:12Z<p>Y2587wan: /* Visual Perception tasks */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision, which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., estimating human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3]: instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI Pooling is to resize each proposal to a uniform size, which makes it easier for the subsequent network to process. It maps the proposal to the corresponding position on the feature map, divides the mapped area into sections of equal size, and performs max pooling or average pooling on each section.<br />
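The bin-and-pool step described above can be sketched in a few lines; this is a toy NumPy illustration of the idea, not the actual RoIPool implementation (which operates on batched multi-channel tensors with sub-pixel quantization):<br />

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size):
    """Crop an RoI from a 2-D feature map and max-pool it to a fixed out_size.

    Toy sketch: integer RoI corners, single channel, and the region is assumed
    to be at least out_size in each dimension.
    """
    x0, y0, x1, y1 = roi                         # RoI corners on the feature map
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ph, pw = out_size
    pooled = np.zeros((ph, pw), dtype=feature_map.dtype)
    # Split the region into a ph-by-pw grid of bins and take the max of each bin.
    ys = np.linspace(0, h, ph + 1).astype(int)
    xs = np.linspace(0, w, pw + 1).astype(int)
    for i in range(ph):
        for j in range(pw):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_max_pool(fmap, (0, 0, 4, 4), (2, 2)))  # -> [[ 9. 11.] [25. 27.]]
```

Whatever the proposal's size, the output is always 2×2 here, which is what lets the subsequent fully-connected layers operate on a fixed-size input.<br />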
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
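The scale-dependent level selection mentioned above follows the heuristic from the FPN paper, k = floor(k0 + log2(sqrt(wh)/224)), where 224 is the canonical ImageNet size and k0 = 4; a minimal sketch (illustrative only, with levels clamped to the usual P2–P5 range):<br />

```python
import math

def fpn_level(w, h, k0=4, canonical=224):
    """Assign an RoI of size w x h to a feature-pyramid level using the
    FPN heuristic k = floor(k0 + log2(sqrt(w*h) / canonical)),
    clamped to pyramid levels 2..5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(2, min(5, k))

print(fpn_level(224, 224))   # canonical-size RoI -> level 4
print(fpn_level(112, 112))   # half-size RoI -> finer level 3
```

Smaller RoIs are routed to finer (higher-resolution) pyramid levels, larger RoIs to coarser ones.<br />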
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of Mask R-CNN is quite similar to that of Faster R-CNN. <br />
Faster R-CNN has two stages: the RPN (Region Proposal Network) first proposes candidate object bounding boxes, then RoIPool extracts features from these boxes, which are used for classification and bounding-box regression. Mask R-CNN shares the identical first stage, but its second stage is extended: in addition to classification and bounding-box regression, it also outputs a binary mask for each RoI.<br />
<br />
The important concept here is that in most recent systems there is a fixed order between classification and mask prediction: classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification, regression, and mask prediction in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. For comparison, the complete original R-CNN pipeline involves: 1. making region proposals; 2. extracting features from the region proposals; 3. classifying objects with an SVM; 4. bounding-box regression. Stages 3 and 4 are the ones adjusted to simplify the procedure.<br />
<br />
The system is trained with a multi-task loss, L = L_cls + L_box + L_mask: the classification loss, plus the bounding-box loss, plus the average binary cross-entropy loss over the mask. One thing worth noticing is that in other network systems the masks across classes compete with each other; here, with a per-pixel sigmoid and a binary loss, the masks across classes no longer compete, which makes this formulation the key to good instance segmentation results.<br />
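The mask term of this loss can be sketched as follows; a toy NumPy illustration (not the authors' implementation), showing that with a per-pixel sigmoid only the ground-truth class's mask contributes, so the logits of the other classes cannot compete with it:<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average binary cross-entropy on the m-by-m mask of the ground-truth
    class only; the other K-1 class masks contribute nothing, so masks
    do not compete across classes."""
    p = sigmoid(mask_logits[gt_class])                    # per-pixel sigmoid
    bce = -(gt_mask * np.log(p) + (1.0 - gt_mask) * np.log(1.0 - p))
    return bce.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 2))          # toy sizes: K=3 classes, 2x2 mask
gt = np.array([[1.0, 0.0], [0.0, 1.0]])      # ground-truth binary mask
L_cls, L_box = 0.4, 0.2                      # hypothetical values for the other terms
L_total = L_cls + L_box + mask_loss(logits, gt, gt_class=1)
print(L_total)
```

Perturbing the logits of any non-ground-truth class leaves L_mask unchanged, which is exactly the decoupling the per-pixel sigmoid buys over a softmax across classes.<br />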
<br />
Another important concept involved is RoIAlign. It comes into play in stage 2, where features are extracted from the bounding boxes. For each RoI as input, there will be a mask and a feature map as output: the mask is obtained using an FCN (Fully Convolutional Network) and the feature map via the RoI layer. The mask preserves spatial layout, which is crucial for pixel-to-pixel correspondence. Two properties are desired along the way: pixel-to-pixel correspondence, and no quantization of any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence ensures that input and output match in size; if there is a size difference, information is lost and coordinates cannot be matched. Instead of quantizing, RoIAlign computes feature values at the exact sampling coordinates using bilinear interpolation, guaranteeing spatial correspondence.<br />
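The bilinear interpolation at the heart of RoIAlign can be sketched as follows; an illustrative NumPy example (single point, single channel), not the paper's code:<br />

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map at a real-valued (y, x) point, as RoIAlign
    does, instead of rounding the coordinates to the nearest cell
    (the quantization that RoIPool performs)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    dy, dx = y - y0, x - x0
    # Weighted average of the four surrounding grid values.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bot = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bot

fmap = np.array([[0.0, 1.0], [2.0, 3.0]])
print(bilinear_sample(fmap, 0.5, 0.5))   # midpoint of the four values -> 1.5
```

Because the sampled value varies smoothly with (y, x), small shifts of the RoI produce small shifts in the extracted features, which is what removes the misalignment that quantized pooling introduces.<br />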
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
There are some implementation details worth mentioning. First, an RoI is considered positive if its IoU with a ground-truth box is at least 0.5, and negative otherwise; this matters because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. As an example of the complete pipeline with an FPN backbone: 1000 proposals are generated, the box prediction branch is run on these proposals, and the mask branch is then applied to the 100 highest-scoring detection boxes. The mask branch can predict K masks per RoI, but only the k-th mask is used, where k is the class predicted by the classification branch. The m-by-m floating-point mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
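The IoU test that decides whether an RoI is positive can be sketched as follows (a toy example with (x0, y0, x1, y1) boxes; the 0.5 threshold is the one stated above):<br />

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

roi, gt = (0, 0, 10, 10), (2, 0, 12, 10)
positive = iou(roi, gt) >= 0.5   # the mask loss is computed only on positive RoIs
print(iou(roi, gt), positive)
```

Here the two boxes overlap on 80 of their combined 120 units of area, so the RoI clears the 0.5 threshold and would contribute to the mask loss.<br />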
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: On the COCO dataset, Mask R-CNN outperforms MNC and FCIS, the previous state-of-the-art models, in all categories. <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments 2</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask and adopt Mask R-CNN to predict K masks, one for each of the K keypoint types (e.g., left shoulder, right elbow). <br />
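The one-hot keypoint encoding can be sketched as follows (a toy NumPy example; the mask size m=14 here is purely illustrative):<br />

```python
import numpy as np

def keypoint_to_onehot(y, x, m):
    """Encode a keypoint location as an m-by-m one-hot mask: a single
    foreground pixel at (y, x), all other pixels background."""
    mask = np.zeros((m, m), dtype=np.float32)
    mask[y, x] = 1.0
    return mask

left_shoulder = keypoint_to_onehot(2, 5, m=14)
print(left_shoulder.sum(), left_shoulder[2, 5])   # 1.0 1.0
```

Training then treats each of the K keypoint types exactly like a (one-pixel) segmentation mask, which is why the instance-segmentation machinery transfers to pose estimation with minimal changes.<br />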
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network that aims to solve instance segmentation problems in machine learning and computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It performs object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundaries are quantized. Mask RCNN avoids this quantization and uses bilinear interpolation to compute exact feature values. With the misalignments caused by quantization resolved, the number and location of sampling points have no significant impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks comparisons of different methods and of Mask R-CNN on unlabelled data; it only briefly mentions that the authors found Mask R-CNN can benefit from extra data, even if that data is unlabelled.<br />
<br />
Mask RCNN has many practical applications as well. A particular example where Mask RCNNs are applied would be autonomous vehicles, where they can help isolate pedestrians, other vehicles, lights, etc.<br />
<br />
An interesting application of Mask RCNN would be face recognition from CCTV footage. Blurry pictures of crowded scenes could be obtained from CCTV, and Mask RCNN could be applied to distinguish each person.<br />
<br />
The main problem for CNN architectures like Mask RCNN is running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live-stream detection, where a faster running time means a better response to changes between frames. It would be beneficial to have a graphical comparison of Mask RCNN running times against single-shot detector algorithms such as YOLOv3.<br />
<br />
It would be interesting to investigate combining instance segmentation with semantic segmentation to improve time performance, because in many situations knowing the exact boundary of an object is not necessary.<br />
<br />
It would be better to have more comparisons with other models, as well as more details on why Mask RCNN performs better and on how efficient it is.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015.</div>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN [1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision, which is important when attempting to identify different objects within the same image. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., estimating human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3]: instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Build on image classification but localize each object in an image<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture for Instance Segmentation.<br />
<br />
== Related Work == <br />
Region Proposal Network: A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.<br />
<br />
ROI Pooling: The main use of RoI Pooling is to convert proposals of varying sizes to a uniform size, which is easier for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divides the mapped area into sections of the same size, and performs max pooling or average pooling on each section.<br />
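The divide-and-pool step can be sketched as follows. This is an illustrative toy implementation (the function name and the integer rounding of coordinates are our own simplifications, not taken from the paper); the rounding shown here is exactly the quantization that RoIAlign later removes.<br />

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool an RoI (y0, x0, y1, x1) on a 2-D feature map into an
    out_size x out_size grid. Coordinates are quantized to integers,
    which is the rounding step that RoIAlign avoids."""
    y0, x0, y1, x1 = [int(round(c)) for c in roi]
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Split the region into out_size x out_size bins and max-pool each bin.
    ys = np.linspace(0, h, out_size + 1, dtype=int)
    xs = np.linspace(0, w, out_size + 1, dtype=int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Whatever the RoI size, the output grid has a fixed shape, which is what lets the downstream fully-connected layers operate on proposals of arbitrary size.<br />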
<br />
Faster R-CNN: Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network, proposes candidate object bounding boxes. <br />
The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is actually a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages: the RPN (Region Proposal Network) first proposes candidate object bounding boxes; then RoIPool extracts features from these boxes, and those features are used for classification and bounding-box regression. Mask R-CNN shares the identical first stage, but its second stage is adjusted: instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI.<br />
<br />
The important concept here is that, in most recent network systems, there is a fixed order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. For comparison, the complete R-CNN pipeline involves: 1. making region proposals; 2. extracting features from the region proposals; 3. SVM classification of objects; 4. bounding-box regression. In effect, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system is trained with a multi-task loss <math>L = L_{cls} + L_{box} + L_{mask}</math>, i.e., the classification loss plus the bounding-box loss plus the average binary cross-entropy mask loss. One thing worth noticing is that in other network systems the masks across classes compete with each other, but in this particular case, with a <br />
per-pixel sigmoid and a binary loss, the masks across classes no longer compete, which makes this formulation key to good instance segmentation results.<br />
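The per-RoI loss can be sketched numerically as below. This is a hypothetical minimal version (function name and the exact box-loss form are our assumptions; the paper uses smooth L1 for boxes, reproduced here): only the mask of the ground-truth class k contributes to the mask term, so masks across classes do not compete.<br />

```python
import numpy as np

def mask_rcnn_loss(cls_logits, cls_target, box_pred, box_target,
                   mask_logits, mask_target, k):
    """Per-RoI multi-task loss L = L_cls + L_box + L_mask.
    mask_logits has shape (K, m, m): one m x m mask per class; only the
    ground-truth class k's mask is penalized (per-pixel sigmoid + BCE)."""
    # Classification: softmax cross-entropy over the class logits.
    p = np.exp(cls_logits - cls_logits.max())
    p /= p.sum()
    l_cls = -np.log(p[cls_target])
    # Box regression: smooth L1 on the 4 box deltas.
    d = np.abs(box_pred - box_target)
    l_box = np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()
    # Mask: average binary cross-entropy on the k-th mask only.
    s = 1.0 / (1.0 + np.exp(-mask_logits[k]))
    l_mask = -np.mean(mask_target * np.log(s)
                      + (1 - mask_target) * np.log(1 - s))
    return l_cls + l_box + l_mask
```

Because the sigmoid is applied per pixel and per class, a confident mask for class k does not push down the masks of other classes, unlike a softmax over classes.<br />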
<br />
Another important concept is RoIAlign. It is relevant in stage 2, where RoIPool extracts <br />
features from bounding boxes. For each input RoI, there is a mask and a feature map as output: the mask is obtained using an FCN (Fully Convolutional Network) and the feature map using RoIPool. The mask preserves spatial layout, which is crucial for pixel-to-pixel correspondence. Two properties are desired throughout the procedure: pixel-to-pixel correspondence, and no quantization of any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence ensures that the input and output match in size; a size mismatch would mean information loss and coordinates that cannot be matched. Also, instead of quantization, the coordinates are computed using bilinear interpolation to guarantee spatial correspondence.<br />
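The bilinear interpolation at the heart of RoIAlign can be illustrated with a single-point sampler (an educational sketch; the function name is ours, and a real RoIAlign averages several such samples per bin):<br />

```python
import numpy as np

def bilinear_sample(fm, y, x):
    """Sample feature map fm at continuous coordinates (y, x) via
    bilinear interpolation -- no rounding of coordinates, so the exact
    spatial correspondence is kept (the core idea of RoIAlign)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fm.shape[0] - 1), min(x0 + 1, fm.shape[1] - 1)
    wy, wx = y - y0, x - x0
    # Interpolate along x on the top and bottom rows, then along y.
    top = (1 - wx) * fm[y0, x0] + wx * fm[y0, x1]
    bot = (1 - wx) * fm[y1, x0] + wx * fm[y1, x1]
    return (1 - wy) * top + wy * bot
```

Since no coordinate is snapped to a grid cell, a sub-pixel shift in the RoI produces a correspondingly shifted feature value, which is what makes the extracted features pixel-accurate.<br />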
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
Some implementation details should be mentioned. First, an RoI is considered positive if its IoU with a ground-truth box is at least 0.5, and negative otherwise; this matters because the mask loss <math>L_{mask}</math> is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. As an example of the complete pipeline: 1000 proposals are generated by the FPN, the box prediction branch is run on these proposals, and the mask branch is then applied to the 100 highest-scoring detection boxes. The mask branch can predict K masks per RoI, but only the k-th mask is used, where k is the class predicted by the classification branch. The m-by-m floating-point mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
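The final mask post-processing step (keep only the predicted class's mask, resize to the RoI, binarize at 0.5) can be sketched as follows. This is an assumed toy version: nearest-neighbour resizing is used here for brevity, whereas the actual implementation resizes the soft mask with interpolation before thresholding.<br />

```python
import numpy as np

def postprocess_mask(mask_probs, k, roi_h, roi_w, thresh=0.5):
    """Keep only the k-th class's m x m floating-point mask, resize it
    to the RoI size (nearest-neighbour here for simplicity), and
    binarize at the given threshold."""
    m = mask_probs[k]  # (m, m) per-pixel probabilities for class k
    # Map each output pixel back to a source pixel (nearest neighbour).
    ys = np.arange(roi_h) * m.shape[0] // roi_h
    xs = np.arange(roi_w) * m.shape[1] // roi_w
    resized = m[np.ix_(ys, xs)]
    return (resized >= thresh).astype(np.uint8)
```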
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: On the COCO dataset, Mask R-CNN outperforms MNC and FCIS, the previous state-of-the-art models, in all categories. <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster R-CNN, the RoI boundaries are quantized. Mask R-CNN avoids this quantization and uses bilinear interpolation to compute exact feature values. With the misalignments caused by quantization resolved, the results are not sensitive to the exact number or location of the sampling points.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks comparisons of different methods and Mask R-CNN on unlabelled data; it only briefly mentions that the authors found Mask R-CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
An interesting application of Mask R-CNN would be face recognition from CCTV footage. Blurry pictures of crowded scenes could be obtained from CCTV, and Mask R-CNN could be applied to distinguish each person.<br />
<br />
The main problem for CNN architectures like Mask R-CNN is running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live-stream detection, where a faster running time means a better response to changes between frames. It would be beneficial to have a graphical comparison of Mask R-CNN running times against single-shot detector algorithms such as YOLOv3.<br />
<br />
It would be interesting to investigate embedding instance segmentation within semantic segmentation in order to improve time performance, because in many situations knowing the exact boundary of an object is not necessary.<br />
<br />
It would be better to have more comparisons with other models, as well as more details on why Mask R-CNN performs better and on how efficient it is.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=47397User:J46hou2020-11-28T21:08:18Z<p>Y2587wan: /* Popular Dataset Benchmark Result */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
This paper studies “one-class” classification, whose goal is to obtain accurate discriminators for a special class. A popular use of this technique is anomaly detection, which is widely used for detecting outliers. Anomaly detection is a well-studied area of research; however, the conventional approach of modeling the typical data with a simple function falls short in complex domains such as vision or speech. Another case where this is useful is recognizing a “wake-word” to wake up AI systems such as Alexa. <br />
<br />
Deep-learning-based anomaly detection methods attempt to learn features automatically but have some limitations. One approach extends classical data-modeling techniques over the learned representations, but in this case all the points may be mapped to a single representation, trivially minimizing the loss while destroying the discriminative power of the features (representation collapse). The second approach learns the salient geometric structure of the data and trains the discriminator to predict the applied transformation; the result can be considered anomalous if the discriminator fails to predict the transformation accurately.<br />
<br />
Thus, in this paper, a new approach called Deep Robust One-Class Classification (DROCC) was presented to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data and exploits the negative examples to learn a Mahalanobis distance.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include one-class SVM (Scholkopf et al., 1999) and Isolation Forest (Liu et al., 2008)[9]. One drawback of these approaches is that they involve careful feature engineering when applied to structured domains like images. The current state of the art methodologies to tackle these kinds of problems are: <br />
<br />
1. Approach based on prediction transformations (Golan & El-Yaniv, 2018; Hendrycks et al.,2019a) [1] This approach has some shortcomings in the sense that it depends heavily on an appropriate domain-specific set of transformations that are in general hard to obtain. <br />
<br />
2. Approach of minimizing a classical one-class loss on the learned final layer representations such as DeepSVDD. (Ruff et al.,2018)[2] This method suffers from the fundamental drawback of representation collapse where the model is no longer being able to accurately recognize the feature representations.<br />
<br />
3. Approach based on balancing unbalanced training data sets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers - the points not following a typical distribution. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over a low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Sch¨olkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. They broadly fall into four categories: AD via generative modeling, Deep One-Class SVM, transformation-based methods, and side-information-based AD. While these techniques are well-suited when the input is featurized appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.<br />
<br />
'''AD via Generative Modeling:''' involves deep autoencoders and GAN-based methods, which have been studied extensively. However, this approach solves a much harder problem than required, as it reconstructs the entire input during the decoding step.<br />
<br />
'''Deep One-Class SVM:''' was the first method to introduce deep one-class classification for the purpose of anomaly detection, but it is impeded by representation collapse.<br />
<br />
'''Transformation-based methods:''' are more recent methods based on self-supervised training. These methods apply transformations to the regular points and then train a classifier to identify which transformation was used. The model relies on the assumption that a point is normal if and only if the transformations applied to it can be identified. Some proposed transformations are as simple as rotations and flips, while others are handcrafted and much more complicated. The various transformations that have been proposed are heavily domain-dependent and hard to design.<br />
<br />
'''Side-information based AD:''' incorporate labelled anomalous data or out-of-distribution samples. DROCC makes no assumptions regarding access to side-information.<br />
<br />
Another related problem is the one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that would not misfire close negatives so that the false positive rate will be low. <br />
<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">'''Figure 1'''</div><br />
<br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. We observe from here that DROCC is able to capture the manifold accurately; whereas the classical methods OC-SVM and DeepSVDD perform poorly as they both try to learn a minimum enclosing ball for the whole set of positive data points. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
<br />
The algorithm that was used to train the model is laid out below in pseudocode.<br />
<center><br />
[[File:DROCCtrain.png]]<br />
</center><br />
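The descent-ascent loop can be sketched loosely in numpy for a linear scorer (a toy stand-in: the actual algorithm uses a deep network trained with autodiff, and the function name, step sizes, and initialization here are our assumptions). The ascent phase adversarially generates anomalous points in the shell <math>r \le \lVert h \rVert \le \gamma r</math> around each normal point; the descent phase updates the classifier to separate the normal points from the generated ones.<br />

```python
import numpy as np

def drocc_step(w, b, X_pos, r=1.0, gamma=2.0, ascent_steps=5,
               eta=0.1, lr=0.01):
    """One toy DROCC iteration for a linear scorer f(x) = w.x + b,
    where sigmoid(f) > 0.5 means 'normal'.
    Ascent: push perturbations h toward regions the scorer calls
    normal, projecting onto the shell r <= ||h|| <= gamma*r; the points
    x + h are then treated as anomalous. Descent: one gradient step on
    logistic loss with X_pos labeled 1 and the generated points 0."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # --- gradient ascent: adversarially generate anomalous points ---
    H = np.random.randn(*X_pos.shape) * r        # initial perturbations
    for _ in range(ascent_steps):
        H += eta * w                             # d f(x+h)/d h = w
        norms = np.linalg.norm(H, axis=1, keepdims=True)
        H *= np.clip(norms, r, gamma * r) / norms  # project onto shell
    X_neg = X_pos + H
    # --- gradient descent: separate normal from generated points ---
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(X)
    grad_b = np.mean(p - y)
    return w - lr * grad_w, b - lr * grad_b
```

The projection onto the shell is what keeps the generated negatives close to, but off, the normal manifold, so the classifier cannot collapse all representations to a single point.<br />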
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], DROCC–LF, an outlier-exposure style extension of DROCC was proposed. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
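The Mahalanobis distance mentioned above generalizes the Euclidean distance by weighting directions according to an inverse covariance-like matrix; DROCC–LF learns such a metric from the negative examples rather than assuming Euclidean geometry. A minimal sketch (the function name is ours):<br />

```python
import numpy as np

def mahalanobis(x, y, S_inv):
    """Mahalanobis distance between x and y under the positive-definite
    matrix S_inv; reduces to the Euclidean distance when S_inv is the
    identity."""
    d = x - y
    return float(np.sqrt(d @ S_inv @ d))
```

Directions with large weights in S_inv count more toward the distance, which lets the learned metric discount noisy dimensions that a plain Euclidean distance would treat equally.<br />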
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">'''Figure 2: AUC result'''</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in table 1. DROCC outperforms baselines on most classes, with gains as high as 20%, and notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">'''Figure 3: F1-Score'''</div><br />
<br />
Figure 3 shows F1-Score (with standard deviation) for one-vs-all anomaly detection on Thyroid, Arrhythmia, and Abalone datasets from the UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07 which is about an 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">'''Figure 4: Sample positives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).'''</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on the MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 as the anomaly. During the evaluation, in addition to samples from the training distribution, we also have half-zeros, which act as challenging OOD points (close negatives). These half-zeros are generated by randomly masking 50% of the pixels (Figure 4). BCE performs poorly, with a recall of only 54% at a fixed FPR of 3%. DROCC–OE gives a recall of 98.16%, outperforming DeepSAD, which gives a recall of 90.91%, by a margin of 7%. DROCC–LF provides further improvement with a recall of 99.4% at 3% FPR. <br />
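The "recall at a fixed FPR" metric used above can be computed as follows (an illustrative sketch; the function name and threshold convention are our assumptions):<br />

```python
import numpy as np

def recall_at_fpr(scores_pos, scores_neg, target_fpr=0.03):
    """Pick the score threshold so that at most target_fpr of the
    negatives are (falsely) flagged positive, then report the recall on
    the positives at that threshold."""
    scores_neg = np.sort(scores_neg)[::-1]          # highest first
    k = int(np.floor(target_fpr * len(scores_neg)))
    thresh = scores_neg[k]  # only the k highest negatives exceed this
    return float(np.mean(scores_pos > thresh))
```

Fixing the FPR makes methods comparable at the same false-alarm budget, which is the operating regime that matters for wake-word detection.<br />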
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">'''Figure 5: OCLN on Audio Commands.'''</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake-word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin”, from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. For evaluation, we sample positives from the training distribution, but the negatives contain a few challenging OOD points as well. Sampling challenging negatives is itself a hard task and is the key motivating reason for studying the problem, so we manually list keywords close to Marvin, such as Mar, Vin, and Marvelous, and generate audio snippets for these keywords via a speech synthesis tool with a variety of accents.<br />
Figure 5 shows that for the 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines; for example, at FPR = 3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword “Seven” and observed a similar trend. In summary, DROCC–LF generalizes well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold and hence can compare close points via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to OCLN problem where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both the methods perform significantly better than strong baselines, in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, say detecting signals of volcanic explosion from seismic activity data. Such challenging anomalous situations will be a test of endurance for this method and can even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., M¨uller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Sch¨olkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https: //arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 gives a good representation of how this model works. However, how do we know that this model is not prone to overfitting? There are many situations with valid points that lie outside the boundary, especially new data that the model has never seen before. An explanation of how this is avoided would be good.<br />
<br />
3. The introduction should first explain what “one class” means and then go into detailed applications. Moreover, specialized terms are used in many places in the text without detailed explanation. Finally, the future application fields of DROCC and the group's research direction could be discussed.<br />
<br />
4. The geometry of this technique (classification based on incidence outside of a ball centred at a known point <math>x</math>) sounds quite similar to K-nearest neighbors. While the authors compared DROCC to single-nearest neighbor classification, choosing a higher K would result in a stronger, more regularized model. It would be interesting to see how DROCC compares to the general KNN classifier<br />
<br />
5. This is a nice summary and the authors introduce clearly on the performance of DROCC. It is nice to use Alexa as an example to catch readers' attention. I think it will be nice to include the algorithm of the DROCC or the architecture of DROCC in this summary to help us know the whole view of this method. Maybe it will be interesting to apply DROCC in biomedical studies? since one-class classification is often used in biomedical studies.<br />
<br />
6. The training method resembles adversarial learning with gradient ascent; however, there is no evaluation of the method on adversarial examples. This is unusual considering the paper proposes a method for robust one-class classification, and such robustness gaps can be a security threat in critical real-life applications.<br />
<br />
7. The underlying idea behind OCLN is very similar to how neural networks are implemented in recommender systems and trained over positive/negative triplet models. In that case as well, due to the nature of implicit and explicit feedback, positive data tends to dominate the system. It would be interesting to see if insights from that area could be used to further boost the model presented in this paper.<br />
<br />
8. The paper shows DROCC being evaluated on time-series data. It is interesting to see high AUC scores for DROCC against baselines like nearest neighbours and REBMs, because detecting abnormal data in time-series datasets is not commonly practised.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=47396User:J46hou2020-11-28T21:07:46Z<p>Y2587wan: /* Popular Dataset Benchmark Result */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
This paper studies “one-class” classification, whose goal is to obtain accurate discriminators for a special class. A popular use of this technique is anomaly detection, which is widely used for detecting outliers. Anomaly detection is a well-studied area of research; however, the conventional approach of modeling the typical data with a simple function falls short in complex domains such as vision or speech. Another case where this is useful is recognizing a “wake-word” to wake up AI systems such as Alexa. <br />
<br />
Deep-learning-based anomaly detection methods attempt to learn features automatically but have some limitations. One approach extends classical data-modeling techniques over the learned representations, but in this case all the points may be mapped to a single representation, trivially minimizing the loss while destroying the discriminative power of the features (representation collapse). The second approach learns the salient geometric structure of the data and trains the discriminator to predict the applied transformation; the result can be considered anomalous if the discriminator fails to predict the transformation accurately.<br />
<br />
Thus, in this paper, a new approach called Deep Robust One-Class Classification (DROCC) was presented to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data and exploits the negative examples to learn a Mahalanobis distance.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include the one-class SVM (Schölkopf et al., 1999) and Isolation Forest (Liu et al., 2008) [9]. One drawback of these approaches is that they require careful feature engineering when applied to structured domains like images. The current state-of-the-art methodologies for these kinds of problems are: <br />
<br />
1. Approaches based on predicting transformations (Golan & El-Yaniv, 2018; Hendrycks et al., 2019a) [1]. This approach depends heavily on an appropriate domain-specific set of transformations, which is in general hard to obtain. <br />
<br />
2. Approaches that minimize a classical one-class loss on the learned final-layer representations, such as DeepSVDD (Ruff et al., 2018) [2]. This method suffers from the fundamental drawback of representation collapse, where the model is no longer able to produce discriminative feature representations.<br />
<br />
3. Approaches that balance unbalanced training datasets using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers, i.e., the points that do not follow the typical distribution. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over a low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum enclosing ball around the typical data points (Tax & Duin, 2004) [5]. Existing methods broadly fall into four categories: AD via generative modeling, Deep One-Class SVM, transformation-based methods, and side-information-based AD. While these techniques are well suited when the input is featurized appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.<br />
<br />
'''AD via Generative Modeling:''' involves deep autoencoders and GAN-based methods, which have been studied in depth. However, this class of methods solves a much harder problem than required, since it reconstructs the entire input during the decoding step.<br />
<br />
'''Deep One-Class SVM:''' was the first method to introduce deep one-class classification for the purpose of anomaly detection, but it is impeded by representation collapse.<br />
<br />
'''Transformation-based methods:''' are more recent methods based on self-supervised training. The training process applies transformations to the normal points and then trains a classifier to identify which transformation was used. The model relies on the assumption that a point is normal if and only if the transformations applied to it can be identified. Some proposed transformations are as simple as rotations and flips; others are handcrafted and much more complicated. The various transformations that have been proposed are heavily domain-dependent and hard to design.<br />
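As one concrete instance of such a transformation set, the four 90-degree rotations of an image can serve as self-supervised labels (a minimal sketch of the idea, not the papers' implementations):<br />

```python
import numpy as np

def rotation_views(img):
    # The four 90-degree rotations of an image; a classifier is trained
    # to predict k from the rotated view. At test time, a point whose
    # applied rotation the classifier cannot identify is flagged anomalous.
    return [np.rot90(img, k) for k in range(4)]

img = np.arange(9).reshape(3, 3)
views = rotation_views(img)
print(len(views))                        # 4
print(np.array_equal(views[0], img))     # True: k = 0 is the identity
```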
<br />
'''Side-information-based AD:''' incorporates labelled anomalous data or out-of-distribution samples. DROCC makes no assumptions regarding access to side information.<br />
<br />
Another related problem is one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that does not misfire on close negatives, so that the false-positive rate stays low. <br />
<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">'''Figure 1'''</div><br />
<br />
(a): A normal data manifold, with red dots representing generated anomalous points in <math>N_i(r)</math>. <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. We observe from here that DROCC is able to capture the manifold accurately; whereas the classical methods OC-SVM and DeepSVDD perform poorly as they both try to learn a minimum enclosing ball for the whole set of positive data points. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
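The decision rule described above can be sketched directly (a simplified NumPy version that operates in input space rather than on learned representations; the radius <code>r</code> is an assumed hyperparameter):<br />

```python
import numpy as np

def is_anomalous(x, normal_points, r):
    # Flag x as anomalous iff it lies outside the union of L2 balls of
    # radius r centred on the typical (normal) training points.
    dists = np.linalg.norm(normal_points - x, axis=1)
    return bool(dists.min() > r)

normal = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])   # toy manifold
print(is_anomalous(np.array([1.1, 0.0]), normal, r=0.5))  # False: on-manifold
print(is_anomalous(np.array([1.0, 3.0]), normal, r=0.5))  # True: far away
```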
<br />
The algorithm that was used to train the model is laid out below in pseudocode.<br />
<center><br />
[[File:DROCCtrain.png]]<br />
</center><br />
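A minimal NumPy rendition of this descent–ascent loop is sketched below. It is a toy stand-in for the pseudocode above: the deep network is replaced by logistic regression on hand-picked quadratic features, and the radii and learning rates are illustrative assumptions, not values from the paper.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "normal" data lying on a 1-D manifold (the segment y = 0) in 2-D.
X = np.stack([np.linspace(-1.0, 1.0, 200), np.zeros(200)], axis=1)

def feats(P):
    # Hand-picked quadratic features so a linear scorer can carve a
    # tube around the manifold (a stand-in for the learned network).
    return np.stack([P[:, 0], P[:, 1], P[:, 0] ** 2, P[:, 1] ** 2], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(4), 0.0
r, gamma = 0.3, 2.0                  # shell radii (illustrative values)
lr, ascent_lr, steps = 0.5, 0.1, 300

for _ in range(steps):
    # Gradient-ascent phase: perturb normal points towards a higher
    # "normal" score, then project into the shell r <= ||h|| <= gamma*r.
    h = rng.normal(scale=r, size=X.shape)
    for _ in range(5):
        Z = X + h
        g = np.stack([w[0] + 2 * w[2] * Z[:, 0],
                      w[1] + 2 * w[3] * Z[:, 1]], axis=1)
        h = h + ascent_lr * g
        norms = np.linalg.norm(h, axis=1, keepdims=True)
        h = h / np.maximum(norms, 1e-12) * np.clip(norms, r, gamma * r)

    # Gradient-descent phase: logistic loss with label 1 for the normal
    # points and label 0 for the generated anomalous points.
    P = np.vstack([X, X + h])
    y = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    F = feats(P)
    p = sigmoid(F @ w + b)
    w -= lr * F.T @ (p - y) / len(y)
    b -= lr * float(np.mean(p - y))

def score(pt):
    return float((feats(pt[None, :]) @ w)[0] + b)

# After training, an on-manifold point outscores a distant one.
print(score(np.array([0.0, 0.0])) > score(np.array([0.0, 2.0])))
```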
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], DROCC–LF, an outlier-exposure style extension of DROCC was proposed. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
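The difference between the Euclidean and a Mahalanobis metric can be illustrated as follows (the matrix here is hand-picked for illustration, whereas DROCC–LF learns it from the negative examples):<br />

```python
import numpy as np

def mahalanobis(x, y, M):
    # d_M(x, y) = sqrt((x - y)^T M (x - y)); M = I recovers the
    # ordinary Euclidean distance.
    d = x - y
    return float(np.sqrt(d @ M @ d))

x, y = np.array([0.0, 0.0]), np.array([0.0, 1.0])
print(mahalanobis(x, y, np.eye(2)))             # 1.0 (Euclidean)
print(mahalanobis(x, y, np.diag([1.0, 0.01])))  # 0.1: direction down-weighted
```

Down-weighting directions in which the normal data varies little lets the metric compare points along the manifold more faithfully than the Euclidean distance.<br />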
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">'''Figure 2: AUC result'''</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in Figure 2. DROCC outperforms baselines on most classes, with gains as high as 20%; notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
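For reference, the AUC metric reported here has a simple pairwise interpretation: the probability that a randomly chosen positive outscores a randomly chosen negative (a generic sketch, not the paper's evaluation code):<br />

```python
import numpy as np

def auc(pos_scores, neg_scores):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as one half. O(n*m) pairwise version for clarity.
    p = np.asarray(pos_scores)[:, None]
    n = np.asarray(neg_scores)[None, :]
    return float((p > n).mean() + 0.5 * (p == n).mean())

print(auc([0.9, 0.8, 0.7], [0.1, 0.2]))   # 1.0: perfect separation
print(auc([0.5, 0.5], [0.5, 0.5]))        # 0.5: chance level
```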
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">'''Figure 3: F1-Score'''</div><br />
<br />
Figure 3 shows the F1-score (with standard deviation) for one-vs-all anomaly detection on the Thyroid, Arrhythmia, and Abalone datasets from the UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07, which is about an 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">'''Figure 4: Sample positives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).'''</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on the MNIST dataset, where the training data consists of Digit 0, the normal class, and Digit 1 acts as the anomaly. During evaluation, in addition to samples from the training distribution, we also have half-zeros, which act as challenging OOD points (close negatives). These half-zeros are generated by randomly masking 50% of the pixels (Figure 4). BCE performs poorly, with a recall of only 54% at a fixed FPR of 3%. DROCC–OE gives a recall of 98.16%, outperforming DeepSAD, which gives a recall of 90.91%, by a margin of 7%. DROCC–LF provides a further improvement, with a recall of 99.4% at 3% FPR. <br />
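The "recall at a fixed FPR" metric used here can be computed by thresholding on the negatives' scores (a generic sketch with made-up scores; it assumes <code>fpr</code> is below 1):<br />

```python
import numpy as np

def recall_at_fpr(pos_scores, neg_scores, fpr=0.03):
    # Pick the threshold under which at most a `fpr` fraction of the
    # negatives scores higher, then report recall on the positives.
    neg = np.sort(np.asarray(neg_scores))[::-1]
    k = int(np.floor(fpr * len(neg)))   # negatives allowed above threshold
    thresh = neg[k]
    return float((np.asarray(pos_scores) > thresh).mean())

rng = np.random.default_rng(0)
pos = rng.normal(2.0, 1.0, size=1000)   # positives score higher on average
neg = rng.normal(0.0, 1.0, size=1000)
print(recall_at_fpr(pos, neg, fpr=0.03))
```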
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">'''Figure 5: OCLN on Audio Commands.'''</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake-word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin”, from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. For evaluation, however, we sample positives from the training distribution, while the negatives also contain a few challenging OOD points. Sampling challenging negatives is itself a hard task and is the key motivation for studying the problem. So, we manually list keywords close to Marvin, such as Mar, Vin, Marvelous, etc., and then generate audio snippets for these keywords via a speech synthesis tool with a variety of accents.<br />
Figure 5 shows that for the 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, at FPR = 3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword “Seven” and observed a similar trend. In summary, DROCC–LF generalizes well against negatives that are “close” to the true positives, even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced the DROCC method for deep anomaly detection. It models normal data points as lying on a low-dimensional manifold and hence can compare close points via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle-point problem, which is solved via a standard gradient descent–ascent algorithm. We then extended DROCC to the OCLN problem, where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both methods perform significantly better than strong baselines in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, say detecting signals of volcanic explosion from seismic activity data. Such challenging anomalous situations will be a test of endurance for this method and can even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https://arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 gives a good representation of how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where valid points lie outside of the boundary, especially new data that the model has never seen before. An explanation of how this is avoided would be good.<br />
<br />
3. The introduction should first explain what "one class" means and then give a detailed application. Moreover, specialized terms are used in many places in the text without detailed explanation. Finally, the future application fields of DROCC and the research direction of the group could be explained.<br />
<br />
4. The geometry of this technique (classification based on incidence outside of a ball centred at a known point <math>x</math>) sounds quite similar to K-nearest neighbors. While the authors compared DROCC to single-nearest-neighbor classification, choosing a higher K would result in a stronger, more regularized model. It would be interesting to see how DROCC compares to the general KNN classifier.<br />
<br />
5. This is a nice summary, and the authors clearly present the performance of DROCC. Using Alexa as an example is a good way to catch readers' attention. It would be nice to include the algorithm or the architecture of DROCC in this summary to give a whole view of the method. It might also be interesting to apply DROCC in biomedical studies, since one-class classification is often used there.<br />
<br />
6. The training method resembles adversarial learning with gradient ascent; however, there is no evaluation of this method on adversarial examples. This is unusual considering the paper proposes a method for robust one-class classification, and this gap could be a security threat in safety-critical real-life applications.<br />
<br />
7. The underlying idea behind OCLN is very similar to how neural networks are implemented in recommender systems and trained over positive/negative triplet models. In that case as well, due to the nature of implicit and explicit feedback, positive data tends to dominate the system. It would be interesting to see if insights from that area could be used to further boost the model presented in this paper.<br />
<br />
8. The paper shows the performance of DROCC being evaluated on time-series data. It is interesting to see high AUC scores for DROCC against baselines like nearest neighbours and REBMs, since detecting anomalies in time-series datasets is not a common practice.</div>
<br />
5. This is a nice summary and the authors introduce clearly on the performance of DROCC. It is nice to use Alexa as an example to catch readers' attention. I think it will be nice to include the algorithm of the DROCC or the architecture of DROCC in this summary to help us know the whole view of this method. Maybe it will be interesting to apply DROCC in biomedical studies? since one-class classification is often used in biomedical studies.<br />
<br />
6. The training method resembles adversarial learning with gradient ascent, however, there is no evaluation of this method on adversarial examples. This is quite unusual considering the paper proposed a method for robust one-class classification, and can be a security threat in real life in critical applications.<br />
<br />
7. The underlying idea behind OCLN is very similar to how neural networks are implemented in recommender systems and trained over positive/negative triplet models. In that case as well, due to the nature of implicit and explicit feedback, positive data tends to dominate the system. It would be interesting to see if insights from that area could be used to further boost the model presented in this paper.<br />
<br />
8. The paper shows the performance of DROCC being evaluated for time series data. It is interesting to see high AUC scores for DROCC against baselines like nearest neighbours and REBMs.Because detecting abnormal data in time series datasets is not common to practise.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=47393User:J46hou2020-11-28T21:04:41Z<p>Y2587wan: /* Previous Work */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this paper, “one-class” classification is studied: the goal is to obtain an accurate discriminator for a single class of interest. A popular use of this technique is anomaly detection, i.e., identifying outliers. Anomaly detection is a well-studied area of research; however, the conventional approach of modeling typical data with a simple function falls short in complex domains such as vision or speech. Another case where one-class classification is useful is recognizing the “wake word” that activates AI systems such as Alexa. <br />
<br />
Deep learning based anomaly detection methods attempt to learn features automatically, but they have some limitations. One approach extends classical data modeling techniques over the learned representations, but in this case all the points may be mapped to a single representation, which makes the model look "perfect" while it actually discriminates nothing. The second approach learns the salient geometric structure of the data and trains a discriminator to predict an applied transformation. The result is considered anomalous if the discriminator fails to predict the transformation accurately.<br />
<br />
Thus, in this paper, a new approach called Deep Robust One-Class Classification (DROCC) was presented to solve the above concerns. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. More specifically, we are presenting DROCC-LF which is an outlier-exposure style extension of DROCC. This extension combines the DROCC's anomaly detection loss with standard classification loss over the negative data and exploits the negative examples to learn a Mahalanobis distance.<br />
<br />
== Previous Work ==<br />
Traditional approaches for one-class problems include the one-class SVM (Schölkopf et al., 1999) [4] and Isolation Forest (Liu et al., 2008) [9]. One drawback of these approaches is that they require careful feature engineering when applied to structured domains like images. The current state-of-the-art methodologies for tackling these problems are: <br />
<br />
1. Approaches based on predicting applied transformations (Golan & El-Yaniv, 2018; Hendrycks et al., 2019a) [1]. These have the shortcoming that they depend heavily on an appropriate domain-specific set of transformations, which are in general hard to obtain. <br />
<br />
2. Approaches that minimize a classical one-class loss on the learned final-layer representations, such as DeepSVDD (Ruff et al., 2018) [2]. This method suffers from the fundamental drawback of representation collapse, in which all inputs are mapped to nearly the same representation, so the model can no longer discriminate between normal and anomalous points.<br />
<br />
3. Approaches that rebalance the training data, e.g., using methods such as SMOTE to synthetically create outlier data to train models on.<br />
<br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. The goal is to identify the outliers - the points not following a typical distribution. <br />
[[File:abnormal.jpeg | thumb | center | 1000px | Abnormal Data (Data Driven Investor, 2020)]]<br />
Classical approaches for anomaly detection are based on modeling the typical data using simple functions over a low-dimensional subspace or a tree-structured partition of the input space to detect anomalies (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], for example by constructing a minimum enclosing ball around the typical data points (Tax & Duin, 2004) [5]. Anomaly detection methods broadly fall into four categories: AD via generative modeling, deep one-class SVM, transformation-based methods, and side-information-based AD. While these techniques are well suited when the input is featurized appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.<br />
<br />
'''AD via Generative Modeling:''' involves deep autoencoders and GAN-based methods and has been studied extensively. However, this approach solves a much harder problem than required, since it reconstructs the entire input during the decoding step.<br />
<br />
'''Deep One-Class SVM:''' was the first method to introduce deep one-class classification for the purpose of anomaly detection, but it is impeded by representation collapse.<br />
<br />
'''Transformation-based methods:''' are more recent methods based on self-supervised training. Transformations are applied to the normal points, and a classifier is then trained to identify which transformation was used. The model relies on the assumption that a point is normal if and only if the transformations applied to it can be correctly identified. Proposed transformations range from simple rotations and flips to much more complicated handcrafted ones; they are heavily domain-dependent and hard to design.<br />
<br />
'''Side-information based AD:''' incorporates labelled anomalous data or out-of-distribution samples. DROCC makes no assumptions regarding access to side-information.<br />
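The transformation-prediction idea above can be sketched with a toy example (an illustrative simplification, not any specific paper's code): a small set of rotations is applied to an input, each copy is labeled with the rotation index, and a classifier trained to recover the index will fail on anomalous inputs.<br />

```python
import numpy as np

# Toy sketch of self-supervised transformation prediction: each training
# image yields k transformed copies labeled by the transformation index.
img = np.arange(16).reshape(4, 4)              # stand-in for an image
pairs = [(np.rot90(img, k), k) for k in range(4)]

# At test time, a point is flagged as anomalous if the trained classifier
# cannot recover k from the transformed copy.
assert len(pairs) == 4
assert np.array_equal(np.rot90(pairs[1][0], -1), img)  # rotation is invertible
```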
<br />
Another related problem is the one-class classification under limited negatives (OCLN). In this case, only a few negative samples are available. The goal is to find a classifier that would not misfire close negatives so that the false positive rate will be low. <br />
<br />
DROCC is robust to representation collapse because it includes a discriminative component; it is general and empirically accurate on most standard domains, such as tabular data, time series, and vision, without requiring any additional side-information. DROCC is motivated by the key observation that, generally, the typical data lies on a low-dimensional manifold that is well sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">'''Figure 1'''</div><br />
<br />
(a): A normal data manifold with red dots representing generated anomalous points in Ni(r). <br />
<br />
(b): Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. We observe from here that DROCC is able to capture the manifold accurately; whereas the classical methods OC-SVM and DeepSVDD perform poorly as they both try to learn a minimum enclosing ball for the whole set of positive data points. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF, when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal while DROCC’s decision boundary is inaccurate. Yellow color sine wave depicts the train data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the training typical points (See Figure 1a, 1b for an illustration). Importantly, the above definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase to adaptively add anomalous points to our training set and a gradient descent phase to minimize the classification loss by learning a representation and a classifier on top of the representations to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to a representation collapse as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
<br />
The algorithm that was used to train the model is laid out below in pseudocode.<br />
<center><br />
[[File:DROCCtrain.png]]<br />
</center><br />
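The gradient-ascent phase described above can be illustrated with a minimal numpy sketch. This is a toy simplification (the network is replaced by a fixed quadratic scorer, and all hyperparameter values are made up), not the authors' implementation: starting near a normal point, the candidate is pushed toward normal-looking scores and repeatedly projected back onto the shell of radii between r and γr around that point, yielding a hard negative.<br />

```python
import numpy as np

np.random.seed(0)

def score_grad(h):
    # gradient of a toy anomaly score f(h) = ||h||^2 - 1 (hypothetical scorer)
    return 2.0 * h

def generate_adversarial(x, r=0.5, gamma=2.0, steps=10, lr=0.1):
    """Gradient-ascent phase of DROCC (sketch): find a hard negative that
    looks normal to the scorer yet lies in the shell r <= ||h - x|| <= gamma*r."""
    h = x + np.random.uniform(-r, r, size=x.shape)
    for _ in range(steps):
        h = h - lr * score_grad(h)               # make h score as "normal" as possible
        d = h - x
        norm = max(np.linalg.norm(d), 1e-12)
        h = x + d * (np.clip(norm, r, gamma * r) / norm)  # project onto the shell
    return h

x = np.zeros(3)                                  # a "normal" training point
h = generate_adversarial(x)
dist = np.linalg.norm(h - x)                     # guaranteed to lie in [r, gamma*r]
```

In the full algorithm this generation step alternates with a gradient-descent step that updates the network to classify such points as anomalous.<br />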
<br />
== DROCC-LF ==<br />
To especially tackle problems such as anomaly detection and outlier exposure (Hendrycks et al., 2019a) [7], DROCC–LF, an outlier-exposure style extension of DROCC was proposed. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (that is over only the positive data points) with standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance to compare points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively fewer samples. (See Figure 1c, 1d for illustration)<br />
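The difference between the two distances can be sketched as follows; the inverse-covariance matrix here is a made-up diagonal metric standing in for the metric DROCC–LF learns from the negatives.<br />

```python
import numpy as np

def euclidean(x, c):
    return float(np.linalg.norm(x - c))

def mahalanobis(x, c, sigma_inv):
    # distance under a learned metric; sigma_inv is a hypothetical stand-in
    d = x - c
    return float(np.sqrt(d @ sigma_inv @ d))

c = np.zeros(2)
x = np.array([1.0, 1.0])
sigma_inv = np.diag([1.0, 0.01])       # down-weight a noisy coordinate
d_euc = euclidean(x, c)                # sqrt(2): both axes count equally
d_mah = mahalanobis(x, c, sigma_inv)   # sqrt(1.01): the noisy axis barely counts
```

A point that looks far away in Euclidean distance because of a high-variance, uninformative coordinate can still be close under the learned metric, which is the motivation for using it on high-dimensional data with few samples.<br />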
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">'''Figure 2: AUC result'''</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class: 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class; the training batches contain the remaining images in random order, so an individual training batch may contain more images from one class than another, but together they contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in Figure 2 (Table 1 of the paper). DROCC outperforms the baselines on most classes, with gains as high as 20%; notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
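The AUC metric used throughout these comparisons can be computed directly from anomaly scores via the Mann-Whitney formulation, for example:<br />

```python
def auc(scores, labels):
    """AUC = probability that a random anomaly (label 1) scores higher
    than a random normal point (label 0), with ties counted as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# one anomaly scored below one normal point: 3 of 4 pairs correctly ordered
a = auc([0.9, 0.6, 0.5, 0.2], [1, 0, 1, 0])   # 0.75
```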
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">'''Figure 3: F1-Score'''</div><br />
<br />
Figure 3 (Table 2 of the paper) shows the F1-score (with standard deviation) for one-vs-all anomaly detection on the Thyroid, Arrhythmia, and Abalone datasets from the UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07, which is about an 11.5% performance increase.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">'''Figure 4: Sample positives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).'''</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on the MNIST dataset, where the training data consists of Digit 0 (the normal class), with Digit 1 as the anomaly. During evaluation, in addition to samples from the training distribution, we also include half-zeros, which act as challenging OOD points (close negatives). These half-zeros are generated by randomly masking 50% of the pixels (Figure 4). BCE performs poorly, with a recall of only 54% at a fixed FPR of 3%. DROCC–OE gives a recall of 98.16%, outperforming DeepSAD (recall 90.91%) by a margin of 7%. DROCC–LF provides a further improvement, with a recall of 99.4% at 3% FPR. <br />
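The half-zeros used as close negatives can be sketched with a generic random-masking step (an illustration of the idea, not the authors' exact preprocessing):<br />

```python
import numpy as np

np.random.seed(0)
digit = np.ones((28, 28))                 # stand-in for an MNIST "0" image
mask = np.random.rand(28, 28) < 0.5       # select ~50% of the pixels
half_zero = digit.copy()
half_zero[mask] = 0.0                     # masked pixels are blanked out

frac_masked = 1.0 - half_zero.mean()      # roughly half the pixels removed
```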
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">'''Figure 5: OCLN on Audio Commands.'''</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake-word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin”, from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. For evaluation, we sample positives from the training distribution, but the negatives also contain a few challenging OOD points. Sampling challenging negatives is itself a hard task and is the key motivation for studying the problem, so we manually list keywords close to “Marvin”, such as Mar, Vin, and Marvelous, and generate audio snippets for these keywords via a speech synthesis tool with a variety of accents.<br />
Figure 5 shows that for the 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, at FPR = 3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword “Seven” and observed a similar trend. In summary, DROCC–LF generalizes well against negatives that are “close” to the true positives, even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced the DROCC method for deep anomaly detection. It models normal data points as lying on a low-dimensional manifold, so close points can be compared via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle-point problem, which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to the OCLN problem, where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both methods perform significantly better than strong baselines in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set of both methods, which may slow down their convergence. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high-impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, say detecting signals of volcanic explosion from seismic activity data. Such challenging anomalous situations will be a test of endurance for this method and can even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https://arxiv.org/abs/1804.03209.<br />
<br />
[9]: Liu, F. T., Ting, K. M., and Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self-driving cars, for instance, to detect unusual road conditions.<br />
<br />
2. Figure 1 gives a good representation of how this model works. However, how do we know that this model is not prone to overfitting? There are many situations where valid points lie outside of the boundary, especially for new data that the model has never seen before. An explanation of how this is avoided would be good.<br />
<br />
3. The introduction should first explain what "one class" means and then describe the applications in detail. Moreover, specialized terms are used in many places in the text without detailed explanation. Finally, the future application fields of DROCC and the group's research direction could be described.<br />
<br />
4. The geometry of this technique (classification based on incidence outside of a ball centred at a known point <math>x</math>) sounds quite similar to K-nearest neighbors. While the authors compared DROCC to single-nearest-neighbor classification, choosing a higher K would result in a stronger, more regularized model. It would be interesting to see how DROCC compares to the general KNN classifier.<br />
<br />
5. This is a nice summary, and the authors clearly present the performance of DROCC. Using Alexa as an example nicely catches the reader's attention. It would be helpful to include the algorithm or the architecture of DROCC in this summary to give a complete view of the method. It may also be interesting to apply DROCC in biomedical studies, since one-class classification is often used there.<br />
<br />
6. The training method resembles adversarial learning with gradient ascent; however, there is no evaluation of this method on adversarial examples. This is quite unusual considering the paper proposes a method for robust one-class classification, and such a gap can be a security threat in critical real-life applications.<br />
<br />
7. The underlying idea behind OCLN is very similar to how neural networks are implemented in recommender systems and trained over positive/negative triplet models. In that case as well, due to the nature of implicit and explicit feedback, positive data tends to dominate the system. It would be interesting to see if insights from that area could be used to further boost the model presented in this paper.<br />
<br />
8. The paper shows the performance of DROCC being evaluated on time-series data. It is interesting to see high AUC scores for DROCC against baselines like nearest neighbours and REBMs, because detecting abnormal data in time-series datasets is not common in practice.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=47389User:T358wang2020-11-28T20:59:12Z<p>Y2587wan: /* Critique */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper presents a new and effective method for recognizing landmark images that has been successfully applied to real-world images; with it, statues, buildings, and other characteristic objects can be effectively identified.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
'''1.''' The concept of landmarks is not strictly defined. Landmarks can take various forms including objects and buildings.<br />
<br />
'''2.''' The same landmark can be photographed from different angles. Certain angles may capture the interior of a building as opposed to its exterior. This could result in vastly different picture characteristics between angles. A good model should accurately identify landmarks regardless of perspective.<br />
<br />
'''3.''' The dataset is unbalanced. The majority of objects fall into the single class of "not landmarks", while relatively few images exist for each class of landmark. Hence, it will be challenging to obtain both a low false positive rate as well as a high recognition accuracy between classes of landmarks.<br />
<br />
There are also three potential problems:<br />
<br />
'''1.''' The dataset contains some erroneous and noisy images, and its size is huge.<br />
<br />
'''2.''' The training algorithm must be fast and scalable.<br />
<br />
'''3.''' The system must deliver high-quality landmark predictions without relying on geographic information attached to the images.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference procedure. In testing, this model obtained metrics similar to those of the state-of-the-art model while inference was 15 times faster. Furthermore, because of the efficient architecture, the system can serve in an online fashion. The effectiveness of the model is demonstrated through quantitative experiments and an analysis of the deployed system.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as an image retrieval task, and a large body of literature concentrates on image retrieval. Over the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first category consists of classic retrieval methods using local features: methods based on local feature descriptors organized in a bag of words (a simplified representation that keeps only the salient elements of the input while disregarding their arrangement, commonly used as features in classification tasks), spatial verification, Hamming embedding, and query expansion. These methods dominated image retrieval until the rise of deep convolutional neural networks (CNNs), which are now used to generate global descriptors of input images.<br />
<br />
Another line of work extends Hamming embedding with selective match kernels. With the advent of deep convolutional neural networks, the most effective image retrieval methods are based on training CNNs for the specific task. Deep networks are very powerful for semantic feature representation, which allows us to use them effectively for landmark recognition. This approach shows good results but brings additional memory and complexity costs. <br />
The DELF (DEep Local Feature) method by Noh et al. showed promising results. It combines the classic local feature approach with deep learning: local features are extracted from the input image and then verified geometrically with RANSAC. Random Sample Consensus (RANSAC) is a method for fitting models to data containing a significant percentage of gross errors, which makes it well suited to automated image analysis, where interpretation is based on the output of error-prone feature detectors. The goal of this project is to describe a method for accurate and fast large-scale landmark recognition that takes advantage of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts: the main network, the embedding layer, and the classification layer. To obtain a CNN backbone suitable for training the landmark recognition model, fine-tuning is applied, and several pre-trained backbones (residual networks), including ResNet-50, ResNet-200, SE-ResNeXt-101, and Wide Residual Network (WRN-50-2), are evaluated based on inference quality and efficiency. Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture. Fine-tuning is a very efficient technique in various computer vision applications because it takes advantage of everything the model has already learned and applies it to the specific task at hand.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
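At the level of tensor shapes, the head described above can be sketched as follows. This is a shape-only numpy illustration (batch normalization omitted, random weights); the class count assumes the final database of 11381 landmarks plus one non-landmark class.<br />

```python
import numpy as np

np.random.seed(0)
n_classes = 11381 + 1                            # landmarks + non-landmark class
W_embed = 0.01 * np.random.randn(2048, 512)      # 2048 x 512 embedding layer
W_cls = 0.01 * np.random.randn(512, n_classes)   # 512 x n classification layer

pooled = np.random.randn(4, 2048)                # batch of 4 average-pooled features
embeddings = pooled @ W_embed                    # (4, 512); batch-normalized in the paper
logits = embeddings @ W_cls                      # (4, 11382)
```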
<br />
To effectively determine the embedding vectors for each landmark class (centroids), the network needs to be trained to have the members of each class to be as close as possible to the centroids. Several suitable loss functions are evaluated including Contrastive Loss, Arcface, and Center loss. The center loss is selected since it achieves the optimal test results and it trains a center of embeddings of each class and penalizes distances between image embeddings as well as their class centers. In addition, the center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, a new additional class that includes all non-landmark instances needs to be added and the center loss function needs to be modified as follows: Let n be the number of landmark classes, m be the mini-batch size, <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> is the corresponding label where <math>y_i \in</math> {1,...,n,n+1}, n+1 is the label of the non-landmark class. Denote <math>W \in R^{d \times n}</math> as the weights of the classifier layer, <math>W_j</math> as its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math> th embeddings center from Center loss and <math>\lambda</math> be the balancing parameter of Center loss. Then the final loss function will be: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
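Numerically, the combined objective (softmax cross-entropy plus the center-loss penalty <math>\frac{\lambda}{2} \|x_i - c_{y_i}\|^2</math>) can be sketched for a single example as follows; all values here are illustrative, not from the paper.<br />

```python
import numpy as np

def softmax_ce(logits, y):
    z = logits - logits.max()                    # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y])

def total_loss(embedding, logits, y, centers, lam=5e-5):
    # center-loss term pulls the embedding toward its class center c_y
    center_term = 0.5 * lam * np.sum((embedding - centers[y]) ** 2)
    return softmax_ce(logits, y) + center_term

emb = np.array([1.0, 2.0])                       # made-up 2-d "embedding"
centers = np.array([[0.0, 0.0], [1.0, 2.0]])     # one center per class
logits = np.array([2.0, 0.5])
loss = float(total_loss(emb, logits, y=0, centers=centers))
```

In training, the centers themselves are also updated, so the penalty stays meaningful as embeddings move.<br />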
<br />
In the training procedure, the stochastic gradient descent(SGD) will be used as the optimizer with momentum=0.9 and weight decay = 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256 and several data augmentations are applied to the dataset including random resized crop, color jitter, and random flip. The training dataset is divided into four parts based on the geographical affiliation of cities where landmarks are located: Europe/Russia, North America/Australia/Oceania, Middle East/North Africa, and the Far East Regions. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
<br />
In the inference phase, the paper introduces the term “centroids”: embedding vectors calculated by averaging embeddings, used to describe landmark classes. Calculating the centroids well is significant for effectively determining whether a query image contains a landmark. The paper proposes two approaches to improve the centroids. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements. For example, if the landmark of interest is a palace located on a city square, then images of a similar building on the same square may be included in the data, which can distort the centroids. Second, since each landmark can be photographed from different shooting angles, it is more effective to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition the training data into several valid clusters for each landmark, and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i, j \in 1,...,v</math> where v is the number of valid clusters for landmark L, and v=1 if there are no valid clusters for L. <br />
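The per-cluster centroid computation <math>\mu_{l_j}</math> can be sketched directly; the clusters are hand-assigned in this toy example, whereas the paper obtains them via hierarchical agglomerative clustering.<br />

```python
import numpy as np

def centroids(embeddings, cluster_ids):
    # mean embedding per cluster: mu_j = (1/|C_j|) * sum_{i in C_j} x_i
    return {int(j): embeddings[cluster_ids == j].mean(axis=0)
            for j in np.unique(cluster_ids)}

emb = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])   # toy embeddings
cl = np.array([0, 0, 1])                                  # two shooting-angle clusters
mu = centroids(emb, cl)   # mu[0] = mean of first two rows, mu[1] = last row
```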
<br />
Once the centroids are calculated for each landmark class, the system can decide whether an image contains any landmark. The query image is passed through the landmark CNN, and the resulting embedding vector is compared with all centroids by dot-product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmarks from non-landmarks, a threshold <math>\eta</math> is set and compared with the maximum similarity to decide whether the image contains any landmark.<br />
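The decision rule just described, maximum dot-product similarity against all centroids thresholded by <math>\eta</math>, can be sketched as follows; brute-force search stands in here for the AKNN index the paper uses:<br />

```python
def recognize(query, centroids, eta):
    """Return the best-matching landmark label, or None for non-landmark.

    query: embedding vector; centroids: dict mapping labels to centroid
    vectors; eta: similarity threshold separating landmarks from non-landmarks.
    """
    best_label, best_sim = None, float("-inf")
    for label, mu in centroids.items():
        sim = sum(q * c for q, c in zip(query, mu))  # dot-product similarity
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= eta else None
```
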
<br />
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
<br />
We will now look at how the landmark database was created. The collection process was structured by country, city, and landmark. The world was divided into several regions: Europe, America, Middle East, Africa, Far East, Australia, and Oceania. Within each region, cities containing many significant landmarks were selected, and some natural landmarks were filtered out because they are difficult to distinguish. Once the cities and landmarks were selected, both images and metadata were collected for each landmark.<br />
<br />
[[File:landmarkcleaning.png | center | 400px]]<br />
<br />
After forming the database, it had to be cleaned before it could be used to train the CNN. First, for each landmark, any redundant images were removed. Then, for each landmark, 5 images with a high probability of containing the landmark were picked and checked manually. The database was then cleaned in parts using the curriculum learning process, described further in the pseudocode above. The final database contained 11,381 landmarks across 503 cities and 70 countries, with 2,331,784 landmark images and 900,000 non-landmark images. Landmarks with fewer than 100 images are called "rare".<br />
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1–3% of the total number of photos on average. This distribution was emulated in the offline test, and geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used: Sensitivity, the accuracy of the model on images with landmarks (also called Recall), and Specificity, the accuracy of the model on images without landmarks. Several variants of DELF were evaluated, and the best results in terms of sensitivity and specificity are included in the table. The table also contains the results of the model trained with Softmax loss only and with Softmax + Center loss, so it reflects the improvement contributed by each element added to the approach.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
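The two metrics above are straightforward to compute; a minimal pure-Python sketch, with `True` meaning "contains a landmark":<br />

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity: accuracy on landmark images (recall).
    Specificity: accuracy on non-landmark images.

    y_true, y_pred: parallel lists of booleans (landmark present or not).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    return tp / pos, tn / neg
```
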
<br />
It is important to understand how the model works on “rare” landmarks, given the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows the percentage of landmark examples in the offline test that belong to each type. The sensitivity on “frequent” landmarks is much higher than on “rare” ones.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
An analysis of the model's behavior on different categories of landmarks in the offline test is presented in the table below. These results show that the model works successfully across various categories of landmarks. Predictably, better results (92% sensitivity and 99.5% specificity) were obtained when the offline test was run with geo-information.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
The Revisited Paris dataset (RPar) [2] was also used to measure the quality of the landmark recognition approach. This dataset, together with Revisited Oxford (ROxf), is a standard benchmark for comparing image retrieval algorithms. For recognition, what matters is determining which landmark is contained in the query image. Images of the same landmark can be taken from different shooting angles or inside/outside the building, so it is reasonable to measure the model on this standard benchmark while adapting it to the task setting: not all classes from the queries are present in the landmark dataset. Images containing the correct landmark but taken from different shooting angles within the building were transferred to the “junk” category, which does not influence the final score and makes the test markup closer to the model's goal. Results on RPar with and without distractors in medium and hard modes are presented in the tables below. <br />
<br />
<div style="text-align:center;"> '''Revisited Paris Medium''' </div><br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
<br />
<div style="text-align:center;"> '''Revisited Paris Hard''' </div><br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
The most efficient recent approaches to landmark recognition are built on fine-tuned CNNs. We chose to compare our method to DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity & specificity table), but the proposed approach is much faster at the inference stage and more memory-efficient.<br />
<br />
In more detail, during inference DELF needs more forward passes through the CNN, has to search the entire database, and performs RANSAC for geometric verification, all of which make it much more time-consuming than the proposed approach. Our approach relies mainly on centroids, so it takes less time and needs to store fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
In this paper we aimed to solve difficulties that emerge when applying landmark recognition at the production level: a clean and sufficiently large database may not exist for interesting tasks, and algorithms should be fast, scalable, and aim for a low false-positive rate with high accuracy.<br />
<br />
Toward these goals, we presented a way of cleaning landmark data. Most importantly, we introduced the use of deep CNN embeddings, trained with curriculum learning techniques and a modified version of Center loss, to make recognition fast and scalable. Compared to state-of-the-art methods, this approach shows similar results but is much faster and suitable for large-scale deployment.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. This means data cleaning takes a long time in the training process, so the proposed pipeline is hard to reuse. Also, since only the largest and most popular landmarks were used to train the CNN, the trained model will probably be most useful in big cities rather than in smaller cities with less popular landmarks.<br />
<br />
In addition, researchers often look for reliability and reproducibility. By using a private database and manually labelling it, it lends itself to an array of issues in terms of validity and integrity. Researchers who are looking for such an algorithm will not be able to sufficiently determine if the experiments do actually yield the claimed results. Also, manual labelling by those who are related to the individuals conducting this research also raises the question of conflict of interest. The primary experiment of this paper should be on a public and third-party dataset.<br />
<br />
It might be worth investigating how well the model generalizes to landmarks unseen during training. <br />
<br />
This is a very interesting implementation in a specific field. The paper shows a process for analyzing the problem and trains the model based on a deep CNN implementation. For future work, it would be practical to compare the deep CNN model with other models; such a comparison might yield a more comprehensive model for landmark recognition.<br />
<br />
This summary has a good structure, and the methodology part is very clear for readers to understand. Using diagrams for the comparison with other methods is good for visualization. Since the dataset is labelled manually, training a model is time-consuming, so it would be interesting to discuss how large IT companies (e.g., Google) address this problem.<br />
<br />
It would be beneficial if the authors could provide more explanations regarding the DELF method. Visualization of the differences between DELF and CNN from an algorithm and architecture perspective would be highly significant for the context of this paper.<br />
<br />
One challenge of landmark recognition is the large number of classes. It would be good to see a comparison between the proposed model and other models in terms of efficiency.<br />
<br />
The scope of this paper covers some of the world's best-known landmarks, many of which are well known precisely because they look very distinct. It would be interesting to see how well the model classifies different landmarks of a similar type (e.g., Notre Dame Cathedral vs. St. Paul's Cathedral). It would also be interesting to see how this model compares with other models in the literature, or, if it is unique, the authors could scale the model down to a landmark-type classification problem (castles, churches, parks, etc.) and compare against other models that way.<br />
<br />
Paper 25 (Loss Function Search in Facial Recognition) also utilizes the softmax loss function in feature discrimination in images. The difference between this paper and paper 25 is that this paper focuses on landmark images, whereas paper 25 is for facial recognition. Despite the slightly different application, both papers prove the importance of using the softmax loss function in feature discrimination, which is pretty neat.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf DOI: 10.1145/3357384.3357956<br />
<br />
[2] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum.<br />
2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking.<br />
arXiv preprint arXiv:1803.11285 (2018).</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=47388User:T358wang2020-11-28T20:57:17Z<p>Y2587wan: /* Methodology */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper provides a new and effective method for recognizing landmark images, which has been successfully applied in practice. With it, statues, buildings, and other characteristic objects can be effectively identified.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
'''1.''' The concept of landmarks is not strictly defined. Landmarks can take various forms including objects and buildings.<br />
<br />
'''2.''' The same landmark can be photographed from different angles. Certain angles may capture the interior of a building as opposed to its exterior. This could result in vastly different picture characteristics between angles. A good model should accurately identify landmarks regardless of perspective.<br />
<br />
'''3.''' The dataset is unbalanced. The majority of objects fall into the single class of "not landmarks", while relatively few images exist for each class of landmark. Hence, it will be challenging to obtain both a low false positive rate as well as a high recognition accuracy between classes of landmarks.<br />
<br />
There are also three potential problems:<br />
<br />
'''1.''' The dataset contains some erroneous content, the images are not clean, and the quantity of data is huge.<br />
<br />
'''2.''' The algorithm for learning the training set must be fast and scalable.<br />
<br />
'''3.''' The system must judge landmarks with high quality without relying on any geographic information attached to the images.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference procedure. Using this model, metrics similar to the state-of-the-art model were obtained in testing, and inference was found to be 15 times faster. Further, because of the efficient architecture, the system can serve in an online fashion. Quantitative experiments and deployment analysis are presented to demonstrate the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as an image retrieval task, and a large body of literature concentrates on image retrieval. Over the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first is the classic retrieval approach using local features: methods based on local feature descriptors organized in a bag-of-words model (a bag-of-words model is a simplified representation that keeps only the significant words in a sentence or paragraph while disregarding grammar; the approach is commonly used in classification tasks where the words serve as model features), spatial verification, Hamming embedding, and query expansion. These methods dominated image retrieval until the rise of deep convolutional neural networks (CNNs), which are used to generate global descriptors of input images.<br />
<br />
Another method extends Hamming embedding with selective match kernels. With the advent of deep convolutional neural networks, the most effective image retrieval methods are based on training CNNs for the specific task. Deep networks are very powerful for semantic feature representation, which allows them to be used effectively for landmark recognition; this shows good results but brings additional memory and complexity costs. <br />
DELF (DEep Local Feature) by Noh et al. showed promising results. This method combines the classic local-feature approach with deep learning: local features are extracted from the input image, and RANSAC is then used for geometric verification. Random Sample Consensus (RANSAC) is a method for smoothing data containing a significant percentage of errors, ideally suited to automated image analysis where interpretation is based on data generated by error-prone feature detectors. The goal of this work is to describe a method for accurate and fast large-scale landmark recognition that uses the advantages of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts: the main network, the embedding layer, and the classification layer. To obtain a main network suitable for the landmark recognition model, fine-tuning is applied, and several pre-trained backbones (Residual Networks), including ResNet-50, ResNet-200, SE-ResNeXt-101, and Wide Residual Network (WRN-50-2), are evaluated on inference quality and efficiency. Based on this evaluation, WRN-50-2 is selected as the backbone architecture. Fine-tuning is a very efficient technique in many computer vision applications because we can take advantage of everything the model has already learned and apply it to the specific task.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To determine the embedding vectors (centroids) for each landmark class effectively, the network needs to be trained so that the members of each class lie as close as possible to their centroid. Several loss functions were evaluated, including Contrastive loss, Arcface, and Center loss. Center loss is selected since it achieves the best test results: it learns a center for the embeddings of each class and penalizes the distances between image embeddings and their class centers. In addition, Center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, an additional class containing all non-landmark instances is added, and the center loss function is modified as follows. Let n be the number of landmark classes, m the mini-batch size, <math>x_i \in R^d</math> the i-th embedding, and <math>y_i</math> the corresponding label, where <math>y_i \in \{1,...,n,n+1\}</math> and n+1 is the label of the non-landmark class. Denote by <math>W \in R^{d \times n}</math> the weights of the classifier layer and by <math>W_j</math> its j-th column. Let <math>c_{y_i}</math> be the center of the embeddings of class <math>y_i</math> from Center loss, and let <math>\lambda</math> be the balancing parameter of Center loss. The final loss function is then: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
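As a sketch of the center-loss term that appears in the objective above, here is a pure-Python version following the standard center-loss formulation (the 1/2 factor is taken from that standard formulation and is an assumption about the figure's exact form):<br />

```python
def center_loss_term(embeddings, labels, centers, lam):
    """lam/2 * sum_i ||x_i - c_{y_i}||^2, added to the softmax loss.

    embeddings: list of vectors x_i; labels: list of class labels y_i;
    centers: dict mapping each label (including the extra non-landmark
    class n+1) to its learned center c_y; lam: the balancing parameter.
    """
    total = 0.0
    for x, y in zip(embeddings, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * lam * total
```

In training, the centers themselves are updated alongside the network weights, which this static sketch does not show.<br />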
<br />
</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=47387User:T358wang2020-11-28T20:55:19Z<p>Y2587wan: /* Introduction */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper presents a new and effective method for recognizing landmark images, and the method has been successfully applied to real-world images. With it, statues, buildings, and other characteristic objects can be identified effectively.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
'''1.''' The concept of landmarks is not strictly defined. Landmarks can take various forms including objects and buildings.<br />
<br />
'''2.''' The same landmark can be photographed from different angles. Certain angles may capture the interior of a building as opposed to its exterior. This could result in vastly different picture characteristics between angles. A good model should accurately identify landmarks regardless of perspective.<br />
<br />
'''3.''' The dataset is unbalanced. The majority of objects fall into the single class of "not landmarks", while relatively few images exist for each class of landmark. Hence, it will be challenging to obtain both a low false positive rate as well as a high recognition accuracy between classes of landmarks.<br />
<br />
There are also three potential problems:<br />
<br />
'''1.''' The processed dataset still contains some erroneous content: the images are not perfectly clean, and the quantity of data is huge.<br />
<br />
'''2.''' The algorithm for learning the training set must be fast and scalable.<br />
<br />
'''3.''' The system must produce high-quality landmark judgments even though no geographic information is attached to the images.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference aspects. Using this model, metrics similar to those of the state-of-the-art model were obtained in testing, while inference was 15 times faster. Further, because of the efficient architecture, the system can serve in an online fashion. The results of quantitative experiments are presented through testing and deployment analysis to demonstrate the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as one of the tasks of image retrieval, and a large body of literature concentrates on image retrieval tasks. Over the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first is a classic retrieval approach using local features: methods based on local feature descriptors organized in bag-of-words representations (a bag-of-words model is a simplified representation that keeps only the significant tokens while disregarding their order; it is commonly used in classification tasks where the tokens serve as features during model training), spatial verification, Hamming embedding, and query expansion. These methods dominated image retrieval until the rise of deep convolutional neural networks (CNNs), which came to be used to generate global descriptors of input images.<br />
<br />
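As a concrete illustration of the bag-of-words idea applied to images (a sketch added for this summary, not code from the paper; the codebook of "visual words" is assumed to have been learned beforehand, e.g. by clustering local descriptors), each local descriptor is assigned to its nearest visual word and the counts form the image's global histogram:<br />

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a visual-word codebook and count.

    descriptors: (num_descriptors, d) array of local features from one image
    codebook:    (num_words, d) array of learned visual-word centers
    """
    # Squared Euclidean distance from every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                      # nearest visual word per descriptor
    return np.bincount(words, minlength=len(codebook))
```

The resulting histogram is the fixed-length image representation that spatial verification and query expansion then operate on.<br />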
Another line of work extends these classic methods, for example with selective match kernels and Hamming embedding. With the advent of deep convolutional neural networks, the most effective image retrieval methods are based on training CNNs for the specific task. Deep networks are very powerful for semantic feature representation, which allows us to use them effectively for landmark recognition. This approach shows good results but brings additional memory and complexity costs. <br />
The DELF (DEep Local Feature) method by Noh et al. showed promising results. It combines the classic local feature approach with deep learning: local features are extracted from the input image and then verified geometrically with RANSAC. Random Sample Consensus (RANSAC) is a method for smoothing data that contain a significant percentage of errors, which makes it ideally suited to automated image analysis, where interpretation is based on data generated by error-prone feature detectors. The goal of this project is to describe a method for accurate and fast large-scale landmark recognition that exploits the advantages of deep convolutional neural networks.<br />
<br />
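The idea behind RANSAC can be illustrated with a minimal, self-contained sketch (hypothetical code written for this summary, not from the paper): robustly fit a line to 2-D points containing outliers by repeatedly sampling two-point models and keeping the one with the most inliers. DELF applies the same principle to matched local features for geometric verification.<br />

```python
import numpy as np

def ransac_line(points, iters=200, tol=0.1, seed=0):
    """Minimal RANSAC: fit y = a*x + b to points, robust to outliers."""
    rng = np.random.default_rng(seed)
    best_count, best_model = 0, None
    x, y = points[:, 0], points[:, 1]
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        if x[i] == x[j]:
            continue                               # vertical pair: skip this sample
        a = (y[j] - y[i]) / (x[j] - x[i])          # candidate model from 2 points
        b = y[i] - a * x[i]
        inliers = np.abs(y - (a * x + b)) < tol    # consensus set
        if inliers.sum() > best_count:
            best_count, best_model = int(inliers.sum()), (a, b)
    return best_model, best_count
```

A few gross outliers barely affect the recovered line, which is exactly why RANSAC suits error-prone feature matches.<br />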
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts: the main network, the embedding layer, and the classification layer. To obtain a CNN main network suitable for training the landmark recognition model, fine-tuning is applied, and several pre-trained backbones (Residual Networks) based on other similar datasets, including ResNet-50, ResNet-200, SE-ResNext-101, and Wide Residual Network (WRN-50-2), are evaluated on inference quality and efficiency. Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture. Fine-tuning is a very efficient technique in various computer vision applications because we can take advantage of everything the model has already learned and apply it to our specific task.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To effectively determine the embedding vectors (centroids) for each landmark class, the network needs to be trained so that the members of each class lie as close as possible to their centroid. Several suitable loss functions were evaluated, including Contrastive loss, Arcface, and Center loss. Center loss is selected since it achieves the best test results: it learns a center for the embeddings of each class and penalizes the distances between image embeddings and their class centers. In addition, Center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, an additional class that contains all non-landmark instances needs to be added, and the center loss function needs to be modified as follows. Let <math>n</math> be the number of landmark classes and <math>m</math> the mini-batch size; <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> the corresponding label, where <math>y_i \in</math> {1,...,n,n+1} and n+1 is the label of the non-landmark class. Denote by <math>W \in R^{d \times n}</math> the weights of the classifier layer and by <math>W_j</math> its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math>-th embedding center from Center loss and <math>\lambda</math> the balancing parameter of Center loss. The final loss function is then: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
<br />
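A minimal numerical sketch of this combined loss (illustrative only: the batch averaging, function names, and omission of the non-landmark special-casing are our assumptions; the paper's exact formulation is the formula above):<br />

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Numerically stable softmax loss over the n+1 classes."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def center_loss_term(embeddings, labels, centers, lam):
    """lam/2 * mean ||x_i - c_{y_i}||^2: pulls embeddings toward class centers."""
    diffs = embeddings - centers[labels]
    return 0.5 * lam * np.mean(np.sum(diffs ** 2, axis=1))

def total_loss(embeddings, W, labels, centers, lam):
    # Softmax loss on classifier logits plus the center-loss penalty
    return softmax_cross_entropy(embeddings @ W, labels) \
        + center_loss_term(embeddings, labels, centers, lam)
```

When every embedding coincides with its class center, the center term vanishes and only the softmax loss remains.<br />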
In the training procedure, stochastic gradient descent (SGD) is used as the optimizer with momentum 0.9 and weight decay 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256, and several data augmentations are applied to the dataset, including random resized crop, color jitter, and random flip. The training dataset is divided into four parts based on the geographical affiliation of the cities where the landmarks are located: Europe/Russia, North America/Australia/Oceania, Middle East/North Africa, and the Far East regions. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
<br />
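The stepwise learning-rate decay described above (30 epochs; divide by 10 at epochs 12 and 24) can be sketched as a small helper; the function name is our own:<br />

```python
def learning_rate(base_lr, epoch):
    """Step schedule from the training setup: base rate for epochs 0-11,
    divided by 10 from epoch 12, and by 100 from epoch 24 (30 epochs total)."""
    if epoch >= 24:
        return base_lr / 100
    if epoch >= 12:
        return base_lr / 10
    return base_lr
```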
In the inference phase, the paper introduces the term “centroids”: embedding vectors calculated by averaging embeddings, used to describe landmark classes. Calculating the centroids well is essential for effectively determining whether a query image contains a landmark. The paper proposes two refinements for computing them. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements. For example, if the landmark of interest is a palace located on a city square, images of similar buildings on the same square may be included in the data, which can distort the centroid. Second, since each landmark can be photographed from different shooting angles, it is more efficient to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition the training data into several valid clusters for each landmark, and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i, j \in 1,...,v</math>, where v is the number of valid clusters for landmark L, and v=1 if there are no valid clusters for L. <br />
<br />
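Once the clustering has produced per-landmark cluster assignments, computing the centroids <math>\mu_{l_j}</math> reduces to averaging embeddings within each valid cluster; a minimal sketch (cluster assignments are assumed given by the hierarchical agglomerative clustering step):<br />

```python
import numpy as np
from collections import defaultdict

def landmark_centroids(embeddings, cluster_ids):
    """mu_j = (1/|C_j|) * sum of the embeddings in cluster C_j of one landmark."""
    groups = defaultdict(list)
    for emb, cid in zip(embeddings, cluster_ids):
        groups[cid].append(emb)
    return {cid: np.mean(members, axis=0) for cid, members in groups.items()}
```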
Once the centroids are calculated for each landmark class, the system can make decisions whether there is any landmark in an image. The query image is passed through the landmark CNN and the resulting embedding vector is compared with all centroids by dot product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmark classes from non-landmark, a threshold <math>\eta</math> is set and it will be compared with the maximum similarity to determine if the image contains any landmarks.<br />
<br />
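The decision rule just described can be sketched as follows (illustrative only: an exhaustive dot-product scan stands in for the AKNN index used in production, and the label strings are hypothetical):<br />

```python
import numpy as np

def recognize(query_embedding, centroids, centroid_labels, eta):
    """Return the label of the most similar centroid, or None if the maximum
    dot-product similarity falls below the threshold eta (no landmark)."""
    sims = centroids @ query_embedding     # dot-product similarity to every centroid
    best = int(np.argmax(sims))
    if sims[best] < eta:
        return None                        # below threshold: image has no landmark
    return centroid_labels[best]
```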
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
<br />
We will now look at how the landmark database was created. The collection process was structured by country, city, and landmark. The world was divided into several regions: Europe, America, the Middle East, Africa, the Far East, and Australia and Oceania. Within each region, cities containing many significant landmarks were selected, and some natural landmarks were filtered out because they are difficult to distinguish. Once the cities and landmarks were selected, both images and metadata were collected for each landmark.<br />
<br />
[[File:landmarkcleaning.png | center | 400px]]<br />
<br />
After forming the database, it had to be cleaned before it could be used to train the CNN. First, redundant images were removed for each landmark. Then, for each landmark, 5 images with a high probability of containing the landmark were picked and checked manually. The database was then cleaned in parts using the curriculum learning process, as described in the pseudocode above. The final database contained 11,381 landmarks in 503 cities and 70 countries, with 2,331,784 landmark images and 900,000 non-landmark images. Landmarks with fewer than 100 images are called "rare".<br />
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1 − 3% of the total number of photos on average. This distribution was emulated in an offline test, and the geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used to measure the results of the experiments: Sensitivity, the accuracy of the model on images with landmarks (also called Recall), and Specificity, its accuracy on images without landmarks. Several variants of DELF were evaluated, and the best results in terms of sensitivity and specificity are included in the table below. The table also contains the results of the model trained with Softmax loss only, and with Softmax and Center loss. The table thus reflects how each added element improves the approach.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
<br />
It’s very important to understand how the model works on “rare” landmarks because of the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows what percentage of the landmark examples in the offline test are of the corresponding type. We find that the sensitivity on “frequent” landmarks is much higher than on “rare” ones.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
Analysis of the model's behavior across different categories of landmarks in the offline test is presented in the table below. These results show that the model works successfully with various categories of landmarks. Predictably, better results (92% sensitivity and 99.5% specificity) were obtained when the offline test was run on the model with geo-information.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
The Revisited Paris dataset (RPar) [2] was also used to measure the quality of the landmark recognition approach. Together with Revisited Oxford (ROxf), it is a standard benchmark for comparing image retrieval algorithms. In recognition, it is important to determine which landmark is contained in the query image. Images of the same landmark can have different shooting angles or be taken inside versus outside the building. Thus, it is reasonable to measure the quality of the model on the standard benchmark while adapting it to our task setting: not all classes from the queries are present in the landmark dataset. Images that contain the correct landmark but are taken from different shooting angles within the building were transferred to the “junk” category, which does not influence the final score and makes the test markup closer to our model's goal. Results on RPar with and without distractors in medium and hard modes are presented in the tables below. <br />
<br />
<div style="text-align:center;"> '''Revisited Paris Medium''' </div><br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
<br />
<div style="text-align:center;"> '''Revisited Paris Hard''' </div><br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
The most efficient recent approaches to landmark recognition are built on fine-tuned CNNs. We chose to compare our method with DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity&specificity table), but the proposed approach is much faster on the inference stage and more memory efficient.<br />
<br />
In more detail, during the inference stage DELF needs more forward passes through the CNN, has to search the entire database, and performs RANSAC for geometric verification, all of which make it much more time-consuming than the proposed approach. Our approach relies mainly on centroids, so it takes less time and stores fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
In this paper we were hoping to solve some difficulties that emerge when trying to apply landmark recognition at the production level: there may not be a clean and sufficiently large database for interesting tasks, and algorithms should be fast and scalable while aiming for a low false-positive rate and high accuracy.<br />
<br />
While aiming for these goals, we presented a way of cleaning landmark data. And most importantly, we introduced the usage of embeddings of deep CNN to make recognition fast and scalable, trained by curriculum learning techniques with modified versions of Center loss. Compared to the state-of-the-art methods, this approach shows similar results but is much faster and suitable for implementation on a large scale.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. That means a large amount of time is spent on data cleaning during training, so the proposed algorithm lacks reusability. Also, since only the largest and most popular landmarks were used to train the CNN, the trained model will probably be most useful in big cities rather than in smaller cities with less popular landmarks.<br />
<br />
In addition, researchers often look for reliability and reproducibility. By using a private database and manually labelling it, it lends itself to an array of issues in terms of validity and integrity. Researchers who are looking for such an algorithm will not be able to sufficiently determine if the experiments do actually yield the claimed results. Also, manual labelling by those who are related to the individuals conducting this research also raises the question of conflict of interest. The primary experiment of this paper should be on a public and third-party dataset.<br />
<br />
It might be worth looking into the ability to generalize better. <br />
<br />
This is a very interesting implementation in a specific field. The paper shows a process for analyzing the problem and trains the model based on a deep CNN implementation. In future work, it would be practical to compare the deep CNN model with other models; such a comparison might yield a more comprehensive training model for landmark recognition.<br />
<br />
This summary has a good structure, and the methodology part is very clear for readers to understand. Using diagrams for the comparison with other methods helps readers visualize the differences. Since the dataset is labelled manually, training a model is time-consuming, so it might be interesting to discuss how large IT companies (e.g., Google) address this problem.<br />
<br />
It would be beneficial if the authors could provide more explanations regarding the DELF method. Visualization of the differences between DELF and CNN from an algorithm and architecture perspective would be highly significant for the context of this paper.<br />
<br />
One challenge of landmark recognition is the large number of classes. It would be good to see a comparison between the proposed model and other models in terms of efficiency.<br />
<br />
The scope of this paper seems to work specifically with some of the most well known landmarks in the world, and many of these landmarks are well known because they are very distinct in how they look. It would be interesting to see how well the model works when classifying different landmarks of similar type (ie, Notre Dame Cathedral vs. St. Paul's Cathedral, etc.). It would also be interesting to see how this model compares with other models in literature, or if this is unique, perhaps the authors could scale this model down to a landmark classification problem (castles, churches, parks, etc.) and compare against other models that way.<br />
<br />
Paper 25 (Loss Function Search in Facial Recognition) also utilizes the softmax loss function for feature discrimination in images. The difference is that this paper focuses on landmark images, whereas paper 25 addresses facial recognition. Despite the slightly different applications, both papers demonstrate the importance of the softmax loss function in feature discrimination, which is pretty neat.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf DOI: 10.1145/3357384.3357956<br />
<br />
[2] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. 2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking. arXiv preprint arXiv:1803.11285 (2018).</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=47384Task Understanding from Confusing Multi-task Data2020-11-28T20:51:35Z<p>Y2587wan: /* Critique */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is artificial intelligence that outperforms humans in a narrowly defined task. Applications of Narrow AI are becoming more and more common: it is used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure raises some concerns. Some argue that the characteristics of Narrow AI make it fragile, and where neural networks control critical systems (such as power grids or financial transactions), risk-averse operators may prefer alternatives. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition; otherwise, the machine ends up producing different outputs for the same input value. Researchers manually assign a task to each input in the sample to train the machine (see figure 1(a)). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposes a new learning method called confusing supervised learning (CSL), which includes 2 functions: a de-confusing function and a mapping function. The first assigns an input to its respective task, and the latter finds the relationship between the input and its label (see figure 1(b)). To train a network under CSL, a CSL-Net is constructed to represent CSL’s variables. However, this structure cannot be optimized by gradient back-propagation. The difficulty is resolved by alternately training the de-confusing net and optimizing the mapping net. <br />
<br />
Experiments on function regression and image recognition problems were constructed, and CSL-Net was compared with multi-task learning given complete information to test its performance. Experiment results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as current multi-task machines with complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another (Caruana, 1997), resulting in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. In multi-task learning, the task to which every sample belongs is known, and with this task definition the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn. This paper is instead interested in machine learning without a clear task definition and without manual task annotation: understanding the task concept from confusing input-label pairs.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one, yet all input-label pairs appear to come from a unified distribution, which is then estimated by a mixture of multiple probability models. Due to the lack of task information, latent variable learning is insufficient to solve the research problem of multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, which is an identical but unknown mapping relationship. Assuming the risk measure is mean squared error (MSE), the expected risk function is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk function can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk function. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
<br />
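To see why <math>\bar{f}</math> is the optimal solution, one can minimize the integrand pointwise (a short check added here): setting the derivative with respect to <math>g(x)</math> to zero gives<br />

$$ \frac{\partial}{\partial g(x)} \sum_{j=1}^n (f_j(x) - g(x))^2 \, p(f_j) = -2 \sum_{j=1}^n (f_j(x) - g(x)) \, p(f_j) = 0 \quad\Rightarrow\quad g^*(x) = \sum_{j=1}^n p(f_j) f_j(x), $$

using <math>\sum_{j=1}^n p(f_j) = 1</math>. The minimizer is the probability-weighted average of the ground-truth functions, not any single task's function, which is exactly the source of the unavoidable confusion risk.<br />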
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
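Once a hard task assignment is available, the empirical risk can be computed directly; a minimal sketch (the function and argument names are our own, and the one-hot <math>h</math> is represented as a function returning the assigned task index):<br />

```python
def csl_empirical_risk(xs, ys, mappings, deconfuse):
    """Empirical CSL risk: each sample contributes its squared error only under
    the mapping g_k that the (one-hot) deconfusing function h assigns it to."""
    total = 0.0
    for x, y in zip(xs, ys):
        k = deconfuse(x, y)                  # index of the task h assigns (x, y) to
        total += (y - mappings[k](x)) ** 2
    return total
```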
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math> simultaneously with finite VC dimension <math>\tau</math> of CSL learning framework, the risk measure is bounded by<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net corresponds to the function set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one particular task. The deconfusing-net corresponds to the function h, whose input is a sample <math>(x,y)</math> and whose output is an n-dimensional one-hot vector. This output vector determines the task to which the sample <math>(x,y)</math> should be assigned. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of one-hot output from the deconfusing-net. Approximating it with softmax would make the deconfusing-net output non-one-hot, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With the function <math>h</math> from the deconfusing-net fixed, the goal is to train every mapping function <math>g_k</math> on its corresponding samples <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. The back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> is assigned to the <math>g_k</math> whose output is closest to the label <math>y</math> among all <math>k</math>. The mapping-net thus provides a temporary solution for the deconfusing-net: <math>\hat{h}(x_i, y_i) = \arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, this optimization problem can be solved by updating the deconfusing-net with back-propagation.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
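The alternating procedure can be sketched on a toy regression problem. The following is an illustrative sketch, not the authors' implementation: the mapping-net is replaced by two straight-line fits, the deconfusing-net by a hard nearest-mapping assignment, and initialization uses a few task-annotated samples in the spirit of the paper's 5-shot warm-up.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Confusing data: each input x carries the label of ONE of two hidden tasks,
# f_1(x) = 2x or f_2(x) = -x + 1, with no task annotation attached.
x = np.linspace(-1.0, 1.0, 200)
task = rng.integers(0, 2, size=x.size)      # hidden ground truth, unseen below
y = np.where(task == 0, 2.0 * x, -x + 1.0)

n_tasks = 2
# Warm-up: initialize each mapping g_k from 5 task-annotated samples
# (a stand-in for the paper's 5-shot warm-up in the regression experiment).
params = np.empty((n_tasks, 2))             # rows: (slope, intercept)
for k in range(n_tasks):
    idx = np.flatnonzero(task == k)[:5]
    params[k] = np.polyfit(x[idx], y[idx], 1)

for _ in range(10):
    # Deconfusing step: one-hot h assigns every sample to its closest mapping.
    preds = params[:, 0][:, None] * x[None, :] + params[:, 1][:, None]
    assign = np.argmin((y[None, :] - preds) ** 2, axis=0)
    # Mapping step: refit each g_k on the samples allocated to it.
    for k in range(n_tasks):
        mask = assign == k
        if mask.sum() >= 2:
            params[k] = np.polyfit(x[mask], y[mask], 1)

print(np.round(params, 3))                  # recovers both lines, up to task order
```

With noise-free data and a reasonable warm-up, the hard assignment step recovers the correct task allocation, after which each refit reproduces its ground-truth line.<br />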
<br />
=Experiment=<br />
==Setup==<br />
<br />
Three data sets are used to compare CSL to existing methods: one function regression task and two image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i),i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple mapping functions <math>f_j(x),j=1,...,n</math>, so the goal is to recover both the mapping functions <math>f_j</math> and to determine which mapping function corresponds to each of the <math>m</math> observations. Three scalar-valued, scalar-input functions that intersect one another at several points were chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the Colorful-MNIST data and consists of pictures labeled with one of the “Gender”, “Category”, or “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the fraction of observations on which the learned deconfusing function <math>h</math> agrees with the human task assignment <math>\tilde h</math> about whether each observation "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
This measure reflects the fact that, in addition to allocating samples to tasks as humans do, a machine should approximate every mapping function accurately in order to provide the corresponding labels. Since the learned tasks may be recovered in any order, <math>\alpha_L</math> matches each ground-truth mapping with the learned mapping whose predictions are closest to it. <br />
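Assuming the task assignments are stored as integer arrays and the function outputs are tabulated on the <math>m</math> inputs (an illustrative sketch, not the paper's code), the two metrics can be computed as follows:<br />

```python
import numpy as np

def task_prediction_accuracy(h_learned, h_true, n):
    """alpha_T(j): agreement between the learned assignment and the human
    annotation on membership in task j, maximized over the matching of the
    learned task index k to the ground-truth task index j."""
    return np.array([
        max(np.mean((h_learned == k) == (h_true == j)) for k in range(n))
        for j in range(n)
    ])

def label_prediction_accuracy(g_out, f_out):
    """alpha_L(j): mean relative accuracy 1 - |g_k - f_j| / |f_j| of the best
    learned mapping k for ground-truth task j.
    g_out, f_out: arrays of shape (n, m) of outputs on the m inputs."""
    rel = 1.0 - np.abs(g_out[None, :, :] - f_out[:, None, :]) / np.abs(f_out[:, None, :])
    return rel.mean(axis=2).max(axis=1)

# Toy check: the learned tasks are a permutation of the ground-truth tasks.
h_true = np.array([0, 0, 1, 1])
h_learned = 1 - h_true                       # same partition, indices swapped
f_out = np.array([[1.0, 2.0, 3.0, 4.0],
                  [4.0, 3.0, 2.0, 1.0]])
g_out = f_out[::-1]                          # mappings learned in swapped order
print(task_prediction_accuracy(h_learned, h_true, 2))   # [1. 1.]
print(label_prediction_accuracy(g_out, f_out))          # [1. 1.]
```

The max over <math>k</math> in both functions implements the matching of learned task indices to ground-truth task indices described above.<br />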
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label (Lee, 2013), and SMiLE (Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the mean of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to partition the observations into the correct tasks, a 5-shot warm-up was used. With this warm-up, the CSL method learns the ground-truth functions well, which indicates that the neural network initialization was set up properly.<br />
<br />
'''Image Classification''': Visualizations created through spectral embedding confirm the task-labeling proficiency of the deconfusing network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label (Lee, 2013), and SMiLE (Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is an <math>n</math>-dimensional indicator vector recording whether the image <math>x_i</math> carries each of the <math>n</math> labels.<br />
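For concreteness, a partially labelled target matrix might be encoded as follows (a hypothetical encoding for illustration only, with NaN marking label entries that were never annotated):<br />

```python
import numpy as np

# Hypothetical targets for 3 images and n = 4 labels; each y_i is an
# indicator vector, and NaN marks entries with no annotation available.
y = np.array([
    [1.0, 0.0, np.nan, 1.0],
    [np.nan, 1.0, 0.0, np.nan],
    [0.0, np.nan, np.nan, 1.0],
])
observed = ~np.isnan(y)          # supervision mask usable during training
print(int(observed.sum()), "of", y.size, "label entries are annotated")
```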
<br />
Applications of multi-label classification include building recommendation systems, social media targeting, and detecting adverse drug reactions from text.<br />
<br />
Multi-label learning can also be used to improve the syndrome diagnosis of a patient by considering multiple syndromes jointly instead of a single syndrome.<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is currently determined by increasing the task count progressively and testing the performance. Ideally, a more principled way of choosing the number of tasks is desirable than searching one by one for the smallest count that gives the lowest risk. Adding constraints to the deconfusing-net is a reasonable direction for addressing this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning low-level features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Although this does not diminish the algorithm's effectiveness at learning from confusing data on top of pre-trained features, the fully-connected network can only be trained on learned CNN features. It remains a challenge for the current algorithm to learn basic features directly through a CNN structure while understanding tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem without manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also argues that the CSL method is an important step in moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
The repeated training runs needed to find the smallest task number whose risk is closest to zero make the learning process inefficient. In addition, the current algorithm has difficulty learning basic features directly through a CNN structure, so in the experiments a fully connected network must be trained on top of pre-trained CNN features.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was compared against algorithms that were not designed to deal with confusing data and that do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm may also have a significant scaling issue: the proposed method requires repeated training processes, which might be too expensive for researchers who want to implement and improve on it.<br />
<br />
This research paper should have included a plot on loss (of both functions) against epochs in the paper. A common issue with fixing the parameters of one network and updating the other is the variability during training. This is prevalent in other algorithms with similar training methods such as generative adversarial networks (GAN). For instance, ''mode collapse'' is the issue of one network stuck in local minima and other networks that rely on this network may receive incorrect signals during backpropagation. In the case of CSL-Net, since the Deconfusing-Net directly relies on Mapping-Net for training labels, if the Mapping-Net is unable to sufficiently converge, the Deconfusing-Net may incorrectly learn the mapping from inputs to the task. For data with high noise, oscillations may severely prolong the time needed to converge because of the strong correlation in prediction between the two networks.<br />
<br />
- It would be interesting to see this implemented in more examples, to test its robustness on different types of data.<br />
<br />
Even though this paper already includes some examples when testing CSL in experiments, it would be better to include more detailed examples of partial labels in the "Application of Multi-label Learning" section.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between tasks, since the same output head is used for all tasks. This may be why the method fails to learn low-level representations and requires pretraining. It would be good to see more explanation in the paper about why this is not a problem, if it was investigated.<br />
<br />
It would be a good idea to include comparison details in the summary to make the results and the conclusion more convincing. For instance, although the paper presents results generated using confusing data and provides some applications for multi-label learning, these two sections still fall short and could use some technical details as supporting evidence.<br />
<br />
It would be interesting to investigate whether the order of adding tasks influences the model performance.<br />
<br />
It would be interesting to see the effectiveness of applying CSL in face recognition, such that not only does the algorithm map the face to an identity, it also categorizes the face based on other features like beard/no beard and glasses/no glasses simultaneously.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.<br />
<br />
[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387</div>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some argue that the characteristics of Narrow AI make it fragile, and where neural networks could be used to control important systems (such as power grids and financial transactions), risk-averse alternatives may be preferred. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition; otherwise, the machine ends up producing different outputs for the same input value. Researchers manually assign tasks to each input in the sample to train the machine; see figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposes a new learning method called confusing supervised learning (CSL), which includes two functions: a deconfusing function and a mapping function. The former identifies the task to which an input belongs, and the latter learns the relationship between the input and its label; see figure 1(b). To train a network for CSL, CSL-Net is constructed to represent CSL’s learning functions. However, this structure cannot be optimized by gradient back-propagation directly. This difficulty is resolved by alternately performing the deconfusing-net and mapping-net optimization steps. <br />
<br />
Experiments on function regression and image recognition problems were constructed and compared with multi-task learning with complete information to test CSL-Net’s performance. The experimental results show that CSL-Net can learn multiple mappings for every task simultaneously and achieve the same cognition result as current multi-task machines given complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known. By exploiting similarities and differences between tasks, the learning of one task can improve the learning of another (Caruana, 1997), which results in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. With a task definition, the input-output mapping of every task can be represented by a unified function. However, these task definitions are manually constructed, and machines need manual task annotations to learn; without such annotations, the goal of this paper is to understand the task concept from confusing input-label pairs.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true functions with a mixed probability model. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one: all input-label pairs come from a unified distribution, and this distribution is estimated by a mixture of multiple probability models. Due to the lack of task information, however, latent variable learning is insufficient to solve the research problem of confusing multi-task samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class; in multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label learning does not consider the relationships between different label judgments, and each judgment is assumed to be independent. An example where multi-label learning is applicable is a website that wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (e.g., an article can be tagged under both the politics and business categories), multi-label learning is of primary concern here.<br />
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, where <math>f</math> is a fixed but unknown mapping. Assuming the risk measure is mean squared error (MSE), the expected risk function is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^m (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk function can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk function. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
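This unavoidable confusion risk can be illustrated numerically. In the following sketch (illustrative only; the two tasks are assumed for the example), an ordinary least-squares fit to confusing samples drawn equally from <math>f_1(x)=2x</math> and <math>f_2(x)=-x+1</math> recovers neither task but their mean <math>\bar{f}(x) = (x+1)/2</math>:<br />

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 100)
# Every input appears once with each task's label, i.e. p(f_1) = p(f_2) = 1/2.
xs = np.concatenate([x, x])
ys = np.concatenate([2.0 * x, -x + 1.0])

# A single least-squares line fitted to the confusing data.
slope, intercept = np.polyfit(xs, ys, 1)
print(round(slope, 3), round(intercept, 3))   # 0.5 0.5 -> g*(x) = (x + 1) / 2
```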
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
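As a quick numerical check of Theorem 1 (an illustrative sketch with two assumed ground-truth tasks, not code from the paper), plugging <math>g_k^* = f_k</math> and the indicator <math>h^*</math> into the empirical risk gives exactly zero:<br />

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 50)
f = [lambda t: 2.0 * t, lambda t: -t + 1.0]   # ground-truth tasks f_1, f_2

# Confusing samples (x_i, y_i): every input paired with each task's label.
xs = np.concatenate([x, x])
ys = np.concatenate([f[0](x), f[1](x)])
src = np.concatenate([np.zeros(x.size, int), np.ones(x.size, int)])  # true task j

g = f                                          # optimal mappings g_k* = f_k
# Optimal deconfusing function h*(x, f_j(x), g_k) = I[j = k].
risk = sum((ys[i] - g[k](xs[i])) ** 2 * (src[i] == k)
           for i in range(xs.size) for k in range(len(g)))
print(risk)                                    # 0.0
```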
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math>, for a CSL learning framework with finite VC dimension <math>\tau</math>, the risk measure is bounded by''<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> denotes the parameters of the learning functions <math>g, h</math>, <math>B</math> is an upper bound on the risk of a single sample, <math>m</math> is the size of the training data, and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
<br />
<br />
- It would be interesting to see this implemented in more examples, to test the robustness of different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low level representations, and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem, if it was investigated.<br />
<br />
It would be a good idea to include comparison details in the summary to make results and the conclusion more convincing. For instance, though the paper introduced the result generated using confusion data, and provide some applications for multi-label learning, these two sections still fell short and could use some technical details as supporting evidence.<br />
<br />
It is interesting to investigate if the order of adding tasks will influence the model performance.<br />
<br />
It would be an interesting to see the effectiveness of applying CSL in face recognition, such that not only does the algorithm map the face to an identity, it also categorizes the face based on other features like beard/no beard and glasses/no glasses simultaneously.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.<br />
<br />
[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data&diff=47381Task Understanding from Confusing Multi-task Data2020-11-28T20:46:09Z<p>Y2587wan: /* Description of the Problem */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is artificial intelligence that outperforms humans in a narrowly defined task. Applications of Narrow AI are becoming more and more common; for example, it can be used for spam filtering, music recommendation services, and even self-driving cars. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Because Narrow AI systems can be brittle, risk-averse operators may prefer alternatives when neural networks would otherwise control critical systems (such as power grids or financial transactions). While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition; otherwise, training would face different outputs for the same input value. Researchers manually assign a task to each input in the sample to train the machine. See figure 1(a). This method incurs high annotation costs and restricts the machine’s ability to mirror the human recognition process. This paper is interested in developing an algorithm that understands task concepts and performs multi-task learning without manual task annotations. <br />
<br />
This paper proposes a new learning method called confusing supervised learning (CSL), which includes two functions: a deconfusing function and a mapping function. The former assigns each input to its respective task, and the latter learns the relationship between the input and its label. See figure 1(b). To train a network for CSL, CSL-Net is constructed to represent CSL's variables. However, this structure cannot be optimized by gradient back-propagation. This difficulty is resolved by alternately training the deconfusing net and the mapping net. <br />
<br />
Experiments on function regression and image recognition problems were constructed and compared with multi-task learning given complete information to test CSL-Net’s performance. Experimental results show that CSL-Net can learn multiple mappings for every task simultaneously and achieves the same cognition result as a current multi-task machine given complete information.<br />
<br />
= Related Work =<br />
<br />
[[File:CSLFigure2.PNG | 700px]]<br />
<br />
==Multi-task learning==<br />
Multi-task learning aims to learn multiple tasks simultaneously using a shared feature representation. In multi-task learning, the task to which every sample belongs is known, so the input-output mapping of every task can be represented by a unified function. By exploiting similarities and differences between tasks, the learning from one task can improve the learning of another (Caruana, 1997), resulting in improved learning efficiency. Multi-task learning is used in disciplines like computer vision, natural language processing, and reinforcement learning. However, these task definitions are manually constructed, and machines need manual task annotations to learn. This paper is instead interested in machine learning without a clear task definition and without manual task annotation, i.e. understanding the task concept from confusing input-label pairs.<br />
<br />
==Latent variable learning==<br />
Latent variable learning aims to estimate the true function with mixed probability models. See '''figure 2a'''. In the multi-task learning problem without task annotations, samples are generated from multiple distributions instead of one, so the overall distribution is estimated by a mixture of multiple probability models. Due to the lack of task information, however, latent variable learning is insufficient to solve the research problem of multi-task confusing samples.<br />
<br />
==Multi-label learning==<br />
Multi-label learning aims to assign an input to a set of classes/labels. See '''figure 2b'''. It is a generalization of multi-class classification, which classifies an input into one class. In multi-label learning, an input can be classified into more than one class. Unlike multi-task learning, multi-label does not consider the relationship between different label judgments and it is assumed that each judgment is independent. An example where multi-label learning is applicable is the scenario where a website wants to automatically assign applicable tags/categories to an article. Since an article can be related to multiple categories (eg. an article can be tagged under the politics and business categories) multi-label learning is of primary concern here.<br />
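As a toy illustration of the distinction (the tags here are made up): a multi-label target is simply an indicator vector that may contain several 1s, whereas multi-class classification would force exactly one.

```python
# Toy illustration (made-up tags): multi-class picks exactly one label,
# while multi-label allows an indicator vector with several 1s.
tags = ["politics", "business", "sports", "tech"]
article_tags = {"politics", "business"}   # the article belongs to both

y_multilabel = [1 if t in article_tags else 0 for t in tags]
print(y_multilabel)   # [1, 1, 0, 0]
```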
<br />
= Confusing Supervised Learning =<br />
<br />
== Description of the Problem ==<br />
<br />
Confusing supervised learning (CSL) offers a solution to the issue at hand. A major area of improvement can be seen in the choice of risk measure. In traditional supervised learning, let <math> (x,y)</math> be the training samples from <math>y=f(x)</math>, where <math>f</math> is a single but unknown mapping. Assuming the risk measure is mean squared error (MSE), the expected risk function is<br />
<br />
$$ R(g) = \int_x (f(x) - g(x))^2 p(x) \; \mathrm{d}x $$<br />
<br />
where <math>p(x)</math> is the data distribution of the input variable <math>x</math>. In practice, model optimizations are performed using the empirical risk<br />
<br />
$$ R_e(g) = \sum_{i=1}^n (y_i - g(x_i))^2 $$<br />
<br />
When the problem involves different tasks, the model should optimize for each data point depending on the given task. Let <math>f_j(x)</math> be the true ground-truth function for each task <math> j </math>. Therefore, for some input variable <math> x_i </math>, an ideal model <math>g</math> would predict <math> g(x_i) = f_j(x_i) </math>. With this, the risk function can be modified to fit this new task for traditional supervised learning methods.<br />
<br />
$$ R(g) = \int_x \sum_{j=1}^n (f_j(x) - g(x))^2 p(f_j) p(x) \; \mathrm{d}x $$<br />
<br />
We call <math> (f_j(x) - g(x))^2 p(f_j) </math> the '''confusing multiple mappings'''. Then the optimal solution <math>g^*(x)</math> to the mapping is <math>\bar{f}(x) = \sum_{j=1}^n p(f_j) f_j(x)</math> under this risk function. However, the optimal solution is not conditional on the specific task at hand but rather on the entire ground-truth functions. Therefore, for every non-trivial set of tasks where <math>f_u(x) \neq f_v(x)</math> for some input <math>x</math> and <math>u \neq v</math>, <math>R(g^*) > 0</math> which implies that there is an unavoidable confusion risk.<br />
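This unavoidable confusion risk can be checked numerically. The following sketch (a made-up two-task example, not from the paper) fits a single least-squares line to confused samples from <math>f_1(x)=2x</math> and <math>f_2(x)=-x</math> with equal task probabilities; the fit recovers the mixture-mean slope <math>0.5</math> and its risk stays bounded away from zero.

```python
import numpy as np

# Toy demonstration (not from the paper): two ground-truth tasks
# f1(x) = 2x and f2(x) = -x, each sampled with probability 1/2.
# A single least-squares fit g(x) = a*x over the confused samples
# recovers the mixture mean 0.5*f1 + 0.5*f2, i.e. a close to 0.5,
# with an unavoidable positive risk.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
task = rng.integers(0, 2, 2000)           # hidden task labels
y = np.where(task == 0, 2 * x, -x)        # confused labels

a = (x @ y) / (x @ x)                     # closed-form least squares
risk = np.mean((y - a * x) ** 2)
print(round(a, 2))    # close to 0.5, the mixture-mean slope
print(risk > 0.1)     # unavoidable confusion risk: True
```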
<br />
== Learning Functions of CSL ==<br />
<br />
To overcome this issue, the authors introduce two types of learning functions:<br />
* '''Deconfusing function''' &mdash; allocation of which samples come from the same task<br />
* '''Mapping function''' &mdash; mapping relation from input to the output of every learned task<br />
<br />
Suppose there are <math>n</math> ground-truth mappings <math>\{f_j : 1 \leq j \leq n\}</math> that we wish to approximate with a set of mapping functions <math>\{g_k : 1 \leq k \leq l\}</math>. The authors define the deconfusing function as an indicator function <math>h(x, y, g_k) </math> which takes some sample <math>(x,y)</math> and determines whether the sample is assigned to task <math>g_k</math>. Under the CSL framework, the risk functional (mean squared loss) is <br />
<br />
$$ R(g,h) = \int_x \sum_{j,k} (f_j(x) - g_k(x))^2 \; h(x, f_j(x), g_k) \;p(f_j) \; p(x) \;\mathrm{d}x $$<br />
<br />
which can be estimated empirically with<br />
<br />
$$R_e(g,h) = \sum_{i=1}^m \sum_{k=1}^n |y_i - g_k(x_i)|^2 \cdot h(x_i, y_i, g_k) $$<br />
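A minimal sketch of this empirical risk (function and variable names are my own): given a hard task assignment playing the role of the one-hot <math>h</math>, the risk is the squared error of each sample against the mapping it was allocated to, so a perfect deconfusing of noiseless data drives it to zero.

```python
import numpy as np

# Sketch of the CSL empirical risk R_e(g, h): assign[i] = k encodes the
# one-hot h, i.e. sample i is allocated to mapping g_k.
def csl_empirical_risk(x, y, g_list, assign):
    return sum((y[i] - g_list[assign[i]](x[i])) ** 2
               for i in range(len(x)))

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, -1.0, 4.0])
g = [lambda t: 2 * t, lambda t: -t]       # two candidate mappings
assign = [0, 1, 0]                        # a perfect deconfusing
print(csl_empirical_risk(x, y, g, assign))  # 0.0
```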
<br />
== Theoretical Results ==<br />
<br />
This novel framework yields some theoretical results to show the viability of its construction.<br />
<br />
'''Theorem 1 (Existence of Solution)'''<br />
''With the confusing supervised learning framework, there is an optimal solution''<br />
$$h^*(x, f_j(x), g_k) = \mathbb{I}[j=k]$$<br />
<br />
$$g_k^*(x) = f_k(x)$$<br />
<br />
''for each <math>k=1,..., n</math> that makes the expected risk function of the CSL problem zero.''<br />
<br />
'''Theorem 2 (Error Bound of CSL)'''<br />
''With probability at least <math>1 - \eta</math>, for a CSL learning framework with finite VC dimension <math>\tau</math>, the risk measure is bounded by''<br />
<br />
$$R(\alpha) \leq R_e(\alpha) + \frac{B\epsilon(m)}{2} \left(1 + \sqrt{1 + \frac{4R_e(\alpha)}{B\epsilon(m)}}\right)$$<br />
<br />
''where <math>\alpha</math> is the total parameters of learning functions <math>g, h</math>, <math>B</math> is the upper bound of one sample's risk, <math>m</math> is the size of training data and''<br />
$$\epsilon(m) = 4 \; \frac{\tau (\ln \frac{2m}{\tau} + 1) - \ln \eta / 4}{m}$$<br />
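For intuition, the bound can be evaluated numerically. The sketch below (parameter values and function names are my own, chosen only for illustration) implements <math>\epsilon(m)</math> and the right-hand side of Theorem 2, and checks that the bound tightens as the sample size <math>m</math> grows.

```python
import math

# Hypothetical numeric reading of Theorem 2: tau is the VC dimension,
# eta the confidence parameter, m the sample size, R_e the empirical
# risk, and B the upper bound on one sample's risk.
def epsilon(m, tau, eta):
    return 4 * (tau * (math.log(2 * m / tau) + 1) - math.log(eta / 4)) / m

def risk_bound(R_e, m, tau, eta, B):
    eps = epsilon(m, tau, eta)
    return R_e + (B * eps / 2) * (1 + math.sqrt(1 + 4 * R_e / (B * eps)))

# The bound tightens as the sample size m grows.
print(risk_bound(0.05, 1_000, 50, 0.05, 1.0)
      > risk_bound(0.05, 100_000, 50, 0.05, 1.0))  # True
```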
<br />
= CSL-Net =<br />
In this section, the authors describe how to implement and train a network for CSL.<br />
<br />
== The Structure of CSL-Net ==<br />
Two neural networks, deconfusing-net and mapping-net are trained to implement two learning function variables in empirical risk. The optimization target of the training algorithm is:<br />
$$\min_{g, h} R_e = \sum_{i=1}^{m}\sum_{k=1}^{n} (y_i - g_k(x_i))^2 \cdot h(x_i, y_i; g_k)$$<br />
<br />
The mapping-net corresponds to the function set <math>g_k</math>, where <math>y_k = g_k(x)</math> represents the output of one particular task. The deconfusing-net corresponds to the function <math>h</math>, whose input is a sample <math>(x,y)</math> and whose output is an <math>n</math>-dimensional one-hot vector determining which task the sample <math>(x,y)</math> should be assigned to. The core difficulty of this algorithm is that the risk function cannot be optimized by gradient back-propagation due to the constraint of the one-hot output from the deconfusing-net. Approximating it with a softmax would make the deconfusing-net output non-one-hot, which results in meaningless trivial solutions.<br />
<br />
== Iterative Deconfusing Algorithm ==<br />
To overcome the training difficulty, the authors divide the empirical risk minimization into two local optimization problems. In each single-network optimization step, the parameters of one network are updated while the parameters of another remain fixed. With one network's parameters unchanged, the problem can be solved by a gradient descent method of neural networks. <br />
<br />
'''Training of Mapping-Net''': With function h from deconfusing-net being determined, the goal is to train every mapping function <math>g_k</math> with its corresponding sample <math>(x_i^k, y_i^k)</math>. The optimization problem becomes: <math>\displaystyle \min_{g_k} L_{map}(g_k) = \sum_{i=1}^{m_k} \mid y_i^k - g_k(x_i^k)\mid^2</math>. Back-propagation algorithm can be applied to solve this optimization problem.<br />
<br />
'''Training of Deconfusing-Net''': The task allocation is re-evaluated during the training phase while the parameters of the mapping-net remain fixed. To minimize the original risk, every sample <math>(x, y)</math> will be assigned to <math>g_k</math> that is closest to label y among all different <math>k</math>s. Mapping-net thus provides a temporary solution for deconfusing-net: <math>\hat{h}(x_i, y_i) = arg \displaystyle\min_{k} \mid y_i - g_k(x_i)\mid^2</math>. The optimization becomes: <math>\displaystyle \min_{h} L_{dec}(h) = \sum_{i=1}^{m} \mid {h}(x_i, y_i) - \hat{h}(x_i, y_i)\mid^2</math>. Similarly, the optimization problem can be solved by updating the deconfusing-net with a back-propagation algorithm.<br />
<br />
The two optimization stages are carried out alternately until the solution converges.<br />
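The alternating scheme can be sketched with closed-form updates standing in for the two networks (a simplified toy, not the paper's implementation): two linear mappings are refit after each hard re-assignment, mirroring the deconfusing/mapping alternation described above.

```python
import numpy as np

# Simplified sketch of the iterative deconfusing algorithm: two linear
# mappings g_k(x) = a_k * x are fit to confusing data by alternating
# (i) hard re-assignment h(x, y) = argmin_k |y - g_k(x)|^2 and
# (ii) per-task least squares (in place of gradient descent).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 1000)
y = np.where(rng.random(1000) < 0.5, 3 * x, -2 * x)  # true slopes: 3, -2

a = np.array([1.0, -1.0])                 # initial slopes (warm-up guess)
for _ in range(20):
    # Deconfusing step: assign each sample to its closest mapping.
    assign = np.argmin((y[None, :] - a[:, None] * x[None, :]) ** 2, axis=0)
    # Mapping step: refit each g_k on its allocated samples.
    for k in range(2):
        m = assign == k
        if m.any():
            a[k] = (x[m] @ y[m]) / (x[m] @ x[m])

print(np.sort(np.round(a, 2)))   # recovers the two slopes, ~[-2, 3]
```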
<br />
=Experiment=<br />
==Setup==<br />
<br />
Three data sets are used to compare CSL to existing methods: one function regression task and two image classification tasks. <br />
<br />
'''Function Regression''': The function regression data comes in the form of <math>(x_i,y_i), i=1,...,m</math> pairs. However, unlike typical regression problems, there are multiple mapping functions <math>f_j(x), j=1,...,n</math>, so the goal is to recover the mapping functions <math>f_j</math> as well as to determine which mapping function corresponds to each of the <math>m</math> observations. Three scalar-input, scalar-valued functions that intersect with each other at several points were chosen as the different tasks. <br />
<br />
'''Colorful-MNIST''': The first image classification data set consists of the MNIST digit data that has been colored. Each observation in this modified set consists of a colored image (<math>x_i</math>) and either the color, or the digit it represents (<math>y_i</math>). The goal is to recover the classification task ("color" or "digit") for each observation and construct the 2 classifiers for both tasks. <br />
<br />
'''Kaggle Fashion Product''': This data set has more observations than the "colored-MNIST" data and consists of pictures labeled with the “Gender”, “Category”, or “Color” of the clothing item.<br />
<br />
==Use of Pre-Trained CNN Feature Layers==<br />
<br />
In the Kaggle Fashion Product experiment, CSL trains fully-connected layers that have been attached to feature-identifying layers from pre-trained Convolutional Neural Networks.<br />
<br />
==Metrics of Confusing Supervised Learning==<br />
<br />
There are two measures of accuracy used to evaluate and compare CSL to other methods, corresponding respectively to the accuracy of the task labeling and the accuracy of the learned mapping function. <br />
<br />
'''Task Prediction Accuracy''': <math>\alpha_T(j)</math> is the average number of times the learned deconfusing function <math>h</math> agrees with the task-assignment ability of humans <math>\tilde h</math> on whether each observation in the data "is" or "is not" in task <math>j</math>.<br />
<br />
$$ \alpha_T(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m I[h(x_i,y_i;f_k),\tilde h(x_i,y_i;f_j)]$$<br />
<br />
The max over <math>k</math> is taken because we need to determine which learned task corresponds to which ground-truth task.<br />
<br />
'''Label Prediction Accuracy''': <math>\alpha_L(j)</math> again chooses <math>f_k</math>, the learned mapping function that is closest to the ground-truth of task <math>j</math>, and measures its average absolute accuracy compared to the ground-truth of task <math>j</math>, <math>f_j</math>, across all <math>m</math> observations.<br />
<br />
$$ \alpha_L(j) = \operatorname{max}_k\frac{1}{m}\sum_{i=1}^m 1-\dfrac{|g_k(x_i)-f_j(x_i)|}{|f_j(x_i)|}$$<br />
<br />
The purpose of this measure arises from the fact that, in addition to learning mapping allocations like humans, machines should approximate all mapping functions accurately in order to provide the corresponding labels. The Label Prediction Accuracy measure accounts for the exchangeability of learned tasks: since the learned tasks carry no fixed order, each learned mapping only needs to predict outputs close to the ground-truth of ''some'' task. <br />
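The two metrics can be sketched as follows (all names are my own; the maximization over the matching between learned tasks <math>k</math> and true tasks <math>j</math> is implemented by brute force over permutations):

```python
import numpy as np
from itertools import permutations

# Sketch of task prediction accuracy alpha_T: best agreement between
# predicted and true task indices over all relabelings of learned tasks.
def alpha_T(h_pred, h_true, n_tasks):
    best = 0.0
    for perm in permutations(range(n_tasks)):
        remap = np.array(perm)[h_pred]
        best = max(best, np.mean(remap == h_true))
    return best

# Sketch of label prediction accuracy alpha_L: for each true task j,
# take the learned mapping k with the highest average relative accuracy.
def alpha_L(g_out, f_out):
    """g_out: (n_tasks, m) learned outputs; f_out: (n_tasks, m) truths."""
    acc = 1 - np.abs(g_out[:, None, :] - f_out[None, :, :]) / np.abs(f_out)
    return acc.mean(axis=2).max(axis=0)   # max over learned task k, per j

h_true = np.array([0, 0, 1, 1])
h_pred = np.array([1, 1, 0, 0])           # same split, swapped task names
print(alpha_T(h_pred, h_true, 2))         # 1.0

f = np.array([[1.0, 2.0], [3.0, 4.0]])    # two tasks, two samples
print(alpha_L(f, f))                      # perfect mappings give [1. 1.]
```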
<br />
==Results==<br />
<br />
Given confusing data, CSL performs better than traditional supervised learning methods, Pseudo-Label(Lee, 2013), and SMiLE(Tan et al., 2017). This is demonstrated by CSL's <math>\alpha_L</math> scores of around 95%, compared to <math>\alpha_L</math> scores of under 50% for the other methods. This supports the assertion that traditional methods only learn the means of all the ground-truth mapping functions when presented with confusing data.<br />
<br />
'''Function Regression''': In order to partition the observations into the correct tasks, a 5-shot warm-up was used. With this warm-up, the CSL method learns the ground truth well, indicating that a proper initialization of the neural network is sufficient.<br />
<br />
'''Image Classification''': Visualizations created through Spectral embedding confirm the task labeling proficiency of the deconfusing neural network <math>h</math>.<br />
<br />
The classification and function prediction accuracy of CSL are comparable to supervised learning programs that have been given access to the ground-truth labels.<br />
<br />
==Application of Multi-label Learning==<br />
<br />
CSL also had better accuracy than traditional supervised learning methods, Pseudo-Label (Lee, 2013), and SMiLE (Tan et al., 2017) when presented with partially labelled multi-label data <math>(x_i,y_i)</math>, where <math>y_i</math> is an <math>n</math>-long indicator vector recording whether the image <math>x_i</math> corresponds to each of the <math>n</math> labels.<br />
<br />
Applications of multi-label classification include building recommendation systems, social media targeting, and detecting adverse drug reactions from text.<br />
<br />
Multi-label learning can also be used to improve syndrome diagnosis for a patient by considering multiple syndromes instead of a single syndrome [6].<br />
<br />
==Limitations==<br />
<br />
'''Number of Tasks''': The number of tasks is determined by increasing it progressively and testing the performance. Ideally, a better way of choosing the number of tasks is desired, rather than adding tasks one by one and picking the smallest number that gives the smallest risk. Adding low-quality constraints to the deconfusing net is one plausible remedy for this problem.<br />
<br />
'''Learning of Basic Features''': The CSL framework is not good at learning low-level features. So far, a pre-trained CNN backbone is needed for complicated image classification problems. Although the algorithm's effectiveness at learning from confusing data on top of pre-trained features is unaffected, the fully connected network can only be trained on learned CNN features. It remains a challenge for the current algorithm to learn basic features directly through a CNN structure while understanding tasks simultaneously.<br />
<br />
= Conclusion =<br />
<br />
This paper proposes the CSL method for tackling the multi-task learning problem without manual task annotations in the input data. The model obtains a basic task concept by differentiating multiple mappings. The paper also demonstrates that the CSL method is an important step in moving from Narrow AI towards General AI for multi-task learning.<br />
<br />
However, there are some limitations that can be improved for future work:<br />
First, the repeated training needed to find the smallest number of tasks whose risk is closest to zero makes the learning process inefficient. Second, the current algorithm cannot learn basic features directly through a CNN structure, so a fully connected network must be trained on top of pre-learned CNN features in the experiments.<br />
<br />
= Critique =<br />
<br />
The classification accuracy of CSL was compared against algorithms that were not designed to deal with confusing data and that do not first classify the task of each observation.<br />
<br />
Human task annotation is also imperfect, so one additional application of CSL may be to attempt to flag task annotation errors made by humans, such as in sorting comments for items sold by online retailers; concerned customers, in particular, may not correctly label their comments as "refund", "order didn't arrive", "order damaged", "how good the item is" etc.<br />
<br />
This algorithm will also have a huge issue in scaling, as the proposed method requires repeated training processes, so it might be too expensive for researchers to implement and improve on this algorithm.<br />
<br />
This research paper should have included a plot of the loss (of both functions) against epochs. A common issue with fixing the parameters of one network while updating the other is instability during training. This is prevalent in other algorithms with similar training schemes, such as generative adversarial networks (GANs). For instance, ''mode collapse'' occurs when one network gets stuck in a local minimum, and other networks that rely on it may receive incorrect signals during backpropagation. In the case of CSL-Net, since the deconfusing net directly relies on the mapping net for training labels, if the mapping net fails to converge sufficiently, the deconfusing net may incorrectly learn the mapping from inputs to tasks. For data with high noise, oscillations may severely prolong the time needed for convergence because of the strong correlation between the two networks' predictions.<br />
<br />
- It would be interesting to see this implemented on more examples, to test its robustness across different types of data.<br />
<br />
Even though this paper has already included some examples when testing the CSL in experiments, it will be better to include more detailed examples for partial-label in the "Application of Multi-label Learning" section.<br />
<br />
When using this framework for classification, the order of the one-hot classification labels for each task will likely influence the relationships learned between each task, since the same output header is used for all tasks. This may be why this method fails to learn low level representations, and requires pretraining. I would like to see more explanation in the paper about why this isn't a problem, if it was investigated.<br />
<br />
It would be a good idea to include comparison details in the summary to make the results and conclusion more convincing. For instance, though the summary presents the results generated using confusing data and provides some applications of multi-label learning, these two sections still fall short and could use more technical details as supporting evidence.<br />
<br />
It is interesting to investigate if the order of adding tasks will influence the model performance.<br />
<br />
It would be interesting to see the effectiveness of applying CSL to face recognition, such that the algorithm not only maps a face to an identity but also simultaneously categorizes the face by other attributes, like beard/no beard and glasses/no glasses.<br />
<br />
= References =<br />
<br />
[1] Su, Xin, et al. "Task Understanding from Confusing Multi-task Data."<br />
<br />
[2] Caruana, R. (1997) "Multi-task learning"<br />
<br />
[3] Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML, vol. 3, 2013, pp. 2–8. <br />
<br />
[4] Tan, Q., Yu, Y., Yu, G., and Wang, J. Semi-supervised multi-label classification using incomplete label information. Neurocomputing, vol. 260, 2017, pp. 192–202.<br />
<br />
[5] Chavdarova, Tatjana, and François Fleuret. "Sgan: An alternative training of generative adversarial networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9407-9415. 2018.<br />
<br />
[6] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian, "Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis", Evidence-Based Complementary and Alternative Medicine, vol. 2012, Article ID 135387, 9 pages, 2012. https://doi.org/10.1155/2012/135387</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=47379Semantic Relation Classification——via Convolution Neural Network2020-11-28T20:42:44Z<p>Y2587wan: /* Critiques */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma, Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use in the humanities and sciences (Gábor et al., 2018). SemEval 2018 Task 7 addresses the problem of extracting and classifying the relation between two entities in the same sentence into one of six potential relations: USAGE, RESULT, MODEL-FEATURE, PART-WHOLE, TOPIC, and COMPARE.<br />
<br />
SemEval 2018 Task 7 extracted data from 350 scientific paper abstracts, with 1228 and 1248 annotated sentences for the two subtasks, respectively. For each annotated pair, the sentence containing the entities was taken together with its left and right neighbouring sentences, as well as an indicator showing whether the relation is reversed, and then a prediction is made. <br />
<br />
Three models were used for the prediction: linear classifiers, long short-term memory networks (LSTM), and convolutional neural networks (CNN). In the end, the prediction from the CNN model was submitted, since it performed best among all models. By combining a custom-learned word embedding with a variant of negative sampling, the team improved performance beyond an ordinary CNN.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8 (Hendrickx et al., 2010) explored the classification of natural language relations and studied the 9 relations between word pairs. However, it is not designed for scientific text analysis, and their challenge differs from the challenge of this paper in its generalizability; this paper’s relations are specific to ACL papers (e.g. MODEL-FEATURE), whereas the 2010 relations are more general, and might necessitate more common-sense knowledge than the 2018 relations. Xu et al. (2015a) and Santos et al. (2015) both applied CNNs with negative sampling to relation classification. The 2017 SemEval Task 10 also featured relation extraction within scientific publications.<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px|center]]<br />
<br />
This is the architecture of the CNN. We first transform a sentence via feature embeddings: each token is represented by a continuous word embedding <math>e^{w_i}</math> and a word position embedding <math>e^{wp_i}</math>, concatenated as<br />
<br />
$$
e_i = [e^{w_i}, e^{wp_i}]
$$
<br />
For the word embeddings, we have a vocabulary <math> V </math>, and we build an embedding matrix indexed by each word's position in the vocabulary. This matrix is trainable and is initialized with pre-trained embedding vectors.<br />
For the word position embeddings, some input words are marked as ‘entities’, and they are the key for the model to determine the sentence’s relation. Given two entities, their relative positions in the sentence are used to build the<br />
embeddings. Two vectors are output: one tracks each token's position relative to the first entity (the entity itself is recorded as 0, the preceding word as -1, the following word as 1, etc.), and the same procedure is applied for the second entity. Finally, the two vectors are concatenated to form the position embedding.<br />
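As a concrete illustration of this relative-position scheme, the two offset vectors for a sentence with two marked entities can be computed as follows (a hypothetical helper; the paper gives no implementation details):

```python
# Sketch of the relative-position features described above: each token
# gets its offset from entity 1 and from entity 2 (the entity itself is
# 0, the previous token -1, the next token +1, and so on).

def relative_positions(n_tokens, entity1_idx, entity2_idx):
    pos1 = [i - entity1_idx for i in range(n_tokens)]
    pos2 = [i - entity2_idx for i in range(n_tokens)]
    return pos1, pos2

# e.g. a 6-token sentence with entities at positions 1 and 4:
pos1, pos2 = relative_positions(6, 1, 4)
# pos1 = [-1, 0, 1, 2, 3, 4], pos2 = [-4, -3, -2, -1, 0, 1]
```

These offsets are then looked up in a trainable position-embedding table, just as word indices are looked up in the word-embedding matrix.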
<br />
<br />
After the embeddings, the model transforms the embedded sentence into a fixed-size representation of the whole sentence via the convolution layer; after max pooling reduces the dimension of the layer outputs, a score for each relation class is obtained via a linear transformation.<br />
<br />
<br />
After featurizing all words, a sentence of length <math> N </math> can be expressed as a sequence<br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
where each entry represents a token of the sentence. To apply the<br />
convolutional neural network, each window of <math> k </math> consecutive features<br />
$$e_{i:i+k-1}=[e_{i},e_{i+1},\ldots,e_{i+k-1}]$$<br />
is given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defined as <br />
$$c_{i}=\text{tanh}(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all windows of length <math> k </math>, starting <br />
from the first one, producing the mapped feature vector:<br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
<br />
<br />
The max pooling operation is then applied, picking <math> \hat{c}=\max\{c\} </math>.<br />
With different weight filters, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then 5 features (<math> \hat{c} </math>) are picked to create <math> r_{x} </math> for each <math> x </math>.<br />
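The convolution and pooling stages above can be sketched as follows (a minimal NumPy illustration that flattens each length-k window, uses one weight vector per filter, and omits the bias for brevity; all names and dimensions are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Sketch of the convolution + max-pooling stage: e holds N token
# embeddings of dimension d (= d_w + 2*d_wp); each filter spans k
# consecutive tokens and contributes one pooled feature to r_x.

def conv_max_pool(e, filters, k):
    N, d = e.shape
    windows = np.stack([e[i:i + k].reshape(-1) for i in range(N - k + 1)])
    feats = []
    for W in filters:                      # one weight vector per filter
        c = np.tanh(windows @ W)           # c_1 .. c_{N-k+1}
        feats.append(c.max())              # max pooling picks c_hat
    return np.array(feats)                 # fixed-length representation r_x

rng = np.random.default_rng(0)
e = rng.normal(size=(7, 4))                # N=7 tokens, d=4
filters = [rng.normal(size=4 * 3) for _ in range(5)]   # 5 filters, k=3
r_x = conv_max_pool(e, filters, k=3)       # one pooled feature per filter
```

Note that the output length is the number of filters, not the sentence length, which is what makes <math> r_{x} </math> a fixed-size representation.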
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained, which represents the score for each class; <math> x </math>'s entities' relation is classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the matrix being trained.<br />
<br />
To improve the performance, "negative sampling" was used. Consider a training data point <br />
<math> \tilde{x} </math> with correct class <math> \tilde{y} </math>, and let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> \tilde{x} </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the largest incorrect score), should be minimized. So the loss function is <br />
$$L=\log(1+e^{\gamma(m^{+}-s(\tilde{x})_{\tilde{y}})})+\log(1+e^{\gamma(m^{-}+\mathtt{max}_{y'\in I}(s(\tilde{x})_{y'}))})$$<br />
with margins <math> m^{+} </math>, <math> m^{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on the ACL anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, <br />
of which 49,600 are unique.<br />
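This ranking loss can be illustrated with a small sketch (following the verbal description above, with the negative margin added to the largest incorrect score; the margin and gamma values are illustrative assumptions):

```python
import math

# Sketch of the negative-sampling ranking loss: the correct class score
# should exceed the positive margin m_plus, and the best *incorrect*
# score should fall below -m_minus; gamma scales the penalties.

def ranking_loss(scores, y, m_plus=2.5, m_minus=0.5, gamma=2.0):
    s_correct = scores[y]
    s_negative = max(s for c, s in enumerate(scores) if c != y)
    return (math.log1p(math.exp(gamma * (m_plus - s_correct)))
            + math.log1p(math.exp(gamma * (m_minus + s_negative))))

low = ranking_loss([4.0, -3.0, -2.0], y=0)   # confident, correct scores
high = ranking_loss([0.0, 3.0, -2.0], y=0)   # an incorrect class scores highest
```

A confident, correct score vector yields a much smaller loss than one where an incorrect class scores highest, which is exactly the behavior the margins enforce.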
<br />
== Results ==<br />
In machine learning, tuning the hyper-parameters is a crucial step. Beyond traditional hyper-parameter optimization, some<br />
modifications can be made to the model in order to increase performance on the test set. There are 5 modifications that can be applied:<br />
<br />
'''1.''' Merged Training Sets. Combining the two training sets increases the data set<br />
size and improves the balance between classes, yielding better predictions.<br />
<br />
'''2.''' Reversal Indicator Feature. A binary feature indicating whether the relation is reversed was added.<br />
<br />
'''3.''' Custom ACL Embeddings. Word embeddings were trained on an ACL-specific<br />
corpus.<br />
<br />
'''4.''' Context words. The size of a context window around the entity-enclosed text is varied<br />
within the sentence.<br />
<br />
'''5.''' Ensembling. Different early-stopping points and random initializations were used to improve<br />
the predictions.<br />
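The paper does not state how the ensemble members are combined; one common choice, assumed in the sketch below, is to average the per-class score vectors from the different runs and take the argmax:

```python
# Hypothetical sketch of ensembling: average the class-score vectors
# from runs with different seeds / early-stopping points, then pick
# the class with the highest average score.

def ensemble_predict(score_vectors):
    n_runs = len(score_vectors)
    n_classes = len(score_vectors[0])
    avg = [sum(run[c] for run in score_vectors) / n_runs
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

runs = [[0.2, 0.7, 0.1],   # run 1 favors class 1
        [0.6, 0.3, 0.1],   # run 2 favors class 0
        [0.1, 0.8, 0.1]]   # run 3 favors class 1
label = ensemble_predict(runs)   # class 1 wins on average
```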
<br />
These modifications performed well on the training data, as shown<br />
in Table 3.<br />
<br />
[[File:table3.PNG|center]]<br />
<br />
<br />
<br />
As we can see, the best choice for this model is ensembling, because the different random initializations diversify the individual models and help avoid overfitting.<br />
During the training process, some methods only<br />
increased the score on the cross-validation test sets but hurt the<br />
overall macro-F1 score. Thus, these methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG|center]]<br />
<br />
There are six submissions in total, three for each training set; the results<br />
are shown in Figure 2.<br />
<br />
The best submission for training set 1.1 is the third submission, which did not<br />
use cross-validation for early stopping. Instead, it ran a constant number of<br />
training epochs, chosen by cross-validation on the training data. The best submission for training set 1.2 is the first submission, which<br />
held out 10% of the training data as a validation set for selecting the test set predictions.<br />
All in all, early stopping cannot always be based on the accuracy of the validation set,<br />
since that cannot guarantee better performance on the real test set. Thus,<br />
we have to try new approaches and combine them together to see the prediction<br />
results. Also, stratification will certainly improve the performance on<br />
the test data.<br />
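A stratified hold-out like the one discussed above can be sketched as follows (a generic illustration, not the authors' code):

```python
from collections import defaultdict
import random

# Sketch of a stratified hold-out: sample a fraction of indices from
# each class so the validation set preserves the label proportions.

def stratified_holdout(labels, frac=0.1, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    val = []
    for idx in by_class.values():
        rng.shuffle(idx)
        val.extend(idx[:max(1, int(len(idx) * frac))])
    val_set = set(val)
    train = [i for i in range(len(labels)) if i not in val_set]
    return train, sorted(val)

labels = ["USAGE"] * 50 + ["TOPIC"] * 10
train, val = stratified_holdout(labels)   # holds out 5 USAGE + 1 TOPIC
```

Because the sampling is done per class, rare relations such as TOPIC are guaranteed representation in the validation split, unlike a plain random 10% cut.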
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forests, LSTM, and CNN models were tested, with variations applied to each model. Among all variations, the vanilla CNN with negative sampling and ACL embeddings performed significantly better than all others. Attention-based pooling, up-sampling, and data augmentation were also tested, but they yielded little to no improvement.<br />
<br />
== Critiques == <br />
<br />
- Applying this in news apps might be beneficial to improve readability by highlighting specific important sections.<br />
<br />
- In the section of previous work, the author mentioned 9 natural language relationships between the word pairs. Among them, 6 potential relationships are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE. It would help the readers to better understand if all 9 relationships are listed in the summary.<br />
<br />
- This topic is interesting, and this application might be helpful for educational websites that want to help readers focus on the important points. I think it would be nice to use LaTeX to type equations inline rather than centering each equation on the next line. It would also be interesting to discuss applying this approach to other languages such as Chinese, Japanese, etc.<br />
<br />
- It would be a good idea if the authors can provide more details regarding ACL Embeddings and Context words modifications. Scores generated using these two modifications are quite close to the highest Ensembling modification generated score, which makes it a valid consideration to examine these two modifications in detail. <br />
<br />
- This paper is dealing with a similar problem as 'Neural Speed Reading Via Skim-RNN', num 19 paper summary. It will be an interesting approach to compare these two models' performance based on the same dataset.<br />
<br />
- I think it would be highly practical to implement this system as a page-rank system for search engines (such as google, bing, or other platforms like Facebook, Instagram, etc.) by finding the most prevalent information available in a search query and then matching the search to the related text which can be found on webpages. This could also be implemented in search bars on specific websites or locations as well.<br />
<br />
== References ==<br />
Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
Dragomir R. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.<br />
<br />
Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna,<br />
and Thierry Charnois. 2018. SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. <br />
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.</div>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE.<br />
<br />
SemEval 2018 Task 7 extracted data from 350 scientific paper abstracts, which has 1228 and 1248 annotated sentences for two tasks, respectively. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Networks (CNN). In the end, the prediction based on the CNN model was finally submitted since it performed the best among all models. By using the learned custom word embedding function, the research team added a variant of negative sampling, thereby improving performance and surpassing ordinary CNN.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8 (Hendrickx et al., 2010) explored the classification of natural language relations and studied the 9 relations between word pairs. However, it is not designed for scientific text analysis, and their challenge differs from the challenge of this paper in its generalizability; this paper’s relations are specific to ACL papers (e.g. MODEL-FEATURE), whereas the 2010 relations are more general, and might necessitate more common-sense knowledge than the 2018 relations. Xu et al. (2015a) and Santos et al. (2015) both applied CNN with negative sampling to finish task7. The 2017 SemEval Task 10 also featured relation extraction within scientific publications.<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px|center]]<br />
<br />
This is the architecture of CNN. We first transform a sentence via Feature embeddings. Basically, we transform each sentence into continuous word embeddings:<br />
<br />
$$<br />
(e^{w_i})<br />
$$<br />
<br />
And word position embeddings:<br />
$$<br />
(e^{wp_i}): e_i = [e^{w_i}, e^{wp_i}]<br />
$$<br />
<br />
In the word embeddings, we got a vocabulary ‘V’, and we will make an embedding word matrix based on the position of the word in the vocabulary. This matrix is trainable and needs to be initialized by pre-trained embedding vectors.<br />
In the word position embeddings, we first need to input some words named ‘entities’ and they are the key for the machine to determine the sentence’s relation. During this process, if we have two entities, we will use the relative position of them in the sentence to make the<br />
embeddings. We will output two vectors and one of them keeps track of the first entity relative position in the sentence ( we will make the entity recorded as 0, the former word recorded as -1 and the next one 1, etc. ). And the same procedure for the second entity. Finally, we will get two vectors concatenated as the position embedding.<br />
<br />
<br />
After the embeddings, the model will transform the embedded sentence to a fix-sized representation of the whole sentence via the convolution layer, finally after the max-pooling to reduce the dimension of the output of the layers, we will get a score for each relation class via a linear transformation.<br />
<br />
<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
are given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=\text{tanh}(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor is produced:<br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=\log(1+e^{\gamma(m^{+}-s(x)_{y})})+\log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))})$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Results ==<br />
In machine learning, the most important part is to tune the hyper-parameters. Unlike traditional hyper-parameter optimization, there are some<br />
modifications to the model in order to increase performance on the test set. There are 5 modifications that we can apply:<br />
<br />
'''1.''' Merged Training Sets. It combined two training sets to increase the data set<br />
size and it improves the equality between classes to get better predictions.<br />
<br />
'''2.''' Reversal Indicate Features. It added a binary feature.<br />
<br />
'''3.''' Custom ACL Embeddings. It embedded a word vector to an ACL-specific<br />
corps.<br />
<br />
'''4.''' Context words. Within the sentence, it varies in size on a context window<br />
around the entity-enclosed text.<br />
<br />
'''5.''' Ensembling. It used different early stop and random initializations to improve<br />
the predictions.<br />
<br />
These modifications performances well on the training data and they are shown<br />
in table 3.<br />
<br />
[[File:table3.PNG|center]]<br />
<br />
<br />
<br />
As we can see the best choice for this model is ensembling. Because the random initialization made the data more natural and avoided the overfit.<br />
During the training process, there are some methods such that they can only<br />
increase the score on the cross-validation test sets but hurt the performance on<br />
the overall macro-F1 score. Thus, these methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG|center]]<br />
<br />
There are six submissions in total. Three for each training set and the result<br />
is shown in figure 2.<br />
<br />
The best submission for the training set 1.1 is the third submission which did not<br />
use the cross-validation as the test set. Instead, it runs a constant number of<br />
training epochs, and it can be chosen by cross-validation based on the training data. The best submission for the training set 1.2 is the first submission which<br />
extracted 10% of the training data as validation accuracy on the test set predictions.<br />
All in all, early stopping cannot always be based on the accuracy of the validation set<br />
since it cannot guarantee to get better performance on the real test set. Thus,<br />
we have to try new approaches and combine them together to see the prediction<br />
results. Also, doing stratification will certainly improve the performance of<br />
the test data.<br />
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM, and CNN models are tested. Variations are applied to the models. Among all variations, vanilla CNN with negative sampling and ACL-embedding has significantly better performance than all others. Attention-based pooling, up-sampling, and data augmentation are also tested, but they barely perform positive increment on the behavior.<br />
<br />
== Critiques == <br />
<br />
- Applying this in news apps might be beneficial to improve readability by highlighting specific important sections.<br />
<br />
- In the section of previous work, the author mentioned 9 natural language relationships between the word pairs. Among them, 6 potential relationships are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE. It would help the readers to better understand if all 9 relationships are listed in the summary.<br />
<br />
-This topic is interesting and this application might be helpful for some educational websites to improve their website to help readers focus on the important points. I think it will be nice to use Latex to type the equation in the sentence rather than center the equation on the next line. I think it will be interesting to discuss applying this way to other languages such as Chinese, Japanese, etc.<br />
<br />
- It would be a good idea if the authors can provide more details regarding ACL Embeddings and Context words modifications. Scores generated using these two modifications are quite close to the highest Ensembling modification generated score, which makes it a valid consideration to examine these two modifications in detail. <br />
<br />
- This paper is dealing with a similar problem as 'Neural Speed Reading Via Skim-RNN', num 19 paper summary. It will be an interesting approach to compare these two models' performance based on the same dataset.<br />
<br />
- I think it would be highly practical to implement this system as a page-rank system for search engines (such as google, bing, or other platforms like facebook, instagram, etc.) by finding the most prevalent information available in a search query and then matching the search to related text which can be found on webpages. This could also be implemented in search bars on specific websites or locations as well.<br />
<br />
== References ==<br />
Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
DragomirR. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.<br />
<br />
Kata Gbor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Hafa Zargayouna,<br />
and Thierry Charnois. 2018. Semeval-2018 task 7:Semantic relation extraction and classification in scientific papers. <br />
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval2018), New Orleans, LA, USA, June 2018.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=47377Semantic Relation Classification——via Convolution Neural Network2020-11-28T20:39:39Z<p>Y2587wan: /* Previous Work */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE.<br />
<br />
SemEval 2018 Task 7 extracted data from 350 scientific paper abstracts, which has 1228 and 1248 annotated sentences for two tasks, respectively. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Networks (CNN). In the end, the prediction based on the CNN model was finally submitted since it performed the best among all models. By using the learned custom word embedding function, the research team added a variant of negative sampling, thereby improving performance and surpassing ordinary CNN.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8 (Hendrickx et al., 2010) explored the classification of natural language relations and studied the 9 relations between word pairs. However, it is not designed for scientific text analysis, and their challenge differs from the challenge of this paper in its generalizability; this paper’s relations are specific to ACL papers (e.g. MODEL-FEATURE), whereas the 2010 relations are more general, and might necessitate more common-sense knowledge than the 2018 relations. Xu et al. (2015a) and Santos et al. (2015) both applied CNN with negative sampling to finish task7. The 2017 SemEval Task 10 also featured relation extraction within scientific publications.<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px|center]]<br />
<br />
This is the architecture of CNN. We first transform a sentence via Feature embeddings. Basically, we transform each sentence into continuous word embeddings:<br />
<br />
$$<br />
(e^{w_i})<br />
$$<br />
<br />
And word position embeddings:<br />
$$<br />
(e^{wp_i}): e_i = [e^{w_i}, e^{wp_i}]<br />
$$<br />
<br />
In the word embeddings, we got a vocabulary ‘V’, and we will make an embedding word matrix based on the position of the word in the vocabulary. This matrix is trainable and needs to be initialized by pre-trained embedding vectors.<br />
In the word position embeddings, we first need to input some words named ‘entities’ and they are the key for the machine to determine the sentence’s relation. During this process, if we have two entities, we will use the relative position of them in the sentence to make the<br />
embeddings. We will output two vectors and one of them keeps track of the first entity relative position in the sentence ( we will make the entity recorded as 0, the former word recorded as -1 and the next one 1, etc. ). And the same procedure for the second entity. Finally, we will get two vectors concatenated as the position embedding.<br />
<br />
<br />
After the embeddings, the model will transform the embedded sentence to a fix-sized representation of the whole sentence via the convolution layer, finally after the max-pooling to reduce the dimension of the output of the layers, we will get a score for each relation class via a linear transformation.<br />
<br />
<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
are given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=\text{tanh}(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor is produced:<br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=\log(1+e^{\gamma(m^{+}-s(x)_{y})})+\log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))})$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Results ==<br />
In machine learning, the most important part is to tune the hyper-parameters. Unlike traditional hyper-parameter optimization, there are some<br />
modifications to the model in order to increase performance on the test set. There are 5 modifications that we can apply:<br />
<br />
'''1.''' Merged Training Sets. It combined two training sets to increase the data set<br />
size and it improves the equality between classes to get better predictions.<br />
<br />
'''2.''' Reversal Indicate Features. It added a binary feature.<br />
<br />
'''3.''' Custom ACL Embeddings. It embedded word vector to an ACL-specific<br />
corps.<br />
<br />
'''4.''' Context words. Within the sentence, it varies in size on a context window<br />
around the entity-enclosed text.<br />
<br />
'''5.''' Ensembling. It used different early stop and random initializations to improve<br />
the predictions.<br />
<br />
These modifications performances well on the training data and they are shown<br />
in table 3.<br />
<br />
[[File:table3.PNG|center]]<br />
<br />
<br />
<br />
As we can see the best choice for this model is ensembling. Because the random initialization made the data more natural and avoided the overfit.<br />
During the training process, there are some methods such that they can only<br />
increase the score on the cross-validation test sets but hurt the performance on<br />
the overall macro-F1 score. Thus, these methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG|center]]<br />
<br />
There are six submissions in total. Three for each training set and the result<br />
is shown in figure 2.<br />
<br />
The best submission for the training set 1.1 is the third submission which did not<br />
use the cross-validation as the test set. Instead, it runs a constant number of<br />
training epochs, and it can be chosen by cross-validation based on the training data. The best submission for the training set 1.2 is the first submission which<br />
extracted 10% of the training data as validation accuracy on the test set predictions.<br />
All in all, early stopping cannot always be based on the accuracy of the validation set<br />
since it cannot guarantee to get better performance on the real test set. Thus,<br />
we have to try new approaches and combine them together to see the prediction<br />
results. Also, doing stratification will certainly improve the performance of<br />
the test data.<br />
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM, and CNN models are tested. Variations are applied to the models. Among all variations, vanilla CNN with negative sampling and ACL-embedding has significantly better performance than all others. Attention-based pooling, up-sampling, and data augmentation are also tested, but they barely perform positive increment on the behavior.<br />
<br />
== Critiques == <br />
<br />
- Applying this in news apps might be beneficial to improve readability by highlighting specific important sections.<br />
<br />
- In the section of previous work, the author mentioned 9 natural language relationships between the word pairs. Among them, 6 potential relationships are USAGE, RESULT, MODEL-FEATURE,PART WHOLE, TOPIC, and COMPARE. It would help the readers to better understand if all 9 relationships are listed in the summary.<br />
<br />
-This topic is interesting and this application might be helpful for some educational websites to improve their website to help readers focus on the important points. I think it will be nice to use Latex to type the equation in the sentence rather than center the equation on the next line. I think it will be interesting to discuss applying this way to other languages such as Chinese, Japanese, etc.<br />
<br />
- It would be a good idea if the authors can provide more details regarding ACL Embeddings and Context words modifications. Scores generated using these two modifications are quite close to the highest Ensembling modification generated score, which makes it a valid consideration to examine these two modifications in detail. <br />
<br />
- This paper deals with a similar problem to 'Neural Speed Reading Via Skim-RNN' (paper summary 19). It would be interesting to compare the two models' performance on the same dataset.<br />
<br />
- It would be highly practical to implement this system as a page-ranking component for search engines (such as Google or Bing) or platforms like Facebook and Instagram, by finding the most relevant information in a search query and matching it to related text on web pages. It could also be implemented in search bars on specific websites.<br />
<br />
== References ==<br />
Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
Dragomir R. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.<br />
<br />
Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna,<br />
and Thierry Charnois. 2018. SemEval-2018 task 7: Semantic relation extraction and classification in scientific papers.<br />
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, June 2018.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks&diff=47376Graph Structure of Neural Networks2020-11-28T20:37:30Z<p>Y2587wan: /* Critique */</p>
<hr />
<div>= Presented By =<br />
<br />
Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang<br />
<br />
= Introduction =<br />
<br />
A deep neural network is composed of neurons organized into layers and the connections between them. The architecture of a neural network can be captured by its "computational graph", where neurons are represented as nodes, and directed edges link neurons in different layers. This graphical representation demonstrates how the network transmits and transforms information from its input neurons, through the hidden layers, to the output neurons.<br />
<br />
In neural network research, it is often important to relate a neural network's accuracy to its underlying graph structure. A natural choice is the computational graph representation, but it has many limitations, including a lack of generality and a disconnect from biology and neuroscience. This disconnect makes knowledge less transferable and interdisciplinary research more difficult.<br />
<br />
Thus, the authors developed a new way of representing a neural network as a graph, called a relational graph. The key insight in the new representation is to focus on message exchange, rather than just on directed data flow. For example, for a fixed-width fully-connected layer, an input channel and output channel pair can be represented as a single node, while an edge in the relational graph can represent the message exchange between the two nodes. Under this formulation, using the appropriate message exchange definition, it can be shown that the relational graph can represent many types of neural network layers.<br />
<br />
WS-flex is a graph generator that allows for the systematic exploration of the design space of neural networks. Following insights from neuroscience, neural networks are characterized by the clustering coefficient and average path length of their relational graphs.<br />
<br />
= Neural Network as Relational Graph =<br />
<br />
The authors propose the concept of a relational graph to study the graphical structure of neural networks. Each relational graph is based on an undirected graph <math>G =(V, E)</math>, where <math>V =\{v_1,...,v_n\}</math> is the set of all the nodes, and <math>E \subseteq \{(v_i,v_j)|v_i,v_j\in V\}</math> is the set of all edges that connect nodes. Note that for the graphs used here, all nodes have self-edges, that is, <math>(v_i,v_i)\in E</math>. <br />
<br />
To build a relational graph that captures the message exchange between neurons in the network, we associate several mathematical quantities with the graph <math>G</math>. First, a feature quantity <math>x_v</math> is associated with each node. The quantity <math>x_v</math> might be a scalar, vector, or tensor depending on the type of neural network (see the table at the end of the section). Then a message function <math>f_{uv}(·)</math> is associated with every edge in the graph. A message function takes a node's feature as input and outputs a message. An aggregation function <math>{\rm AGG}_v(·)</math> then takes a set of messages (the outputs of the message functions) and outputs the updated node feature. <br />
<br />
A relational graph is a graph <math>G</math> associated with several rounds of message exchange, which transform the feature quantity <math>x_v</math> using the message function <math>f_{uv}(·)</math> and the aggregation function <math>{\rm AGG}_v(·)</math>. At each round of message exchange, each node sends messages to its neighbors and aggregates incoming messages from its neighbors. Each message is transformed at each edge by the message function, and the results are then aggregated at each node by the aggregation function. Suppose <math>r-1</math> rounds of message exchange have already been conducted; then the <math>r^{th}</math> round of message exchange for a node <math>v</math> can be described as<br />
<br />
<div style="text-align:center;"><math>\mathbf{x}_v^{(r+1)}= {\rm AGG}^{(r)}(\{f_v^{(r)}(\textbf{x}_u^{(r)}), \forall u\in N(v)\})</math></div> <br />
<br />
where <math>\mathbf{x}_v^{(r+1)}</math> is the feature of node <math>v</math> in the relational graph after the <math>r^{th}</math> round of updates, <math>u,v</math> are nodes in graph <math>G</math>, and <math>N(v)=\{u|(u,v)\in E\}</math> is the set of all neighbor nodes of <math>v</math> in graph <math>G</math>.<br />
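One round of message exchange as defined above can be sketched in a few lines of pure Python. The tiny 3-node graph, the uniform 0.5 edge weights, and the ReLU nonlinearity are illustrative choices; summation is used as the aggregator and scalar multiplication as the message function, as in the MLP special case discussed next:<br />

```python
# Relational graph: adjacency lists including self-edges, scalar node
# features, and one scalar weight per edge (the fixed-width MLP case).
N = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}      # N(v), with self-edges
x = {0: 1.0, 1: 2.0, 2: 3.0}                  # node features x_v
w = {(u, v): 0.5 for v in N for u in N[v]}    # message weights w_uv

def relu(z):
    return max(0.0, z)

def exchange_round(x, N, w, agg=sum, sigma=relu):
    """x_v^(r+1) = sigma( AGG_{u in N(v)} f_uv(x_u) ), with f_uv(x) = w_uv * x."""
    return {v: sigma(agg(w[(u, v)] * x[u] for u in N[v])) for v in N}

x_next = exchange_round(x, N, w)   # one round of message exchange
```

Repeating `exchange_round` <math>R</math> times corresponds to an <math>R</math>-layer network on this relational graph.<br />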
<br />
To further illustrate the above, we use the basic Multilayer Perceptron (MLP) as an example. An MLP consists of layers of neurons, where each neuron performs a weighted sum over scalar inputs and outputs, followed by some non-linearity. Suppose the <math>r^{th}</math> layer of an MLP takes <math>x^{(r)}</math> as input and <math>x^{(r+1)}</math> as output, then a neuron computes <br />
<br />
<div style="text-align:center;"><math>x_i^{(r+1)}= \sigma(\Sigma_jw_{ij}^{(r)}x_j^{(r)})</math>.</div> <br />
<br />
where <math>w_{ij}^{(r)}</math> is the trainable weight and <math>\sigma</math> is the non-linearity function. Let's first consider the special case where the input and output of all the layers <math>x^{(r)}</math>, <math>1 \leq r \leq R </math> have the same feature dimension <math>d</math>. In this scenario, we can have <math>d</math> nodes in the graph <math>G</math>, with each node representing a neuron in the MLP. Each layer of the neural network corresponds to a round of message exchange, so there are <math>R</math> rounds of message exchange in total. The aggregation function here is summation followed by the non-linear transform <math>\sigma(\Sigma)</math>, while the message function is simply scalar multiplication by the weight. A fully-connected, fixed-width MLP layer can then be expressed with a complete relational graph, where each node <math>x_v</math> connects to all the other nodes in <math>G</math>, that is, the neighborhood set is <math>N(v) = V</math> for each node <math>v</math>. The figure below shows the correspondence between the complete relational graph and a 5-layer, 4-dimensional fully-connected MLP.<br />
<br />
<div style="text-align:center;">[[File:fully_connnected_MLP.png]]</div><br />
<br />
In fact, a fixed-width fully-connected MLP is only a special case of a much more general model family, in which the message function, the aggregation function, and, most importantly, the relational graph structure can vary. Different relational graphs represent different topological structures and information-exchange patterns of the network, which is exactly the property the paper examines. The plot below shows two examples of non-fully-connected fixed-width MLPs and their corresponding relational graphs. <br />
<br />
<div style="text-align:center;">[[File:otherMLP.png]]</div><br />
<br />
We can generalize the above definitions from fixed-width MLPs to variable-width MLPs, Convolutional Neural Networks (CNNs), and other modern network architectures like ResNet by allowing the node feature quantity <math>\textbf{x}_j^{(r)}</math> to be a vector or tensor, respectively. In this case, each node in the relational graph represents multiple neurons in the network, and the number of neurons contained in each node at each round of message exchange need not be the same, which gives a flexible representation of different neural network architectures. The message function then changes from simple scalar multiplication to either matrix/tensor multiplication or convolution. The representations of these more complicated networks are described in detail in the paper, and the correspondence between different networks and their relational graph properties is summarized in the table below. <br />
<br />
<div style="text-align:center;">[[File:relational_specification.png]]</div><br />
<br />
Overall, relational graphs provide a general representation for neural networks. With proper definitions of node features and message exchange, relational graphs can represent diverse neural architectures, thereby allowing us to study the performance of different graph structures.<br />
<br />
= Exploring and Generating Relational Graphs=<br />
<br />
This section deals with the design and exploration of the space of relational graphs. There are three parts to consider:<br />
<br />
(1) '''Graph measures''' that characterize graph structural properties:<br />
<br />
The paper uses one global graph measure, average path length, and one local graph measure, the clustering coefficient.<br />
More precisely, the average path length measures the average shortest-path distance between any pair of nodes; the clustering coefficient measures the proportion of edges among the nodes within a given node's neighborhood, divided by the number of edges that could possibly exist between them, averaged over all nodes.<br />
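Both measures are straightforward to compute directly from the definitions; a self-contained sketch follows (self-edges are dropped, since they affect neither measure, and the complete graph on 4 nodes — the paper's baseline shape — is used as a sanity check):<br />

```python
from collections import deque
from itertools import combinations

def avg_path_length(adj):
    """Mean shortest-path distance over all ordered node pairs (BFS from
    each source; assumes the graph is connected)."""
    nodes = list(adj)
    total, pairs = 0, 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist[v] for v in nodes if v != s)
        pairs += len(nodes) - 1
    return total / pairs

def clustering_coefficient(adj):
    """Average over nodes of (edges among neighbours) / (possible edges)."""
    coeffs = []
    for v, nbrs in adj.items():
        nbrs = [u for u in nbrs if u != v]          # ignore self-edges
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(links / (k * (k - 1) / 2))
    return sum(coeffs) / len(coeffs)

# Complete graph on 4 nodes: L = 1 and C = 1, the baseline in the paper.
K4 = {v: [u for u in range(4) if u != v] for v in range(4)}
```

Graph libraries such as NetworkX provide equivalent built-in routines; the sketch only spells out what the two measures mean.<br />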
<br />
(2) '''Graph generators''' that can generate the diverse graph:<br />
<br />
With the graph measures selected, a graph generator is used to produce diverse graphs covering a large span of measure values. To understand the limitations of existing generators and identify the best one, several generators are investigated, including ER, WS, BA, Harary, ring, and complete graphs, with the results shown below:<br />
<br />
<div style="text-align:center;">[[File:3.2 graph generator.png]]</div><br />
<br />
Thus, as the picture shows, the WS-flex graph generator can generate graphs with wide coverage of the graph measures; notably, WS-flex graphs almost encompass all the graphs generated by the classic random generators mentioned above.<br />
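The WS-flex idea — relaxing the Watts–Strogatz constraint that every node has the same degree — can be sketched as follows. This is an illustrative variant, not necessarily the paper's exact sampling procedure: the average degree <math>k</math> may be non-integer, leftover edges beyond the ring lattice are placed uniformly at random, and each edge is rewired with probability <math>p</math>:<br />

```python
import random

def ws_flex(n, k, p, seed=0):
    """WS-flex sketch: n nodes, average degree k (need not be an even
    integer, but should satisfy k < n - 1), rewiring probability p.
    Returns a set of undirected edges (frozensets of two nodes)."""
    rng = random.Random(seed)
    e_total = round(n * k / 2)                 # target number of edges
    edges = set()
    # Ring-lattice part: connect each node to floor(k/2) clockwise neighbours.
    for d in range(1, int(k // 2) + 1):
        for v in range(n):
            edges.add(frozenset((v, (v + d) % n)))
    # "Flex" part: distribute leftover edges to random node pairs.
    while len(edges) < e_total:
        u, v = rng.sample(range(n), 2)
        edges.add(frozenset((u, v)))
    # Rewire each edge with probability p (collisions may slightly
    # reduce the final edge count; fine for a sketch).
    rewired = set()
    for edge in edges:
        u, v = tuple(edge)
        if rng.random() < p:
            candidates = [w for w in range(n)
                          if w != u and frozenset((u, w)) not in edges]
            if candidates:
                v = rng.choice(candidates)
        rewired.add(frozenset((u, v)))
    return rewired

g = ws_flex(n=64, k=5.5, p=0.3)   # 64 nodes, as in the paper's experiments
```

Sweeping <math>k</math> and <math>p</math> (and sampling repeatedly) traces out the wide region of (clustering coefficient, average path length) values shown in the figure.<br />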
<br />
(3) '''Computational Budget''' that we need to control so that the differences in performance of different neural networks are due to their diverse relational graph structures.<br />
<br />
It is important to ensure that all networks have approximately the same complexity, so that when neural networks are compared via their relational graphs, the differences in performance are due to the graph structures alone.<br />
<br />
We use FLOPS (# of multiply-adds) as the metric. We first compute the FLOPS of our baseline network instantiations (i.e. complete relational graph) and use them as the reference complexity in each experiment. From the description in section 2, a relational graph structure can be instantiated as a neural network with variable width. Therefore, we can adjust the width of a neural network to match the reference complexity without changing the relational graph structures.<br />
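Matching the reference complexity by adjusting width can be sketched as a simple search over widths: since FLOPs grow monotonically with width, a binary search finds the width whose FLOP count best matches the budget. The FLOP formula below is a simplified multiply-add count for a plain fully-connected MLP, and the layer sizes are illustrative assumptions:<br />

```python
def mlp_flops(width, n_hidden_layers=5, in_dim=3072, out_dim=10):
    """Multiply-adds of a plain fully-connected MLP with n_hidden_layers
    hidden layers of the given width (biases and nonlinearity ignored)."""
    flops = in_dim * width                          # input -> first hidden
    flops += (n_hidden_layers - 1) * width * width  # hidden -> hidden
    flops += width * out_dim                        # last hidden -> output
    return flops

def match_width(target_flops, lo=1, hi=4096):
    """Binary-search the hidden width whose FLOP count is closest to the
    reference budget (mlp_flops is strictly increasing in width)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if mlp_flops(mid) < target_flops:
            lo = mid + 1
        else:
            hi = mid
    below = max(lo - 1, 1)   # also check the width just under the budget
    return min(lo, below, key=lambda w: abs(mlp_flops(w) - target_flops))

# The width-512 baseline (complete relational graph) sets the budget;
# a sparser relational graph would then be re-widened to match it.
budget = mlp_flops(512)
```

For a sparser relational graph, the effective FLOPs per width are lower, so the search lands on a larger width, keeping total complexity roughly constant across graph structures.<br />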
<br />
= Experimental Setup =<br />
The author studied the performance of 3942 sampled relational graphs (generated by WS-flex from the last section) of 64 nodes with two experiments: <br />
<br />
(1) CIFAR-10 dataset: 10 classes, 50K training images, and 10K validation images<br />
<br />
Relational Graph: all 3942 sampled relational graphs of 64 nodes<br />
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Discussions and Conclusions =<br />
<br />
The paper summarizes the results of the experiments across the sampled relational graphs and lists six key observations:<br />
<br />
* There always exists a graph structure with higher predictive accuracy (lower top-1 error) than the complete graph.<br />
<br />
* There is a sweet spot: graph structures near it usually outperform the baseline graph.<br />
<br />
* Predictive accuracy (top-1 error) can be represented as a smooth function of the average path length <math> (L) </math> and the clustering coefficient <math> (C) </math>.<br />
<br />
* The results are consistent across multiple datasets and across graph structures with similar average path length and clustering coefficient.<br />
<br />
* The best graph structures can be identified efficiently.<br />
<br />
* The best-performing artificial neural networks are structurally similar to biological neural networks.<br />
<br />
----<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png]]<br />
<br />
<div style="text-align:center;">'''Figure - Results from Experiments'''</div><br />
<br />
== Neural networks performance depends on its structure ==<br />
During the experiments, top-1 errors were recorded for all sampled relational graphs across multiple tasks. Each graph is characterized by its average path length and clustering coefficient, and heat maps illustrate how predictive performance varies over these two measures. In '''Figure - Results from Experiments (a)(c)(f)''', darker areas represent smaller top-1 errors, i.e. better performance than lighter areas.<br />
<br />
Compared to the complete graph, which has <math> L = 1 </math> and <math> C = 1 </math>, the best-performing relational graph outperforms the complete-graph baseline by 1.4% top-1 error for the MLP on CIFAR-10, and by 0.5% to 1.2% for models on ImageNet. Hence the predictive performance of a neural network depends strongly on its graph structure; in particular, the complete graph does not always perform best.<br />
<br />
== Sweet spot where performance is significantly improved ==<br />
It is well recognized that training noise often produces inconsistent predictive results. In the paper, the 3942 sampled graphs are grouped into 52 bins, and each bin is colored based on the average performance of the graphs that fall into it. Taking the average significantly reduces the training noise. Based on the heat map in '''Figure - Results from Experiments (f)''', the well-performing graphs cluster into a region the paper calls the "sweet spot", shown in the red rectangle, which approximately covers clustering coefficients in <math>[0.1,0.7]</math> and average path lengths in <math>[1.5,3]</math>.<br />
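The binning-and-averaging step can be sketched as follows; the grid resolution and the sample (C, L, error) triples in the test below are illustrative, not the paper's actual values:<br />

```python
def bin_average(points, n_bins=5):
    """points: (clustering_coef, avg_path_len, top1_error) triples.
    Returns {(c_bin, l_bin): mean error} over a uniform n_bins x n_bins
    grid spanning the observed range of each measure."""
    cs = [c for c, _, _ in points]
    ls = [l for _, l, _ in points]
    c_lo, c_hi = min(cs), max(cs)
    l_lo, l_hi = min(ls), max(ls)
    sums, counts = {}, {}
    for c, l, err in points:
        # Clamp the max value into the last bin.
        i = min(int((c - c_lo) / (c_hi - c_lo) * n_bins), n_bins - 1)
        j = min(int((l - l_lo) / (l_hi - l_lo) * n_bins), n_bins - 1)
        sums[i, j] = sums.get((i, j), 0.0) + err
        counts[i, j] = counts.get((i, j), 0) + 1
    return {k: sums[k] / counts[k] for k in sums}
```

Averaging within bins smooths out per-run training noise, which is what makes the sweet-spot region visible in the heat map.<br />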
<br />
== Relationship between neural network’s performance and parameters == <br />
Visualizing the heat map shows that no significant jump in performance occurs with a small change in the clustering coefficient or average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed within a small range, a second-degree polynomial captures the overall trend well ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and the average path length relate to neural network performance in a U-shaped fashion. <br />
<br />
== Consistency among many different tasks and datasets ==<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper presents this consistency from two perspectives: qualitative consistency and quantitative consistency.<br />
<br />
(1) '''Qualitative Consistency'''<br />
The results are consistent across different settings. Across multiple architectures and datasets, graphs with clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Across different datasets, networks with similar clustering coefficients and average path lengths produce correlated results. The paper notes that although ResNet-34 is much more complex than a 5-layer MLP, a fixed set of relational graphs performs similarly in both settings, with a Pearson correlation of <math>0.658</math> and a p-value for the null hypothesis of less than <math>10^{-8}</math>.<br />
<br />
== Top architectures can be identified efficiently ==<br />
The computational cost of finding top architectures can be significantly reduced: there is no need to train every graph for many epochs or to sample a large number of graphs. Both the number of graphs and the number of training epochs required can be estimated. For the number of graphs, in the 5-layer MLP on CIFAR-10 example, a sample of around 52 graphs already yields a correlation of 0.9, indicating that far fewer samples suffice for a similar analysis in practice. For the number of epochs, in the ResNet-34 on ImageNet example, the correlation is already high enough within 3 epochs. This means that good relational graphs perform well even in the initial training epochs.<br />
<br />
== Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks==<br />
The way relational graphs and average path length are defined mirrors how information exchange is modeled in network science. Biological neural networks also admit a similar relational graph representation, and their graph measures are close to those of the best-performing relational graphs.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Critique =<br />
<br />
1. The experiments measure performance on only a few datasets, which might not be representative enough. The "sweet spot" discussed throughout the paper might be specific to the CIFAR-10 dataset; whether a similar result holds on another imaging dataset, such as CK+, is not shown. Hence, the conclusions drawn in the paper might not be representative enough. <br />
<br />
2. When fitting a model in practice, we train for more than one epoch, and the order of the training data should be randomized across epochs to create random jumps that help avoid getting stuck in a local minimum. With the same order in every epoch, the data might be grouped by class, and the model might perform better on certain classes and worse on others. In this particular example, without randomization of the training data, the conclusions might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is a theoretical justification for why these graph properties may (or may not) contribute to the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLP's?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What counts as the "best graph structure" in the discussion and conclusion? An intermediate step toward an accurate result was binning the graphs into smaller bins; what should be done if the graphs cannot be binned into significantly smaller bins, as the methodology requires? Both CIFAR-10 and ImageNet also seem limited in the amount of variation and number of categories they contain. How well would the results generalize to other kinds of image data?<br />
<br />
6. There is an interesting insight that the idea of the relational graph resembles applying causal graphs to neural networks, which is also closer to biology and neuroscience, since human beings learn based on causality. This approach may lead to higher prediction accuracy, but it needs more assumptions, such as correct relations and causalities.<br />
<br />
7. This is an interesting topic that uses knowledge from graph theory to introduce a new structure for neural networks. Evaluating this approach on more datasets, such as MNIST, would be interesting. It would also be worth discussing whether this structure provides better performance than the "traditional" structure in other types of neural networks.</div>
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Discussions and Conclusions =<br />
<br />
The paper summarizes the result of the experiment among multiple different relational graphs through sampling and analyzing and list six important observations during the experiments, These are:<br />
<br />
* There are always exists graph structure that has higher predictive accuracy under Top-1 error compare to the complete graph<br />
<br />
* There is a sweet spot that the graph structure near the sweet spot usually outperform the base graph<br />
<br />
* The predictive accuracy under top-1 error can be represented by a smooth function of Average Path Length <math> (L) </math> and Clustering Coefficient <math> (C) </math><br />
<br />
* The Experiments is consistent across multiple datasets and multiple graph structure with similar Average Path Length and Clustering Coefficient.<br />
<br />
* The best graph structure can be identified easily.<br />
<br />
* There is a similarity between the best artificial neurons and biological neurons.<br />
<br />
----<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png]]<br />
<br />
$$\text{Figure - Results from Experiments}$$<br />
<br />
== Neural networks performance depends on its structure ==<br />
During the experiment, Top-1 errors for all sampled relational graph among multiple tasks and graph structures are recorded. The parameters of the models are average path length and clustering coefficient. Heat maps were created to illustrate the difference in predictive performance among possible average path length and clustering coefficient. In '''Figure - Results from Experiments (a)(c)(f)''', The darker area represents a smaller top-1 error which indicates the model performs better than the light area.<br />
<br />
Compared to the complete graph which has parameter <math> L = 1 </math> and <math> C = 1 </math>, the best performing relational graph can outperform the complete graph baseline by 1.4% top-1 error for MLP on CIFAR-10, and 0.5% to 1.2% for models on ImageNet. Hence it is an indicator that the predictive performance of the neural networks highly depends on the graph structure, or equivalently that the completed graph does not always have the best performance.<br />
<br />
== Sweet spot where performance is significantly improved ==<br />
It had been recognized that training noises often results in inconsistent predictive results. In the paper, the 3942 graphs in the sample had been grouped into 52 bins, each bin had been colored based on the average performance of graphs that fall into the bin. By taking the average, the training noises had been significantly reduced. Based on the heat map '''Figure - Results from Experiments (f)''', the well-performing graphs tend to cluster into a special spot that the paper called “sweet spot” shown in the red rectangle, the rectangle is approximately included clustering coefficient in the range <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math>.<br />
<br />
== Relationship between neural network’s performance and parameters == <br />
When we visualize the heat map, we can see that there is no significant jump of performance that occurred as a small change of clustering coefficient and average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed in a small range, it is observed that a second-degree polynomial is a good visualization tool for the overall trend ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and average path length are highly related to neural network performance by a U-shape. <br />
<br />
== Consistency among many different tasks and datasets ==<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper presents consistency uses two perspectives, one is qualitative consistency and another one is quantitative consistency.<br />
<br />
(1) '''Qualitative Consistency'''<br />
It is observed that the results are consistent from different points of view. Among multiple architecture dataset, it is observed that the clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Among different dataset with the network that has similar clustering coefficient and average path length, the results are correlated, The paper mentioned that ResNet-34 is much more complex than 5-layer MLP but a fixed set relational graph would perform similarly in both settings, with Pearson correlation of <math>0.658</math>, the p-value for the Null hypothesis is less than <math>10^{-8}</math>.<br />
<br />
== Top architectures can be identified efficiently ==<br />
The computation cost of finding top architectures can be significantly reduced without training the entire data set for a large value of epoch or a relatively large sample. To achieve the top architectures, the number of graphs and training epochs need to be identified. For the number of graphs, a heatmap is a great tool to demonstrate the result. In the 5-layer MLP on CIFAR-10 example, taking a sample of the data around 52 graphs would have a correlation of 0.9, which indicates that fewer samples are needed for a similar analysis in practice. When determining the number of epochs, correlation can help to show the result. In ResNet34 on ImageNet example, the correlation between the variables is already high enough for future computation within 3 epochs. This means that good relational graphs perform well even at the<br />
initial training epochs.<br />
<br />
== Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks==<br />
The way we define relational graphs and average length in the graph is similar to the way information is exchanged in network science. The biological neural network also has a similar relational graph representation and graph measure with the best-performing relational graph.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Critique =<br />
<br />
1. The experiment is only measuring on a single data set which might not be representative enough. As we can see in the whole paper, the "sweet spot" we talked about might be a special feature for the given data set only which is the CIFAR-10 data set. If we change the data set to another imaging data set like CK+, whether we are going to get a similar result is not shown by the paper. Hence, the result that is being concluded from the paper might not be representative enough. <br />
<br />
2. When we are fitting the model in practice, we will fit the model with more than one epoch. The order of the model fitting should be randomized since we should create more random jumps to avoid staked inside a local minimum. With the same order within each epoch, the data might be grouped by different classes or levels, the model might result in a better performance with certain classes and worse performance with other classes. In this particular example, without randomization of the training data, the conclusion might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is the theoretical justification for why these graph properties may (or may not) contribute the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLP's?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What would be considered a "best graph structure " in the discussion and conclusion part? It seems that the intermediate result of getting an accurate result was by binning graphs into smaller bins, what should we do if the graphs can not be binned into significantly smaller bins in order to proceed with the methodologies mentioned in the paper. Both CIFAR - 10 and ImageNet seem too trivial considering the amount of variation and categories in the dataset. What would the generalizability be to other presentation of images?<br />
<br />
6. There is an interesting insight that the idea of relational graph is kind of similar to applying causal graph in neuro networks, which is also closer to biology and neuroscience because human beings learning things based on causality. This new approach may lead to higher prediction accuracy but it needs more assumption, such as correct relations and causalities.<br />
<br />
7. This is an interesting topic that uses the knowledge in graph theory to introduce this new structure of Neural Networks. Using more data sets to discuss this approach might be more interesting, such as MNIST dataset. I think it is interesting to discuss whether this structure will provide a better performance compare to the "traditional" structure of NN in any type of Neural Networks.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks&diff=47374Graph Structure of Neural Networks2020-11-28T20:33:43Z<p>Y2587wan: /* Neural Network as Relational Graph */</p>
<hr />
<div>= Presented By =<br />
<br />
Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang<br />
<br />
= Introduction =<br />
<br />
A deep neural network is composed of neurons organized into layers and the connections between them. The architecture of a neural network can be captured by its "computational graph", where neurons are represented as nodes, and directed edges link neurons in different layers. This graphical representation demonstrates how the network transmits and transforms information from its input neurons, through the hidden layers, and ultimately to the output neurons.<br />
<br />
In neural network research, it is often important to build a relationship between a neural network’s accuracy and its underlying graph structure. A natural choice is the computational graph representation, but it has many limitations, including a lack of generality and a disconnect from biology/neuroscience. This disconnect makes knowledge less transferable and interdisciplinary research more difficult.<br />
<br />
Thus, the authors developed a new way of representing a neural network as a graph, called a relational graph. The key insight in the new representation is to focus on message exchange, rather than just on directed data flow. For example, for a fixed-width fully-connected layer, an input channel and output channel pair can be represented as a single node, while an edge in the relational graph can represent the message exchange between the two nodes. Under this formulation, using the appropriate message exchange definition, it can be shown that the relational graph can represent many types of neural network layers.<br />
<br />
WS-flex is a graph generator that allows for the systematic exploration of the design space of neural networks. Drawing on insights from neuroscience, neural networks are characterized by the clustering coefficient and average path length of their relational graphs.<br />
<br />
= Neural Network as Relational Graph =<br />
<br />
The authors propose the concept of a relational graph to study the graph structure of neural networks. Each relational graph is based on an undirected graph <math>G =(V; E)</math>, where <math>V =\{v_1,...,v_n\}</math> is the set of all nodes, and <math>E \subseteq \{(v_i,v_j)|v_i,v_j\in V\}</math> is the set of all edges that connect nodes. Note that for the graphs used here, all nodes have self-edges, that is <math>(v_i,v_i)\in E</math>. <br />
<br />
To build a relational graph that captures the message exchange between neurons in the network, we associate various mathematical quantities with the graph <math>G</math>. First, a feature quantity <math>x_v</math> is associated with each node. The quantity <math>x_v</math> might be a scalar, vector, or tensor depending on the type of neural network (see the table at the end of the section). Then a message function <math>f_{uv}(·)</math> is associated with every edge in the graph. A message function takes a node’s feature as input and outputs a message. An aggregation function <math>{\rm AGG}_v(·)</math> then takes a set of messages (the outputs of the message functions) and outputs the updated node feature. <br />
<br />
A relational graph is a graph <math>G</math> associated with several rounds of message exchange, which transform the feature quantities <math>x_v</math> using the message functions <math>f_{uv}(·)</math> and the aggregation functions <math>{\rm AGG}_v(·)</math>. At each round of message exchange, each node sends messages to its neighbors and aggregates incoming messages from its neighbors. Each message is transformed at each edge through the message function, and the messages are then aggregated at each node via the aggregation function. Suppose we have already conducted <math>r-1</math> rounds of message exchange; then the <math>r^{th}</math> round of message exchange for a node <math>v</math> can be described as<br />
<br />
<div style="text-align:center;"><math>\mathbf{x}_v^{(r+1)}= {\rm AGG}^{(r)}(\{f_{uv}^{(r)}(\textbf{x}_u^{(r)}), \forall u\in N(v)\})</math></div> <br />
<br />
where <math>\mathbf{x}_v^{(r+1)}</math> is the feature of node <math>v</math> in the relational graph after the <math>r^{th}</math> round of updates. <math>u,v</math> are nodes in graph <math>G</math>, and <math>N(v)=\{u|(u,v)\in E\}</math> is the set of all neighbor nodes of <math>v</math> in graph <math>G</math>.<br />
<br />
To further illustrate the above, we use the basic Multilayer Perceptron (MLP) as an example. An MLP consists of layers of neurons, where each neuron performs a weighted sum over scalar inputs and outputs, followed by some non-linearity. Suppose the <math>r^{th}</math> layer of an MLP takes <math>x^{(r)}</math> as input and <math>x^{(r+1)}</math> as output, then a neuron computes <br />
<br />
<div style="text-align:center;"><math>x_i^{(r+1)}= \sigma(\Sigma_jw_{ij}^{(r)}x_j^{(r)})</math>.</div> <br />
<br />
where <math>w_{ij}^{(r)}</math> is the trainable weight and <math>\sigma</math> is the non-linearity function. Let's first consider the special case where the input and output of all the layers <math>x^{(r)}</math>, <math>1 \leq r \leq R </math>, have the same feature dimension <math>d</math>. In this scenario, we can have <math>d</math> nodes in the graph <math>G</math>, with each node representing a neuron in the MLP. Each layer of the neural network corresponds to a round of message exchange, so there will be <math>R</math> rounds of message exchange in total. The aggregation function here is the summation followed by the non-linear transform <math>\sigma(\Sigma)</math>, while the message function is simply scalar multiplication by the weight. A fully-connected, fixed-width MLP layer can then be expressed with a complete relational graph, where each node <math>v</math> connects to all the other nodes in <math>G</math>, that is, the neighborhood set is <math>N(v) = V</math> for each node <math>v</math>. The figure below shows the correspondence between the complete relational graph and a 5-layer 4-dimensional fully-connected MLP.<br />
<br />
<div style="text-align:center;">[[File:fully_connnected_MLP.png]]</div><br />
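To make this correspondence concrete, here is a small numerical sketch (hypothetical code, not from the paper) of one round of message exchange on a relational graph with scalar node features. With a complete graph, and summation followed by a non-linearity as the aggregation function, the round reduces exactly to a dense MLP layer:

```python
import numpy as np

# Hypothetical sketch: one round of message exchange on a 4-node relational
# graph, reproducing one layer of a fixed-width 4-dimensional MLP.
rng = np.random.default_rng(0)

n = 4
# Adjacency of the relational graph, including self-edges (complete graph here).
A = np.ones((n, n))

# Edge-wise scalar weights play the role of the message functions f_uv.
W = rng.normal(size=(n, n))

def relu(z):
    return np.maximum(z, 0.0)

def message_exchange_round(x, A, W):
    # Each node u sends the message W[v, u] * x[u] to every neighbor v;
    # node v aggregates incoming messages by a sum followed by a nonlinearity.
    return relu((W * A) @ x)

x = rng.normal(size=n)
y = message_exchange_round(x, A, W)

# With a complete graph, this is exactly a dense layer: relu(W @ x).
assert np.allclose(y, relu(W @ x))
```

Replacing <code>A</code> with the adjacency matrix of a sparser relational graph yields the corresponding non-fully-connected MLP layer.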
<br />
In fact, a fixed-width fully-connected MLP is only a special case of a much more general model family, in which the message function, the aggregation function, and most importantly, the relational graph structure can vary. Different relational graphs represent different topological structures and information-exchange patterns of the network, which are the properties the paper wants to examine. The plot below shows two examples of non-fully-connected fixed-width MLPs and their corresponding relational graphs. <br />
<br />
<div style="text-align:center;">[[File:otherMLP.png]]</div><br />
<br />
We can generalize the above definitions for fixed-width MLPs to variable-width MLPs, Convolutional Neural Networks (CNNs), and other modern network architectures like ResNet by allowing the node feature quantity <math>\textbf{x}_j^{(r)}</math> to be a vector or tensor, respectively. In this case, each node in the relational graph represents multiple neurons in the network, and the number of neurons contained in each node at each round of message exchange does not need to be the same, which gives us a flexible representation of different neural network architectures. The message function then changes from simple scalar multiplication to either matrix/tensor multiplication or convolution. The representation of these more complicated networks is described in detail in the paper, and the correspondence between different networks and their relational graph properties is summarized in the table below. <br />
<br />
<div style="text-align:center;">[[File:relational_specification.png]]</div><br />
<br />
Overall, relational graphs provide a general representation for neural networks. With proper definitions of node features and message exchange, relational graphs can represent diverse neural architectures, thereby allowing us to study the performance of different graph structures.<br />
<br />
= Exploring and Generating Relational Graphs=<br />
<br />
This section deals with how to design and explore the space of relational graphs. There are three parts we need to consider:<br />
<br />
(1) '''Graph measures''' that characterize graph structural properties:<br />
<br />
This paper uses one global graph measure, average path length, and one local graph measure, the clustering coefficient.<br />
The average path length measures the average shortest-path distance between any pair of nodes; the clustering coefficient measures the proportion of edges between the nodes within a given node’s neighborhood, divided by the number of edges that could possibly exist between them, averaged over all nodes.<br />
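Both measures are standard in network science and can be computed directly with common graph libraries. The following sketch is illustrative only (the graph type and parameters are made up, not taken from the paper) and uses networkx:

```python
import networkx as nx

# Sketch: compute the two graph measures used in the paper for a
# small-world graph (self-loops are ignored here for simplicity).
G = nx.connected_watts_strogatz_graph(n=64, k=8, p=0.1, tries=100, seed=0)

# Global measure: average shortest-path distance over all node pairs.
L = nx.average_shortest_path_length(G)

# Local measure: clustering coefficient averaged over all nodes.
C = nx.average_clustering(G)

print(f"average path length L = {L:.3f}, clustering coefficient C = {C:.3f}")
```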
<br />
(2) '''Graph generators''' that can generate diverse graphs:<br />
<br />
With the graph measures selected, we use a graph generator to produce diverse graphs covering a large span of graph measures. To understand the limitations of different graph generators and identify the best one, the authors investigate several classic generators, including ER (Erdős–Rényi), WS (Watts–Strogatz), BA (Barabási–Albert), Harary, ring, and complete graphs; the results are shown below:<br />
<br />
<div style="text-align:center;">[[File:3.2 graph generator.png]]</div><br />
<br />
As the figure shows, the WS-flex graph generator can generate graphs with wide coverage of both graph measures; notably, WS-flex graphs almost encompass all the graphs generated by the classic random generators mentioned above.<br />
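The WS-flex generator relaxes the Watts–Strogatz constraint that every node has the same degree, which lets the average degree <math>k</math> take non-integer values and widens the span of reachable graph measures. A rough sketch of this idea follows; the paper's actual implementation may differ in details such as how the extra edges are placed:

```python
import random
import networkx as nx

def ws_flex(n, k, p, seed=0):
    """Sketch of a WS-flex-style generator (assumed, not the paper's code):
    a ring lattice whose total edge count matches a possibly non-integer
    average degree k, followed by Watts-Strogatz-style rewiring with prob p."""
    rng = random.Random(seed)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    # Regular ring-lattice part: each node linked to floor(k/2) neighbors per side.
    base = int(k // 2)
    for i in range(n):
        for j in range(1, base + 1):
            G.add_edge(i, (i + j) % n)
    # Distribute the remaining edges (the fractional part of k) at random.
    extra = int(round(n * k / 2)) - G.number_of_edges()
    while extra > 0:
        i = rng.randrange(n)
        j = (i + base + 1) % n  # next-nearest unused ring neighbor
        if i != j and not G.has_edge(i, j):
            G.add_edge(i, j)
            extra -= 1
    # Watts-Strogatz-style rewiring; skipped moves keep the edge count fixed.
    for u, v in list(G.edges()):
        if rng.random() < p:
            w = rng.randrange(n)
            if w != u and not G.has_edge(u, w):
                G.remove_edge(u, v)
                G.add_edge(u, w)
    return G

G = ws_flex(64, 5.5, 0.2)
print(G.number_of_edges())  # 176 edges, i.e. average degree 5.5
```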
<br />
(3) '''Computational Budget''' that we need to control so that the differences in performance of different neural networks are due to their diverse relational graph structures.<br />
<br />
It is important to ensure that all networks have approximately the same complexity so that, when comparing neural networks instantiated from different graphs, the differences in performance are due to their relational graph structures.<br />
<br />
We use FLOPS (# of multiply-adds) as the metric. We first compute the FLOPS of our baseline network instantiations (i.e. complete relational graph) and use them as the reference complexity in each experiment. From the description in section 2, a relational graph structure can be instantiated as a neural network with variable width. Therefore, we can adjust the width of a neural network to match the reference complexity without changing the relational graph structures.<br />
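As a simplified illustration of this width-matching idea (the numbers and the per-edge cost model below are assumptions, not the paper's exact FLOP accounting): if one layer of the complete-graph baseline costs roughly <math>n^2 w^2</math> multiply-adds for <math>n</math> nodes of per-node width <math>w</math>, a sparser relational graph with fewer edges can be assigned a larger per-node width so that the totals match.

```python
import math

def matched_width(n_nodes, baseline_width, n_edges_sparse):
    """Sketch: choose a per-node width for a sparse relational graph so that
    its per-layer multiply-add count matches the complete-graph baseline.
    Assumes each (directed) edge costs width**2 multiply-adds, which is a
    simplification of the real FLOP accounting."""
    baseline_flops = (n_nodes ** 2) * baseline_width ** 2  # complete graph
    return int(math.sqrt(baseline_flops / n_edges_sparse))

# 64 nodes of width 8 each (512 hidden units total); a sparse graph with 1024
# directed edges instead of the complete graph's 64 * 64 = 4096, i.e. a
# quarter of the edges, gets double the per-node width.
print(matched_width(64, 8, 1024))  # -> 16
```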
<br />
= Experimental Setup =<br />
The authors studied the performance of 3942 sampled relational graphs (generated by WS-flex as in the last section) of 64 nodes with two experiments: <br />
<br />
(1) CIFAR-10 dataset: 10 classes, 50K training images, and 10K validation images<br />
<br />
Relational Graph: all 3942 sampled relational graphs of 64 nodes<br />
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Discussions and Conclusions =<br />
<br />
The paper summarizes the results of the experiments across many different relational graphs, obtained through sampling and analysis, and lists six key observations. These are:<br />
<br />
* There always exist graph structures with higher predictive accuracy (lower top-1 error) than the complete graph<br />
<br />
* There is a sweet spot: graph structures near it usually outperform the baseline graph<br />
<br />
* The predictive accuracy under top-1 error can be represented by a smooth function of Average Path Length <math> (L) </math> and Clustering Coefficient <math> (C) </math><br />
<br />
* The findings are consistent across multiple datasets and across graph structures with similar Average Path Length and Clustering Coefficient<br />
<br />
* The best graph structures can be identified efficiently<br />
<br />
* There is a similarity between the best-performing artificial networks and biological neural networks<br />
<br />
----<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png]]<br />
<br />
<div style="text-align:center;">'''Figure - Results from Experiments'''</div><br />
<br />
== Neural network performance depends on its structure ==<br />
During the experiments, top-1 errors for all sampled relational graphs across multiple tasks and graph structures were recorded. The parameters of the models are average path length and clustering coefficient. Heat maps were created to illustrate the differences in predictive performance across possible average path lengths and clustering coefficients. In '''Figure - Results from Experiments (a)(c)(f)''', the darker areas represent smaller top-1 errors, indicating that the model performs better there than in the lighter areas.<br />
<br />
Compared to the complete graph, which has parameters <math> L = 1 </math> and <math> C = 1 </math>, the best-performing relational graph can outperform the complete graph baseline by 1.4% top-1 error for the MLP on CIFAR-10, and by 0.5% to 1.2% for models on ImageNet. Hence this indicates that the predictive performance of a neural network depends highly on its graph structure, or equivalently, that the complete graph does not always have the best performance.<br />
<br />
== Sweet spot where performance is significantly improved ==<br />
It is well recognized that training noise often results in inconsistent predictive results. In the paper, the 3942 graphs in the sample were grouped into 52 bins, and each bin was colored based on the average performance of the graphs that fall into it. Taking the average significantly reduces the training noise. Based on the heat map in '''Figure - Results from Experiments (f)''', the well-performing graphs tend to cluster into a special region that the paper calls the “sweet spot”, shown in the red rectangle, which approximately covers clustering coefficients in the range <math>[0.1,0.7]</math> and average path lengths within <math>[1.5,3]</math>.<br />
<br />
== Relationship between neural network’s performance and parameters == <br />
When we visualize the heat map, we can see that there is no significant jump in performance with small changes in the clustering coefficient and average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed in a small range, a second-degree polynomial fits the overall trend well ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and the average path length are related to neural network performance through a U-shaped relationship. <br />
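The quadratic-trend idea can be illustrated as follows, with synthetic stand-in data (the paper's actual measurements are not reproduced here): fit a second-degree polynomial to (measure, top-1 error) pairs and read off the minimum, which approximates the sweet-spot location along that measure.

```python
import numpy as np

# Synthetic stand-in for (clustering coefficient, top-1 error) pairs that
# follow a noisy U-shape; the values are made up for illustration.
rng = np.random.default_rng(1)
C = rng.uniform(0.0, 1.0, size=200)
err = 33.0 + 4.0 * (C - 0.4) ** 2 + rng.normal(scale=0.1, size=C.size)

# Second-degree polynomial fit: err ~ a*C^2 + b*C + c.
a, b, c = np.polyfit(C, err, deg=2)

# The fitted parabola's minimum approximates the sweet-spot location.
c_star = -b / (2 * a)
print(f"estimated sweet spot near C = {c_star:.2f}")
```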
<br />
== Consistency among many different tasks and datasets ==<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper examines this consistency from two perspectives: qualitative and quantitative.<br />
<br />
(1) '''Qualitative Consistency'''<br />
It is observed that the results are consistent across different settings. Across multiple architectures and datasets, graphs with clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Across different datasets, networks with similar clustering coefficients and average path lengths yield correlated results. The paper mentions that ResNet-34 is much more complex than a 5-layer MLP, yet a fixed set of relational graphs performs similarly in both settings, with a Pearson correlation of <math>0.658</math> and a p-value for the null hypothesis of less than <math>10^{-8}</math>.<br />
<br />
== Top architectures can be identified efficiently ==<br />
The computational cost of finding top architectures can be significantly reduced: it is not necessary to train on the full set of graphs or for a large number of epochs. To identify top architectures cheaply, suitable numbers of graphs and training epochs need to be determined. For the number of graphs, a heat map is a useful tool to demonstrate the result: in the 5-layer MLP on CIFAR-10 example, a sample of around 52 graphs already achieves a correlation of 0.9 with the full results, which indicates that fewer samples are needed for a similar analysis in practice. For the number of epochs, correlation again shows the result: in the ResNet-34 on ImageNet example, the correlation between early and final performance is already high enough within 3 epochs. This means that good relational graphs perform well even in the initial training epochs.<br />
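The early-identification shortcut amounts to checking how well performance after a few epochs predicts final performance across the sampled graphs. A sketch of that check with made-up accuracy numbers (the variable names and values are hypothetical, not the paper's data):

```python
import numpy as np

# Made-up example: validation accuracy of 52 sampled graphs after 3 epochs
# versus after full training. If the early and final numbers correlate
# strongly, the cheap 3-epoch run is enough to rank the graphs.
rng = np.random.default_rng(2)
acc_final = rng.uniform(0.60, 0.75, size=52)
acc_epoch3 = acc_final - 0.30 + rng.normal(scale=0.01, size=52)  # early, noisier

r = np.corrcoef(acc_epoch3, acc_final)[0, 1]
print(f"Pearson r between epoch-3 and final accuracy: {r:.2f}")
```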
<br />
== Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks ==<br />
The way relational graphs and average path length are defined here is similar to the way information exchange is characterized in network science. Biological neural networks also have relational graph representations and graph measures similar to those of the best-performing relational graphs.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Critique =<br />
<br />
1. The experiment measures only a limited set of datasets, which might not be representative enough. As we can see throughout the paper, the "sweet spot" discussed might be a feature specific to the given dataset, namely CIFAR-10. If we change to another imaging dataset such as CK+, the paper does not show whether we would get a similar result. Hence, the conclusions drawn in the paper might not be representative enough. <br />
<br />
2. When fitting a model in practice, we fit it for more than one epoch. The order of the training data should be randomized between epochs, since random shuffling creates jumps that help avoid getting stuck in a local minimum. With the same order in each epoch, the data might be grouped by classes or levels, and the model might perform better on certain classes and worse on others. In this particular example, without randomization of the training data, the conclusions might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is a theoretical justification for why these graph properties may (or may not) contribute to the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLPs?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What would count as a "best graph structure" in the discussion and conclusion? It seems the intermediate step toward an accurate result was binning graphs into smaller bins; what should we do if the graphs cannot be binned into significantly smaller bins, as the methodology in the paper requires? Both CIFAR-10 and ImageNet seem too trivial considering the amount of variation and categories in real-world data. How generalizable would the approach be to other representations of images?<br />
<br />
6. There is an interesting insight that the idea of a relational graph is similar to applying a causal graph to neural networks, which is also closer to biology and neuroscience because human beings learn through causality. This new approach may lead to higher prediction accuracy, but it requires more assumptions, such as correct relations and causalities.<br />
<br />
7. This is an interesting topic that uses knowledge from graph theory to introduce a new structure for neural networks. It would be interesting to evaluate this approach on more data sets, such as MNIST, and to discuss whether this structure outperforms the "traditional" structure in other types of neural networks.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=47372Superhuman AI for Multiplayer Poker2020-11-28T20:29:48Z<p>Y2587wan: /* Discussion and Critiques */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most superhuman AIs could only beat human players in two-player zero-sum games. The most common strategy such AIs use is to approximate a Nash equilibrium: a pair of strategies such that either player switching to any ''other'' strategy (while the opponent's strategy remains unchanged) receives a lower payout. Intuitively this resembles a locally optimal strategy for the players, but (i) a pure-strategy Nash equilibrium is not guaranteed to exist (although a mixed-strategy equilibrium always exists in finite games) and (ii) it may not be the truly optimal outcome - in the "Prisoner's dilemma", for example, the Nash equilibrium of both players betraying each other is worse for both than mutual cooperation.<br />
<br />
More specifically, in the game of poker, we previously only had AI models that could beat human players in two-player settings. An example is Libratus, an AI developed in 2017 that also used MCCFR. Poker is a great challenge for AI and game theory because it captures the difficulty of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the remaining great milestone in this field: no polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games (or in games with more than two players), and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, called Pluribus, is capable of defeating professional human poker players at six-player Texas hold'em, the most commonly played poker format in the world. The algorithm it uses is not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, it plays a strong strategy that consistently defeats elite human professionals, showing that superhuman play is attainable in a wider class of games even without strong theoretical guarantees on performance.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
Many AIs have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in all finite games and in numerous infinite games; the challenge is to find it. In two-player zero-sum games, the Nash equilibrium strategy is the best possible strategy and is unbeatable, since it guarantees not to lose in expectation regardless of what the opponent does.<br />
<br />
To understand Nash equilibria more deeply we must first define some basic game-theoretic concepts, the first being a strategic game. In game theory, a strategic game consists of a set of players, a set of actions for each player, and, for each player, preferences (or payoffs) over the set of action profiles (combinations of actions). With these three elements we can model a wide variety of situations. A Nash equilibrium is an action profile with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. There are also mixed strategies and mixed-strategy Nash equilibria. A mixed strategy is one where, instead of choosing a single action, a player places a probability distribution over their set of actions and picks randomly. Note that with mixed strategies we must consider the expected payoff of the player given the other players' strategies. A mixed-strategy Nash equilibrium therefore involves at least one player playing a mixed strategy, such that no player can increase their expected payoff by changing their strategy while all other players' strategies remain the same; a pure Nash equilibrium is one in which no player plays a mixed strategy. A single game can have multiple pure and mixed Nash equilibria. Finally, Nash equilibria are purely theoretical and assume players act optimally and rationally, which is not always the case for humans, who can act quite irrationally. Empirically, games can therefore have very unexpected outcomes, and a player may obtain a better payoff by moving away from the strictly theoretical strategy to take advantage of an opponent's irrational behavior. <br />
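The definition above can be checked mechanically: an action profile is a Nash equilibrium exactly when no unilateral deviation raises the deviator's payoff. A small sketch on the Prisoner's Dilemma (the payoff numbers below are the usual textbook choice, not taken from this paper):<br />

```python
from itertools import product

# Payoff table for the Prisoner's Dilemma: payoffs[(a1, a2)] = (u1, u2).
# 'C' = cooperate (stay silent), 'D' = defect (betray).
payoffs = {
    ('C', 'C'): (-1, -1),
    ('C', 'D'): (-3,  0),
    ('D', 'C'): ( 0, -3),
    ('D', 'D'): (-2, -2),
}
actions = ['C', 'D']

def is_nash(profile):
    """True iff neither player can raise their own payoff by deviating alone."""
    for player in (0, 1):
        current = payoffs[profile][player]
        for alt in actions:
            deviated = list(profile)
            deviated[player] = alt
            if payoffs[tuple(deviated)][player] > current:
                return False
    return True

equilibria = [p for p in product(actions, repeat=2) if is_nash(p)]
print(equilibria)  # [('D', 'D')] - mutual defection, though ('C', 'C') pays both more
```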
<br />
The insufficiency of current AI systems is that they only try to achieve a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. At a Nash equilibrium, no player has an incentive to change their strategy, so it is a stable state of the system. Consider Rock-Paper-Scissors: the Nash equilibrium is to pick each option uniformly at random, but against this strategy the opponent's best response only achieves a tie, so our player cannot win in expectation. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift over time to exploit observed weaknesses of our opponent - for example, switching to always playing Rock against an opponent who always plays Scissors. However, shifting away from the equilibrium opens up the possibility of our opponent using our strategy against us: once they notice we always play Rock, they will always play Paper.<br />
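The Rock-Paper-Scissors claims are easy to verify numerically: the uniform mixed strategy earns exactly 0 in expectation against any opponent, while the deviation to "always Rock" beats "always Scissors" but loses to "always Paper". A sketch using the conventional +1/0/-1 payoffs:<br />

```python
# Payoff to player 1 in Rock-Paper-Scissors.
R, P, S = 0, 1, 2
payoff = [[0, -1, 1],    # Rock     vs R, P, S
          [1, 0, -1],    # Paper    vs R, P, S
          [-1, 1, 0]]    # Scissors vs R, P, S

def expected_payoff(p, q):
    """Expected payoff to player 1 when the players mix with distributions p and q."""
    return sum(p[i] * q[j] * payoff[i][j] for i in range(3) for j in range(3))

uniform = [1/3, 1/3, 1/3]
always_rock, always_paper, always_scissors = [1, 0, 0], [0, 1, 0], [0, 0, 1]

# The Nash strategy ties (0 expected payoff) against every opponent:
for q in (always_rock, always_paper, always_scissors):
    assert abs(expected_payoff(uniform, q)) < 1e-12

# Exploiting "always Scissors" with "always Rock" wins every round...
assert expected_payoff(always_rock, always_scissors) == 1
# ...but the deviation is itself exploitable by "always Paper":
assert expected_payoff(always_rock, always_paper) == -1
```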
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players current methods can only find a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples to be competitive outside of small games. Finding a Nash equilibrium with three or more players is a great challenge, and even if we could efficiently compute one, it is highly questionable whether playing that equilibrium strategy would be a good choice. Additionally, if each player tries to find their own version of a Nash equilibrium, there could be infinitely many equilibria, and the combination of each player's independently computed strategy might not itself form a Nash equilibrium.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. We have 4 players, and the goal of each player is to find the spot on the ring farthest from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. In the left circle we have three different Nash equilibria, distinguished by color, each of which would benefit everyone. The right circle illustrates what happens when each player computes their own Nash equilibrium independently.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses forms of abstraction to make computations scalable. To simplify the complexity due to too many decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical. This process is called abstraction. Pluribus uses two kinds of abstraction: Action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bets it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information. For instance, the player’s cards and revealed board cards. This is only used to reason about situations on future betting rounds, never the current betting round.<br />
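Both abstractions can be pictured as simple bucketing functions. The sketch below is only a toy illustration of the idea; the bet-size menu and the equity-based hand feature are invented for the example and are not Pluribus's actual abstraction:<br />

```python
# Action abstraction: consider only a handful of bet sizes, expressed as
# fractions of the pot, instead of every possible chip amount.
BET_FRACTIONS = [0.5, 1.0, 2.0]          # hypothetical menu of bet sizes

def abstract_bets(pot):
    return [round(f * pot) for f in BET_FRACTIONS]

# Information abstraction: group hands whose estimated equity (win
# probability) is similar into the same bucket, so decision points that
# reveal similar information share a single strategy.
def equity_bucket(equity, n_buckets=10):
    return min(int(equity * n_buckets), n_buckets - 1)

print(abstract_bets(pot=100))    # [50, 100, 200]
assert equity_bucket(0.03) == equity_bucket(0.08)   # both "very weak" hands
assert equity_bucket(0.03) != equity_bucket(0.95)   # weak vs near-nut hands
```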
<br />
Pluribus uses a built-in "blueprint strategy", which it gradually improves by searching in real time in the situations it encounters during the course of the game. In the first betting round, where the number of decision points is small, Pluribus plays the initial blueprint strategy directly. The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used for imperfect-information game AIs and is trained by repeatedly playing against copies of itself, without any data from human or prior AI play used as input. For ease of computing CFR in this context, poker is represented as a game tree: a tree structure in which each node represents a player's decision, a chance event, or a terminal outcome, and edges represent the actions taken. <br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker)</div><br />
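A game tree of the kind just described can be represented with a small node structure; this sketch is generic (the field names are my own, not Pluribus's internal representation):<br />

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # 'decision', 'chance', or 'terminal'
    player: int = -1                # acting player at a decision node
    children: dict = field(default_factory=dict)  # action -> child Node
    payoff: float = 0.0             # utility at a terminal node

# Tiny fragment: player 0 may check or bet; betting ends the toy hand.
root = Node('decision', player=0, children={
    'check': Node('decision', player=1),
    'bet':   Node('terminal', payoff=1.0),
})
assert root.children['bet'].kind == 'terminal'
```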
<br />
At the start of each iteration, MCCFR simulates a hand of poker randomly (the cards held by each player at a given time) and designates one player as the traverser of the game tree. The AI then reviews each decision made by the traverser at a decision point in the game and investigates whether the decision was profitable, comparing it with the other actions available to the traverser at that point and with the hypothetical future decisions that would have followed those other actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what it actually received on that iteration. Regret is thus a numeric value, where positive regret indicates you regret your decision, negative regret indicates you are happy with your decision, and zero regret indicates indifference.<br />
<br />
The counterfactual regret of each decision is adjusted over the iterations as more scenarios and decision points are encountered: at the end of each iteration, the traverser's strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy over all iterations converges; that average strategy is the approximate Nash equilibrium. In all finite games, CFR guarantees that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of the initial bad iterations, i.e., it assigns a weight of T to regret contributions at iteration T. This makes the strategy improve more quickly in practice.<br />
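The regret-update loop described above is the core of CFR. Below is a minimal regret-matching self-play loop on Rock-Paper-Scissors, a single-decision game in which counterfactual regret reduces to ordinary regret; the averaged strategies converge toward the uniform Nash equilibrium, and the linear weighting mentioned in the text would amount to scaling iteration t's regret contribution by t. This is an illustrative sketch, not Pluribus's MCCFR implementation:<br />

```python
import random

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # row player's payoff: R, P, S
N = 3

def get_strategy(regrets):
    """Regret matching: mix over actions in proportion to positive cumulative regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / N] * N

def update_regrets(regrets, strategy, action_values):
    """Add this iteration's regret: each action's value minus the value of the mix played."""
    expected = sum(strategy[a] * action_values[a] for a in range(N))
    for a in range(N):
        regrets[a] += action_values[a] - expected

regrets = [[0.0] * N, [0.0] * N]        # one cumulative-regret table per player
strategy_sums = [[0.0] * N, [0.0] * N]
for t in range(100_000):
    strategies = [get_strategy(r) for r in regrets]
    acts = [random.choices(range(N), weights=s)[0] for s in strategies]
    # Each player's regrets are evaluated with the opponent's sampled action held fixed.
    update_regrets(regrets[0], strategies[0], [PAYOFF[a][acts[1]] for a in range(N)])
    update_regrets(regrets[1], strategies[1], [-PAYOFF[acts[0]][a] for a in range(N)])
    for p in range(2):
        for a in range(N):
            strategy_sums[p][a] += strategies[p][a]

# The *average* strategy over all iterations is the equilibrium approximation.
avg = [s / sum(strategy_sums[0]) for s in strategy_sums[0]]
print([round(x, 2) for x in avg])  # each entry ≈ 0.33: the uniform Nash strategy
```

Note that the current strategy keeps cycling (as regret matching chases the latest regrets); only the time-averaged strategy converges, exactly as the text describes for CFR.<br />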
<br />
An additional feature of Pluribus is that in subgames, instead of assuming that all players play according to a single fixed strategy, Pluribus considers that each player may choose among k different continuation strategies, specialized to each player, when a decision point is reached. This pushes the searcher toward a more balanced strategy. For instance, if a player only ever bet while holding the best possible hand (i.e., never bluffed), the opponents would learn that fact and always fold in response to a bet; a balanced strategy mixes bluffs with value bets instead.<br />
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus functions, it was tested against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally and were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format, with players anonymized via aliases that remained consistent throughout all their games. The consistent aliases helped the players keep track of the tendencies and play styles of each player over the 10,000 hands. <br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field, where the big blind is the initial amount of money the second player has to put in the pot. Additionally, AIVAT was used as a variance-reduction technique to control for luck in the games, and one-tailed t-tests were run at the 95% confidence level to check whether Pluribus's win rate was significantly profitable.<br />
<br />
Applying AIVAT the following were the results:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
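The p-values in the table can be roughly recovered from the reported means and standard errors with a one-tailed test of the null hypothesis "true win rate ≤ 0". The sketch below uses a normal approximation for illustration; the authors used AIVAT-adjusted one-tailed t-tests whose exact degrees of freedom are not reported here:<br />

```python
import math

def one_tailed_p(mean_mbb, se_mbb):
    """P(observing this mean or better | true win rate <= 0), normal approximation."""
    z = mean_mbb / se_mbb
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))  # 1 - Phi(z)

print(round(one_tailed_p(48, 25), 3))  # 5H+1AI: ~0.027 (table reports 0.028)
print(round(one_tailed_p(32, 15), 3))  # 1H+5AI: ~0.016 (table reports 0.014)
```

The small remaining gaps are consistent with the difference between a normal approximation and a t-distribution.<br />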
[[File:top.PNG| 950px | x450px |left]]<br />
<br />
<br />
<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div> <br />
<br />
Optimal play by Pluribus looks different from well-known poker conventions. The standard convention of "limping" (calling the 'big blind' rather than folding or raising) was found to be suboptimal by Pluribus: it initially experimented with limping but eliminated it from its strategy over its games of self-play. On the other hand, the convention of "donk betting" (starting a round by betting when someone else ended the previous round with a call), which is typically dismissed by human players, was adopted by Pluribus much more often than humans play it, and proved to be profitable.<br />
<br />
== Discussion and Critiques ==<br />
<br />
Pluribus's blueprint strategy and abstraction methods effectively reduce the computational power required: the blueprint was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games, and is a great example of the researchers condensing the problem to fit currently available computational resources. <br />
<br />
Pluribus shows that observational data and empirical results can be used to construct a superhuman AI without requiring theoretical guarantees; this can serve as a baseline for future AI systems and aid AI research. It would be interesting to apply Pluribus's non-theoretical approach to more real-life problems such as autonomous driving or stock-market trading.<br />
<br />
Extending this idea beyond two-player zero-sum games will have many applications in real life.<br />
<br />
The summary for Superhuman AI for Multiplayer Poker is very well written, with a detailed explanation of the concepts, steps, and results, combined with helpful visuals. However, the experiment in the study does not appear to be well designed: for example, sample selection is neither strict nor well defined, which could introduce selection bias into the results and make them less generalizable.<br />
<br />
Superhuman AI, while sounding superior, is actually not uncommon. There have been many endeavours to master poker, such as Recursive Belief-based Learning (ReBeL) by Facebook Research, which pursued reinforcement learning on partially observable Markov decision processes, inspired by the recent successes of AlphaZero. To demonstrate how effective Pluribus is compared to the state of the art, it should run experiments against ReBeL.<br />
<br />
This is a very interesting topic, and the summary is clear enough for readers to understand. The approach need not apply only to poker; it would be interesting to discuss applications in other areas. There are many famous AIs that are changing our lives, such as AlphaGo and AlphaStar, developed by Google DeepMind, which have defeated professional gamers; discussing these further would be interesting.<br />
<br />
One of the biggest issues when applying AI to games against humans (when not all information is known, i.e., opponents' cards) is that the human players are generally assumed to be rational players who follow a certain set of "rules" based on the information they know. This could be an issue given that Pluribus trained by playing against itself rather than against humans. While the results clearly show that Pluribus has found some kind of 'optimal' way to play, it would be interesting to see whether it could maximize its profits further by learning the tendencies of its human opponents over time (learning on the fly from the information gained in each hand while it is playing).<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized<br />
milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (12 Jul 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (2020, November 17). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero<br />
<br />
Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.<br />
<br />
N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals", Science, vol. 359, no. 6374, pp. 418-424, 2017. Available: 10.1126/science.aao1733 [Accessed 27 November 2020].</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=47371Superhuman AI for Multiplayer Poker2020-11-28T20:27:10Z<p>Y2587wan: /* Theoretical Analysis */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most superhuman AIs could only beat human players in two-player zero-sum games. The most common strategy such AIs use is to approximate a Nash equilibrium: a pair of strategies such that either player switching to any ''other'' strategy (while the opponent's strategy remains unchanged) receives a lower payout. Intuitively this resembles a locally optimal strategy for the players, but (i) a pure-strategy Nash equilibrium is not guaranteed to exist (although a mixed-strategy equilibrium always exists in finite games) and (ii) it may not be the truly optimal outcome - in the "Prisoner's dilemma", for example, the Nash equilibrium of both players betraying each other is worse for both than mutual cooperation.<br />
<br />
More specifically, in the game of poker, we previously only had AI models that could beat human players in two-player settings. An example is Libratus, an AI developed in 2017 that also used MCCFR. Poker is a great challenge for AI and game theory because it captures the difficulty of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the remaining great milestone in this field: no polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games (or in games with more than two players), and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, called Pluribus, is capable of defeating professional human poker players at six-player Texas hold'em, the most commonly played poker format in the world. The algorithm it uses is not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, it plays a strong strategy that consistently defeats elite human professionals, showing that superhuman play is attainable in a wider class of games even without strong theoretical guarantees on performance.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
Many AIs have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in all finite games and in numerous infinite games; the challenge is to find it. In two-player zero-sum games, the Nash equilibrium strategy is the best possible strategy and is unbeatable, since it guarantees not to lose in expectation regardless of what the opponent does.<br />
<br />
To understand Nash equilibria more deeply we must first define some basic game-theoretic concepts, the first being a strategic game. In game theory, a strategic game consists of a set of players, a set of actions for each player, and, for each player, preferences (or payoffs) over the set of action profiles (combinations of actions). With these three elements we can model a wide variety of situations. A Nash equilibrium is an action profile with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. There are also mixed strategies and mixed-strategy Nash equilibria. A mixed strategy is one where, instead of choosing a single action, a player places a probability distribution over their set of actions and picks randomly. Note that with mixed strategies we must consider the expected payoff of the player given the other players' strategies. A mixed-strategy Nash equilibrium therefore involves at least one player playing a mixed strategy, such that no player can increase their expected payoff by changing their strategy while all other players' strategies remain the same; a pure Nash equilibrium is one in which no player plays a mixed strategy. A single game can have multiple pure and mixed Nash equilibria. Finally, Nash equilibria are purely theoretical and assume players act optimally and rationally, which is not always the case for humans, who can act quite irrationally. Empirically, games can therefore have very unexpected outcomes, and a player may obtain a better payoff by moving away from the strictly theoretical strategy to take advantage of an opponent's irrational behavior. <br />
<br />
The insufficiency of current AI systems is that they only try to achieve a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. At a Nash equilibrium, no player has an incentive to change their strategy, so it is a stable state of the system. Consider Rock-Paper-Scissors: the Nash equilibrium is to pick each option uniformly at random, but against this strategy the opponent's best response only achieves a tie, so our player cannot win in expectation. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift over time to exploit observed weaknesses of our opponent - for example, switching to always playing Rock against an opponent who always plays Scissors. However, shifting away from the equilibrium opens up the possibility of our opponent using our strategy against us: once they notice we always play Rock, they will always play Paper.<br />
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players current methods can only find a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples to be competitive outside of small games. Finding a Nash equilibrium with three or more players is a great challenge, and even if we could efficiently compute one, it is highly questionable whether playing that equilibrium strategy would be a good choice. Additionally, if each player tries to find their own version of a Nash equilibrium, there could be infinitely many equilibria, and the combination of each player's independently computed strategy might not itself form a Nash equilibrium.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. We have 4 players, and the goal of each player is to find the spot on the ring farthest from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. In the left circle we have three different Nash equilibria, distinguished by color, each of which would benefit everyone. The right circle illustrates what happens when each player computes their own Nash equilibrium independently.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses forms of abstraction to make computations scalable. To simplify the complexity due to too many decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical. This process is called abstraction. Pluribus uses two kinds of abstraction: Action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bets it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information. For instance, the player’s cards and revealed board cards. This is only used to reason about situations on future betting rounds, never the current betting round.<br />
<br />
Pluribus uses a built-in "blueprint strategy", which it gradually improves by searching in real time in the situations it encounters during the course of the game. In the first betting round, where the number of decision points is small, Pluribus plays the initial blueprint strategy directly. The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used for imperfect-information game AIs and is trained by repeatedly playing against copies of itself, without any data from human or prior AI play used as input. For ease of computing CFR in this context, poker is represented as a game tree: a tree structure in which each node represents a player's decision, a chance event, or a terminal outcome, and edges represent the actions taken. <br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker)</div><br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker randomly (the cards held by each player at a given time) and designates one player as the traverser of the game tree. The AI then reviews each decision made by the traverser at a decision point in the game and investigates whether the decision was profitable, comparing it with the other actions available to the traverser at that point and with the hypothetical future decisions that would have followed those other actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what it actually received on that iteration. Regret is thus a numeric value, where positive regret indicates you regret your decision, negative regret indicates you are happy with your decision, and zero regret indicates indifference.<br />
<br />
The value of counterfactual regret for a decision is adjusted over the iterations as more scenarios or decision points are encountered. This means at the end of each iteration, the traverser’s strategy is updated so actions with higher counterfactual regret is chosen with higher probability. CFR minimizes regret over many iterations until the average strategy overall iterations converges and the average strategy is the approximated Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations i.e it assigns a weight of T to regret contributions at iteration T. This leads to the strategy improving more quickly in practice.<br />
<br />
An additional feature of Pluribus is that in the subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose between k different strategies specialized to each player when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player never bluffs while holding the best possible hand then the opponents would learn that fact and always fold in that scenario. To fold in that scenario is a balanced strategy than to bet.<br />
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus functions, it was tested against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally and were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days with the 5H+1AI format by anonymizing the players by providing each of them with aliases that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and types of games played by each player over the 10,000 hands played. <br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI) . There were 2 more professional players who split another 10,000 hands of poker by playing 5000 hands each and followed the same aliasing process as the first format.<br />
Performance was measured using milli big blinds per game, mbb/game, (i.e. the initial amount of money the second player has to put in the pot) which is the standard measure in the AI field. Additionally, AIVAT was used as the variance reduction technique to control for luck in the games, and significance tests were run at a 95% significance level with one-tailed t-tests as a check for Pluribus’s performance in being profitable.<br />
<br />
Applying AIVAT the following were the results:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
[[File:top.PNG| 950px | x450px |left]]<br />
<br />
<br />
<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div> <br />
<br />
Optimal play in Pluribus looks different from well-known poker conventions: A standard convention of “limping” in poker (calling the 'big blind' rather than folding or raising) is confirmed to be not optimal by Pluribus since it initially experimented with it but eliminated this from its strategy over its games of self-play. On the other hand, another convention of “donk betting” (starting a round by betting when someone else ended the previous round with a call) that is dismissed by players was adopted by Pluribus much more often than played by humans, and is proven to be profitable.<br />
<br />
== Discussion and Critiques ==<br />
<br />
Pluribus' Blueprint strategy and Abstraction methods effectively reduces the computational power required. Hence it was computed in 8 days and required less than 512 GB of memory, and costs about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games. This is a great way the researchers have condensed down the problem to fit the current computational powers. <br />
<br />
Pluribus definitely shows that we can capture observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees, this can be a baseline for future AI inventions and help in the research of AI. It would be interesting to use Pluribus's way of using non-theoretical approach in more real life problems such as autonomous driving or stock market trading.<br />
<br />
Extending this idea beyond two player zero sum games will have many applications in real life.<br />
<br />
The summary for Superhuman AI for Multiplayer Poker is very well written, with a detailed explanation of the concept, steps, and result and with a combination of visual images. However, it seems that the experiment of the study is not well designed. For example: sample selection is not strict and well defined, this could cause selection bias introduced into the result and thus making it not generalizable.<br />
<br />
Superhuman AI, while sounding superior, is actually not uncommon. There has been many endeavours on mastering poker such as the Recursive Belief-based Learning (ReBeL) by Facebook Research. They pursued a method of reinforcement learning on partially observable Markov decision process which was inspired by the recent successes of AlphaZero. For Pluribus to demonstrate how effective it is compared to the state-of-the-art, it should run some experiments against ReBeL.<br />
<br />
This is a very interesting topic, and this summary is clear enough for readers to understand. I think this application not only can apply in poker, maybe thinking more applications in other area? There are many famous AI that really changing our life. For example, AlphaGo and AlphaStar, which are developed by Google DeepMind, defeated professional gamers. Discussing more this will be interesting.<br />
<br />
One of the biggest issues when applying AI to games against humans (when not all information is known, ie, opponents cards) is the assumption is generally made that the human players are rational players which follow a certain set of "rules" based on the information that they know. This could be an issue with the fact that Pluribus has trained itself by playing itself instead of humans. While the results clearly show that Pluribus has found some kind of 'optimal' method to play, it would be interesting to see if it could actually maximize it's profits by learning the trends of its human opponents over time (learning on the fly with information gained each hand while it's playing).<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized<br />
milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (12 Jul 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (2020, November 17). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero<br />
<br />
Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.<br />
<br />
N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals", Science, vol. 359, no. 6374, pp. 418-424, 2017. Available: 10.1126/science.aao1733 [Accessed 27 November 2020].</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=47370Superhuman AI for Multiplayer Poker2020-11-28T20:26:43Z<p>Y2587wan: /* Nash Equilibrium in Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most superhuman game-playing AIs could only beat human players in two-player zero-sum games. The most common strategy these AIs use is to approximate a Nash equilibrium. A Nash equilibrium is a pair of strategies such that either single player switching to any ''other'' choice of strategy (while the other player's strategy remains unchanged) will result in a lower payout for the switching player. Intuitively, this is similar to a locally optimal strategy for the players, but it is (i) not guaranteed to exist in pure strategies and (ii) may not be the truly optimal outcome (for example, in the Prisoner's Dilemma the Nash equilibrium of both players betraying each other leaves both worse off than mutual cooperation).<br />
<br />
More specifically, in the game of poker, AI models have so far only been able to beat human players in two-player settings. An example is Libratus, an AI developed in 2017 that also used MCCFR. Poker is a great challenge for AI and game theory because it captures the challenges of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the remaining great milestone in this field: no polynomial-time algorithm is known that can find a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, which we call Pluribus, is capable of defeating professional human poker players at six-player Texas hold'em, the most commonly played poker format in the world. The algorithm used is not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, it produces a strong strategy that consistently defeats elite human professionals, showing that superhuman strategies can be achieved in a wider class of games despite the lack of strong theoretical guarantees on performance.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
Many AIs have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in all finite games and numerous infinite games; the challenge is to find it. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable in the sense that it guarantees not to lose in expectation, regardless of what the opponent does.<br />
<br />
To understand Nash equilibria more deeply, we must first define some basic game-theory concepts. A strategic game consists of a set of players, a set of actions for each player, and, for each player, preferences (or payoffs) over the set of action profiles (combinations of actions). With these three elements we can model a wide variety of situations. A Nash equilibrium is an action profile with the property that no player can do better by changing their action, given that all other players' actions remain the same; a common illustration is the Prisoner's Dilemma. There are also mixed strategies and mixed-strategy Nash equilibria. A mixed strategy is one in which, instead of choosing a single action, a player assigns a probability distribution over their set of actions and picks randomly. With mixed strategies we must consider each player's expected payoff given the other players' strategies: a mixed-strategy Nash equilibrium involves at least one player playing a mixed strategy, with no player able to increase their expected payoff by changing their strategy while all others keep theirs fixed. A pure Nash equilibrium is one in which no player plays a mixed strategy, and a single game can have multiple pure and mixed Nash equilibria. Finally, Nash equilibria are purely theoretical and depend on players acting optimally and rationally; humans often act irrationally, so in practice games can have very unexpected outcomes, and a player may obtain a better payoff by moving away from a strictly theoretical strategy and exploiting an opponent's irrational behavior.<br />
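The equilibrium definition above can be checked mechanically. The following sketch (illustrative only; the payoff numbers are a standard textbook choice, not taken from the paper) enumerates the action profiles of the Prisoner's Dilemma and verifies that mutual betrayal is its only pure Nash equilibrium:<br />

```python
from itertools import product

# Payoffs (negative years in prison) for (row, col) actions.
# Betraying is always individually better, yet (betray, betray)
# is worse for both than (stay_silent, stay_silent).
ACTIONS = ["stay_silent", "betray"]
PAYOFFS = {  # (row_action, col_action) -> (row_payoff, col_payoff)
    ("stay_silent", "stay_silent"): (-1, -1),
    ("stay_silent", "betray"):      (-3,  0),
    ("betray",      "stay_silent"): ( 0, -3),
    ("betray",      "betray"):      (-2, -2),
}

def is_pure_nash(row_a, col_a):
    """True iff no unilateral deviation strictly improves a player's payoff."""
    row_pay, col_pay = PAYOFFS[(row_a, col_a)]
    best_row = max(PAYOFFS[(a, col_a)][0] for a in ACTIONS)
    best_col = max(PAYOFFS[(row_a, a)][1] for a in ACTIONS)
    return row_pay == best_row and col_pay == best_col

equilibria = [p for p in product(ACTIONS, ACTIONS) if is_pure_nash(*p)]
print(equilibria)  # [('betray', 'betray')]
```

Note how the equilibrium is stable (neither prisoner gains by deviating alone) yet not the best joint outcome, which is exactly the caveat raised in the introduction.<br />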
<br />
The insufficiency of current AI systems is that they only try to achieve a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. At a Nash equilibrium, no player has an incentive to change their strategy, so it is a stable state of the system. Consider the game of Rock-Paper-Scissors: the Nash equilibrium is to pick each option uniformly at random, but against this strategy the best an opponent can do is tie in expectation, so our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift our strategy over time to exploit observed weaknesses of our opponent, for example switching to always playing Rock against an opponent who always plays Scissors. However, by shifting away from the Nash equilibrium strategy, we open up the possibility of our opponent using our strategy against us: they notice we always play Rock, and thus they will now always play Paper.<br />
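This trade-off can be made concrete in a few lines of code (illustrative; payoffs are +1/0/-1 for win/tie/loss):<br />

```python
# Rock-Paper-Scissors expected payoff for player 1, given mixed strategies
# as probability vectors over (Rock, Paper, Scissors).
# PAYOFF[i][j] = payoff to player 1 when player 1 plays i and player 2 plays j.
PAYOFF = [
    [ 0, -1,  1],   # Rock     vs (Rock, Paper, Scissors)
    [ 1,  0, -1],   # Paper
    [-1,  1,  0],   # Scissors
]

def expected_payoff(p1, p2):
    return sum(p1[i] * p2[j] * PAYOFF[i][j] for i in range(3) for j in range(3))

uniform = [1/3, 1/3, 1/3]
all_scissors = [0.0, 0.0, 1.0]

print(expected_payoff(uniform, all_scissors))   # 0.0 -- equilibrium play never loses, never wins
print(expected_payoff([1, 0, 0], all_scissors)) # 1.0 -- always-Rock exploits always-Scissors
print(expected_payoff([1, 0, 0], [0, 1, 0]))    # -1.0 -- but is itself exploited by always-Paper
```

The uniform strategy is safe but winless; the exploitative strategy wins until the opponent adapts, which is exactly the dilemma described above.<br />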
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players current methods can only handle a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples to be competitive outside of small games. Finding a Nash equilibrium with three or more players is a great challenge, and even if we could efficiently compute one, it is highly questionable whether playing that equilibrium strategy would be a good choice. Moreover, if each player tries to find their own version of a Nash equilibrium, there could be infinitely many candidate strategies, and the combination of each player's chosen equilibrium strategies might not itself be a Nash equilibrium.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are four players, and each player's goal is to find a spot on the ring as far away from every other player as possible, so that each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by color, any of which would benefit everyone. The right circle illustrates what happens when each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player computes their own Nash equilibrium independently, the combined profile may not be a Nash equilibrium at all, so the players do not end up at the best possible locations. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games; instead of targeting a specific game-theoretic solution, the goal should be approaches that, by observation and empirical results, consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses forms of abstraction to make computations scalable. To manage the enormous number of decision points, some actions are eliminated from consideration, and similar decision points are grouped together and treated as identical; this process is called abstraction. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider: for instance, it does not consider all possible bet sizes (the exact number of bet sizes it considers varies between 1 and 14, depending on the situation). Information abstraction groups together decision points that reveal similar information, such as the player's cards and the revealed board cards; it is used only to reason about situations on future betting rounds, never the current betting round.<br />
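A minimal sketch of action abstraction is shown below. The bucket boundaries are invented for illustration — Pluribus's actual abstraction is far richer — but the idea is the same: an arbitrary opponent bet is mapped onto a small discrete set of bet sizes so the game tree stays tractable:<br />

```python
# Map an arbitrary concrete bet to the nearest size in a small abstract
# action set, expressed as fractions of the pot. These fractions are
# illustrative placeholders, not Pluribus's actual values.
ABSTRACT_BET_FRACTIONS = [0.5, 1.0, 2.0]  # half-pot, pot, 2x pot

def abstract_bet(bet, pot):
    """Round a concrete bet to the nearest abstract action (fraction of pot)."""
    fraction = bet / pot
    return min(ABSTRACT_BET_FRACTIONS, key=lambda f: abs(f - fraction))

print(abstract_bet(60, 100))   # 0.5  (a 60%-pot bet is treated as a half-pot bet)
print(abstract_bet(180, 100))  # 2.0
```

Decision points that map to the same abstract action sequence can then share a single strategy, which is what makes the blueprint computation feasible.<br />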
<br />
Pluribus starts from a precomputed built-in "blueprint strategy", which it gradually improves by searching in real time in the situations it encounters during the course of the game. In the first betting round, where the number of decision points is small, Pluribus plays according to the blueprint strategy directly. The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used for imperfect-information game AIs: the AI is trained by repeatedly playing against copies of itself, with no data from human or prior AI play used as input. For ease of computing CFR in this context, poker is represented as a game tree: a tree structure in which each node represents a player's decision, a chance event, or a terminal outcome, and edges represent actions taken.<br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker) </div><br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker at random (a hand being the cards held by a player at a given time) and designates one player as the traverser of the game tree. Once that is done, the AI reviews each decision made by the traverser at a decision point in the game and investigates whether the decision was profitable, comparing it with the other actions available to the traverser at that point and with the hypothetical future decisions that would have followed those other actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what it actually received on that iteration. Regret is thus a numeric value, where positive regret indicates you regret your decision, negative regret indicates you are happy with your decision, and zero regret indicates you are indifferent.<br />
<br />
The counterfactual regret of a decision is adjusted over the iterations as more scenarios and decision points are encountered. At the end of each iteration, the traverser's strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy over all iterations converges; this average strategy is the approximate Nash equilibrium. CFR guarantees that in all finite games all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations, i.e. it assigns a weight of T to regret contributions at iteration T, which makes the strategy improve more quickly in practice.<br />
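The core regret-matching update that CFR builds on can be sketched in a few lines. This toy version (not Pluribus's MCCFR, which adds tree traversal, abstraction, and linear weighting) runs regret matching in self-play on Rock-Paper-Scissors; the average strategy drifts toward the uniform Nash equilibrium:<br />

```python
import random

# Rock-Paper-Scissors payoff for the acting player: PAYOFF[mine][theirs].
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
N = 3

def strategy_from_regrets(regrets):
    # Regret matching: play each action in proportion to its positive
    # cumulative regret; if no regret is positive, play uniformly.
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / N] * N

random.seed(0)
regrets = [[0.0] * N, [0.0] * N]       # cumulative regrets per player
strategy_sum = [[0.0] * N, [0.0] * N]  # accumulated strategies, for averaging

for _ in range(20000):
    strats = [strategy_from_regrets(regrets[p]) for p in (0, 1)]
    actions = [random.choices(range(N), weights=s)[0] for s in strats]
    for p in (0, 1):
        opp = actions[1 - p]
        realized = PAYOFF[actions[p]][opp]
        for a in range(N):
            # Regret of not having played action a instead of the sampled one.
            regrets[p][a] += PAYOFF[a][opp] - realized
        for a in range(N):
            strategy_sum[p][a] += strats[p][a]

avg = [s / sum(strategy_sum[0]) for s in strategy_sum[0]]
print([round(x, 2) for x in avg])  # each component is close to 1/3
```

The current strategy can oscillate from iteration to iteration; it is the average over all iterations that approximates the equilibrium, which is why CFR outputs the average strategy.<br />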
<br />
An additional feature of Pluribus is that in subgames, instead of assuming that all players play according to a single fixed strategy, Pluribus considers that each player may choose among k different continuation strategies, specialized to each player, when a decision point is reached. This leads the searcher to choose a more balanced strategy: for instance, if a player only ever bet while holding the best possible hand, opponents would learn that fact and always fold in response. Mixing actions in such scenarios is a more balanced, less exploitable strategy than always betting or always folding.<br />
In summary, the blueprint strategy is produced offline for the entire game, and it is gradually improved by real-time search while making decisions during the game.<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus functions, it was played against human players in two formats. The first format included five human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally and were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. Players were anonymized with aliases that remained consistent throughout all their games; the aliases let players track the tendencies and playing styles of each participant over the 10,000 hands.<br />
<br />
The second format included one human player and five copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field, where the big blind is the initial amount of money the second player has to put into the pot. Additionally, AIVAT was used as a variance-reduction technique to control for luck in the games, and one-tailed t-tests at the 5% significance level were run to check whether Pluribus was profitable.<br />
<br />
Applying AIVAT, the results were as follows:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
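As a rough sanity check, the reported p-values are approximately what a one-tailed test of "mean win rate &gt; 0" gives under a normal approximation (the paper's exact test statistics may differ slightly):<br />

```python
from statistics import NormalDist

def one_tailed_p(mean_mbb, std_err):
    """Approximate p-value for H0: true win rate <= 0 (normal approximation)."""
    z = mean_mbb / std_err
    return 1 - NormalDist().cdf(z)

print(round(one_tailed_p(48, 25), 3))  # ~0.027 for 5H+1AI (table reports 0.028)
print(round(one_tailed_p(32, 15), 3))  # ~0.016 for 1H+5AI (table reports 0.014)
```
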
[[File:top.PNG| 950px | x450px |left]]<br />
<br />
<br />
<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div> <br />
<br />
Optimal play by Pluribus looks different from well-known poker conventions. The standard advice against "limping" (calling the big blind rather than folding or raising) was confirmed as suboptimal: Pluribus initially experimented with limping but eliminated it from its strategy over its games of self-play. On the other hand, "donk betting" (starting a round by betting when someone else ended the previous round with a call), a play conventionally dismissed by human players, was adopted by Pluribus far more often than humans play it, and proved profitable.<br />
<br />
== Discussion and Critiques ==<br />
<br />
Pluribus's blueprint strategy and abstraction methods effectively reduce the computational power required: the blueprint was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to the other recent superhuman AI milestones for games, and shows how well the researchers condensed the problem to fit currently available computational power.<br />
<br />
Pluribus shows that observational data and empirical results can be used to construct a superhuman AI without requiring theoretical guarantees; this can serve as a baseline for future AI inventions and help AI research more broadly. It would be interesting to apply Pluribus's non-theoretical approach to more real-life problems, such as autonomous driving or stock-market trading.<br />
<br />
Extending this idea beyond two-player zero-sum games will have many applications in real life.<br />
<br />
The summary of Superhuman AI for Multiplayer Poker is very well written, with detailed explanations of the concepts, steps, and results, combined with helpful visuals. However, the study's experiment does not seem well designed: for example, the sample selection is neither strict nor well defined, which could introduce selection bias into the results and make them less generalizable.<br />
<br />
Superhuman AI, while it sounds superior, is actually not uncommon. There have been many endeavours at mastering poker, such as Recursive Belief-based Learning (ReBeL) by Facebook Research, which pursued reinforcement learning on partially observable Markov decision processes, inspired by the recent successes of AlphaZero. For Pluribus to demonstrate how effective it is compared to the state of the art, it should be run in experiments against ReBeL.<br />
<br />
This is a very interesting topic, and the summary is clear enough for readers to understand. The approach need not apply only to poker; it would be interesting to discuss potential applications in other areas. Several famous AIs have already changed our lives: for example, AlphaGo and AlphaStar, developed by Google DeepMind, have defeated professional players.<br />
<br />
One of the biggest issues when applying AI to games against humans with hidden information (i.e., opponents' cards) is the assumption, generally made, that the human players are rational and follow a certain set of "rules" based on the information they know. This could be an issue given that Pluribus trained by playing against itself rather than against humans. While the results clearly show that Pluribus has found some kind of 'optimal' way to play, it would be interesting to see whether it could further maximize its profits by learning the trends of its human opponents over time (learning on the fly with the information gained from each hand as it plays).<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus's strategy was developed without any human data and trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained. Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker. Pluribus's success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect-information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (12 Jul 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (2020, November 17). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/<br />
<br />
Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.<br />
<br />
N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals", Science, vol. 359, no. 6374, pp. 418-424, 2017. Available: 10.1126/science.aao1733 [Accessed 27 November 2020].</div>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. In the past two decades, most of the superhuman AI that was built can only beat human players in two-player zero-sum games. The most common strategy that the AI uses to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is a pair of strategies such that either single-player switching to any ''other'' choice of strategy (while the other player's strategy remains unchanged) will result in a lower payout for the switching player. Intuitively this is similar to a locally optimal strategy for the players but is (i) not guaranteed to exist and (ii) may not be the truly optimal strategy (for example, in the "Prisoner's dilemma" the Nash equilibrium of both players betraying each other is not the optimal strategy).<br />
<br />
More specifically, in the game of poker, we only have AI models that can beat human players in two-player settings. An example of an AI model that can successfully beat two players in poker is Libratus, which is an AI developed in 2017 that also used MCCFR. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field, because there is no polynomial-time algorithm that can find a Nash equilibrium in two-player non-zero-sum games, and having one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI which we call Pluribus is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world. The algorithm that is used is not guaranteed to converge to a Nash algorithm outside of two-player zero-sum games. However, it uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that despite not having strong theoretical guarantees on performance, they are capable of applying a wider class of superhuman strategies.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
Many AIs have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in all finite games and numerous infinite games; the challenge is to find it. In two-player zero-sum games, playing a Nash equilibrium strategy is unbeatable, since it guarantees not losing in expectation regardless of what the opponent does.<br />
<br />
To understand Nash equilibria more deeply we must first define some basic game-theoretic concepts. A strategic game consists of a set of players, a set of actions for each player, and, for each player, preferences (or payoffs) over the set of action profiles (combinations of actions). With these three elements we can model a wide variety of situations. A Nash equilibrium is an action profile with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. We also have mixed strategies and mixed-strategy Nash equilibria. A mixed strategy is one in which, instead of choosing a single action, a player places a probability distribution over their set of actions and picks randomly; with mixed strategies we must consider a player's expected payoff given the other players' strategies. A mixed-strategy Nash equilibrium therefore involves at least one player playing a mixed strategy, where no player can increase their expected payoff by changing their strategy while all other players' strategies remain the same. A pure Nash equilibrium is one in which no player plays a mixed strategy. A single game can have multiple pure and mixed Nash equilibria. Finally, Nash equilibria are purely theoretical and assume players act optimally and rationally; this is not always the case with humans, who can act very irrationally. Empirically, games can therefore have very unexpected outcomes, and a player may obtain a better payoff by moving away from a strictly theoretical strategy and taking advantage of an opponent's irrational behavior. <br />
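These definitions can be made concrete with a brute-force check for pure Nash equilibria in a small strategic game. The sketch below uses the classic Prisoner's Dilemma; the specific payoff values (prison years as negative payoffs) are our own illustrative choices, not taken from the paper:<br />

```python
# Brute-force check of pure Nash equilibria in a two-player strategic game,
# using the Prisoner's Dilemma. Payoffs are (row player, column player);
# C = cooperate (stay silent), D = defect (betray). Values are illustrative.
payoffs = {
    ("C", "C"): (-1, -1),
    ("C", "D"): (-3, 0),
    ("D", "C"): (0, -3),
    ("D", "D"): (-2, -2),
}
actions = ["C", "D"]

def is_pure_nash(a_row, a_col):
    """No player can improve by unilaterally deviating."""
    u_row, u_col = payoffs[(a_row, a_col)]
    row_ok = all(payoffs[(d, a_col)][0] <= u_row for d in actions)
    col_ok = all(payoffs[(a_row, d)][1] <= u_col for d in actions)
    return row_ok and col_ok

equilibria = [(r, c) for r in actions for c in actions if is_pure_nash(r, c)]
# Mutual defection is the unique pure Nash equilibrium here, even though
# mutual cooperation gives both players a strictly higher payoff.
```

This makes the point in the text explicit: the equilibrium outcome (D, D) is stable against unilateral deviation, yet it is not the best joint outcome.<br />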
<br />
A shortcoming of current AI systems is that they only try to reach a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. At a Nash equilibrium, no player has an incentive to change their initial strategy, so it is a stable state of the system. For example, in Rock-Paper-Scissors the Nash equilibrium is to pick each option uniformly at random; against this strategy, the best the opponent can achieve is a tie, so our player cannot win in expectation. Now consider combining the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent, for example switching to always play Rock against an opponent who always plays Scissors. However, by shifting away from the Nash equilibrium strategy, we open up the possibility that our opponent will exploit us in turn: once they notice we always play Rock, they will always play Paper.<br />
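The Rock-Paper-Scissors reasoning above can be verified numerically. In this small sketch, the payoff matrix and strategies follow the usual +1/0/−1 textbook convention and are our own illustration, not code from the paper:<br />

```python
# payoff[i][j] = player 1's payoff when player 1 plays i and player 2 plays j
# (0 = Rock, 1 = Paper, 2 = Scissors).
payoff = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def expected_payoff(p1, p2):
    # Expected payoff to player 1 when both players use mixed strategies.
    return sum(p1[i] * p2[j] * payoff[i][j]
               for i in range(3) for j in range(3))

uniform = [1 / 3, 1 / 3, 1 / 3]        # the Nash equilibrium strategy
always_scissors = [0.0, 0.0, 1.0]
always_rock = [1.0, 0.0, 0.0]

# The equilibrium strategy never loses in expectation, but it never wins
# either; the exploitative strategy (always Rock) wins outright against
# always Scissors -- until the opponent adapts by always playing Paper.
```
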
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players, existing algorithms succeed only in a handful of special cases. Current techniques for learning to exploit an opponent require far too many samples and are not competitive outside of small games. Finding a Nash equilibrium with three or more players is a great challenge. Even if we could efficiently compute a Nash equilibrium in games with more than two players, it is still highly questionable whether playing that equilibrium strategy is a good choice. Additionally, if each player independently computes their own version of a Nash equilibrium, there can be many distinct equilibria, and the combination of the players' strategies might not form a Nash equilibrium at all.<br />
<br />
Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colors which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses forms of abstraction to make computations scalable. To tame the enormous number of decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical; this process is called abstraction. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider; for instance, it does not consider all bet sizes (the exact number of bet sizes it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information, such as similar combinations of the player's cards and the revealed board cards; it is only used to reason about situations in future betting rounds, never the current betting round.<br />
<br />
Pluribus uses a precomputed "blueprint strategy", which it gradually improves by searching in real time in the situations it encounters during the course of the game. In the first betting round, where the number of decision points is small, Pluribus uses the blueprint strategy directly. The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in AI for imperfect-information games and is trained by repeatedly playing against copies of itself, without any data from human or prior AI play used as input. For ease of computation, poker is represented as a game tree: a tree structure in which each node represents a player's decision, a chance event, or a terminal outcome, and edges represent actions taken. <br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simplified form of poker) </div><br />
<br />
At the start of each iteration, MCCFR simulates a random hand of poker (the cards held by the players) and designates one player as the traverser of the game tree. The AI then reviews each decision made by the traverser at a decision point in the game and investigates whether the decision was profitable, comparing it with the other actions available to the traverser at that point and with the hypothetical future decisions that would have followed those other actions. To evaluate a decision, the counterfactual regret is used: the difference between what the traverser would have expected to receive for choosing an alternative action and what was actually received on that iteration. Regret is thus a numeric value, where a positive regret indicates you regret your decision, a negative regret indicates you are happy with your decision, and zero regret indicates indifference.<br />
<br />
The counterfactual regret of each decision is adjusted over the iterations as more scenarios or decision points are encountered. At the end of each iteration, the traverser's strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy over all iterations converges; that average strategy is the approximate Nash equilibrium. CFR guarantees, in all finite games, that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations, i.e., it assigns a weight of T to regret contributions at iteration T. This leads to the strategy improving more quickly in practice.<br />
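To make the regret-updating loop concrete, here is a minimal regret-matching sketch on Rock-Paper-Scissors — the tabular core of CFR, not Pluribus's full MCCFR. All names and parameters are our illustrative assumptions:<br />

```python
import random

# PAYOFF[a][o] is the payoff of playing action a against opponent action o
# (0 = Rock, 1 = Paper, 2 = Scissors). RPS is symmetric, so the same matrix
# serves both players.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def get_strategy(regret_sum):
    # Regret matching: play each action in proportion to its positive regret.
    positive = [max(r, 0.0) for r in regret_sum]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / 3] * 3

def train(iterations, seed=0):
    rng = random.Random(seed)
    regret = [[0.0] * 3 for _ in range(2)]
    strat_sum = [[0.0] * 3 for _ in range(2)]
    for _ in range(iterations):
        strats = [get_strategy(r) for r in regret]
        acts = [rng.choices(range(3), weights=s)[0] for s in strats]
        for p in range(2):
            opp = acts[1 - p]
            for a in range(3):
                # Regret: payoff of deviating to action a, minus the payoff
                # actually received on this iteration.
                regret[p][a] += PAYOFF[a][opp] - PAYOFF[acts[p]][opp]
            for a in range(3):
                strat_sum[p][a] += strats[p][a]
    # The *average* strategy over all iterations approximates the equilibrium.
    return [[s / iterations for s in strat_sum[p]] for p in range(2)]

avg = train(100_000)
```

Over many iterations, both players' average strategies approach the uniform Nash equilibrium of Rock-Paper-Scissors, even though the per-iteration strategies may oscillate wildly.<br />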
<br />
An additional feature of Pluribus is that in subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose among k different strategies, specialized to each player, when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player only ever bets while holding the best possible hand, the opponents will learn that fact and always fold in response; a more balanced strategy, which also bets with some weaker hands, avoids being exploited in this way.<br />
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus functions, it was tested against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally and were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. Players were anonymized with aliases that remained consistent throughout all their games; the aliases let the players keep track of the tendencies and playing styles of each opponent over the 10,000 hands without revealing identities. <br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, following the same aliasing process as the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field; one milli big blind is one-thousandth of the big blind, the forced bet that the second player posts at the start of a hand. Additionally, AIVAT was used as a variance-reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were run to check whether Pluribus was profitable.<br />
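As a small illustration of the measure (the dollar figures below are made up, not from the experiment), converting winnings into mbb/game looks like this:<br />

```python
# Sketch: converting poker winnings into milli big blinds per game (mbb/game).
# One mbb is 1/1000 of the big blind, so winning 1 big blind per 1000 hands
# corresponds to 1 mbb/game. The stakes and totals here are invented numbers.
def mbb_per_game(total_won, big_blind, num_hands):
    return (total_won / big_blind) * 1000 / num_hands

rate = mbb_per_game(4800, 100, 1000)  # winning $4,800 at a $100 big blind
                                      # over 1,000 hands is 48 mbb/game
```
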
<br />
Applying AIVAT the following were the results:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
[[File:top.PNG| 950px | x450px |left]]<br />
<br />
<br />
<div align="center">"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div> <br />
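As a rough sanity check on the reported results, a one-tailed normal approximation applied to the reported mean and standard error approximately reproduces the reported p-values (the paper's exact AIVAT-based test may differ slightly, so this is only a back-of-the-envelope reproduction):<br />

```python
import math

# One-tailed test of "win rate > 0" from a mean and standard error,
# using a normal approximation: p = P(Z > mean/stderr) for Z ~ N(0, 1).
def one_tailed_p(mean_mbb, stderr_mbb):
    z = mean_mbb / stderr_mbb
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

p_5h1ai = one_tailed_p(48, 25)  # roughly 0.027, close to the reported 0.028
p_1h5ai = one_tailed_p(32, 15)  # also small, in line with the reported 0.014
```
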
<br />
Optimal play by Pluribus looks different from well-known poker conventions. The standard convention of "limping" (calling the 'big blind' rather than folding or raising) was confirmed to be suboptimal: Pluribus initially experimented with it but eliminated it from its strategy over its games of self-play. On the other hand, "donk betting" (starting a round by betting when someone else ended the previous round with a call), a play dismissed by human players, was adopted by Pluribus much more often than humans play it, and proved profitable.<br />
<br />
== Discussion and Critiques ==<br />
<br />
Pluribus' blueprint strategy and abstraction methods effectively reduce the computational power required: the blueprint was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games. The researchers did well to condense the problem to fit within currently practical computational budgets. <br />
<br />
Pluribus definitely shows that we can use observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees; this can serve as a baseline for future AI work and help AI research. It would be interesting to apply Pluribus's non-theoretical approach to more real-life problems such as autonomous driving or stock-market trading.<br />
<br />
Extending this idea beyond two player zero sum games will have many applications in real life.<br />
<br />
The summary of Superhuman AI for Multiplayer Poker is very well written, with detailed explanations of the concepts, steps, and results, supported by visual images. However, the experiment in the study does not seem well designed: for example, sample selection was not strictly defined, which could introduce selection bias into the results and limit their generalizability.<br />
<br />
Superhuman AI, while sounding superior, is actually not uncommon. There have been many endeavours to master poker, such as Recursive Belief-based Learning (ReBeL) by Facebook Research, which pursued reinforcement learning on partially observable Markov decision processes, inspired by the recent successes of AlphaZero. To demonstrate how effective Pluribus is compared to the state of the art, it should run experiments against ReBeL.<br />
<br />
This is a very interesting topic, and the summary is clear enough for readers to understand. The approach need not apply only to poker; it would be interesting to discuss applications in other areas. There are many famous AIs that are really changing our lives: for example, AlphaGo and AlphaStar, developed by Google DeepMind, have defeated professional players.<br />
<br />
One of the biggest issues when applying AI to games against humans (when not all information is known, i.e., opponents' cards) is the assumption generally made that the human players are rational and follow a certain set of "rules" based on the information they know. This could be an issue given that Pluribus trained by playing against itself rather than against humans. While the results clearly show that Pluribus has found some kind of 'optimal' method of play, it would be interesting to see whether it could maximize its profits further by learning the trends of its human opponents over time (learning on the fly with information gained each hand while it is playing).<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker.
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (12 Jul 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (2020, November 17). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero<br />
<br />
Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.<br />
<br />
N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals", Science, vol. 359, no. 6374, pp. 418-424, 2017. Available: 10.1126/science.aao1733 [Accessed 27 November 2020].</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence&diff=47368Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence2020-11-28T20:22:53Z<p>Y2587wan: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Guanting(Tony) Pan, Zaiwei(Zara) Zhang, Haocheng(Beren) Chang<br />
<br />
== Introduction == <br />
With the development of mobile devices and location-acquisition technologies, accessing real-time location information has become easier and more efficient. Because of this development, Location-Based Social Networks (LBSNs) have become an important part of people's lives. People can share their experiences at locations, such as restaurants and parks, on the Internet. These locations can be treated as Points-of-Interest (POIs) in software such as the Maps apps on our phones. The large amounts of user-POI interaction data enable a service called personalized POI recommendation, which suggests to users the locations they might be interested in. These data can be used to train a model through machine learning methods (e.g., classification, clustering) to predict POIs that users might like. POI recommendation still faces some challenging issues: (1) the difficulty of modeling complex user-POI interactions from sparse implicit feedback; and (2) the difficulty of incorporating geographical context information. To meet these challenges, this paper introduces a novel autoencoder-based model to learn non-linear user-POI relations, called SAE-NAD, where SAE stands for self-attentive encoder and NAD stands for neighbor-aware decoder. An autoencoder is an unsupervised learning technique implemented as a neural network for representation learning: the network contains a "bottleneck" layer that produces a compressed representation of the original input. This method builds on machine learning knowledge that we learned in this course.<br />
<br />
== Previous Work == <br />
<br />
Previous methods treat a user's checked-in POIs equally. The drawback of treating checked-in POIs equally is that valuable information about user preferences is not utilized, thus reducing the power of such recommenders. In contrast, the SAE adaptively differentiates the user's preference degrees across multiple aspects.<br />
<br />
There are other personalized POI recommendation methods. Some well-known services (e.g., Netflix) use model-based methods built on matrix factorization (MF). For example, the Geographical Factorization Method in [1] adopted weighted regularized MF for POI recommendation. Machine learning is popular in this area, and POI recommendation is an important topic in the domain of recommender systems [4]. This paper also describes related work on personalized location recommendation and attention mechanisms in recommendation.<br />
<br />
== Motivation == <br />
This paper reviews the encoder and decoder. A single hidden-layer autoencoder is an unsupervised neural network consisting of two parts: an encoder and a decoder. The encoder applies an activation function that maps the input data to the latent space; the decoder applies an activation function that maps the latent representation to the reconstruction space. The formula is:<br />
<br />
[[File: formula.png|center]](Note: a is the activation function)<br />
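As a generic sketch of this formulation (using our own notation, which may differ from the figure), the encoder and decoder of a single hidden-layer autoencoder can be written as:<br />

<math>\mathbf{h} = a\left(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right), \qquad \hat{\mathbf{x}} = a\left(W^{(2)}\mathbf{h} + \mathbf{b}^{(2)}\right)</math>

where <math>a(\cdot)</math> is the activation function, <math>\mathbf{h}</math> is the latent representation, and the reconstruction <math>\hat{\mathbf{x}}</math> is trained to be close to the input <math>\mathbf{x}</math>.<br />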
<br />
The proposed method uses a two-layer neural network to compute the score matrix in the architecture of the SAE. The NAD adopts the RBF kernel so that checked-in POIs exert more influence on nearby unvisited POIs. The model's parameters are then learned by network training.<br />
<br />
This paper uses real-world datasets from Gowalla [2], Foursquare [3], and Yelp [3]. These datasets were used to train the method introduced in this paper and to compare the performance of SAE-NAD with other POI recommendation methods. Three groups of baselines are compared against the proposed method: traditional MF methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. In particular, the deep learning-based methods include DeepAE, a three-hidden-layer autoencoder with a weighted loss function, which we can connect to the material in this course.<br />
<br />
== Methodology == <br />
<br />
=== Notations ===<br />
<br />
Here are the notations used in this paper. It will be helpful when trying to understand the structure and equations in the algorithm.<br />
[[File:notations.JPG|500px|x300px|center]]<br />
<br />
=== Structure ===<br />
<br />
The structure of the network in this paper includes a self-attentive encoder as the input layer(yellow), and a neighbor-aware decoder as the output layer(green).<br />
<br />
[[File:1.JPG|1200px|x600px]]<br />
<br />
=== Self-Attentive Encoder ===<br />
<br />
The self-attentive encoder is the input layer. It transforms the preference vector x_u into the hidden representation A_u using the weight matrix W^1 and the softmax and tanh activation functions. The 0's and 1's in x_u indicate whether the user has been to a certain POI. The weight matrix W_a assigns different weights to various features of the POIs.<br />
<br />
[[File:encoder.JPG|center]]<br />
<br />
=== Neighbor-Aware Decoder ===<br />
<br />
POI recommendation exploits the geographical clustering phenomenon by increasing the weight of unvisited POIs that surround visited POIs. An aggregation layer is also added to the network to aggregate users' representations from different aspects into one. The intuition is that a person who has visited a location is very likely to return to that area in the future, so the user is recommended POIs surrounding it. For example, someone who has been to the UW plaza and bought Lazeez is very likely to return to the plaza, and is therefore recommended to try Mr. Panino's Beijing House.<br />
<br />
[[File:decoder.JPG|center]]<br />
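The neighbor-aware idea can be sketched with a plain RBF kernel over POI coordinates: an unvisited POI receives more geographical influence when it lies close to the user's checked-in POIs. The coordinates, bandwidth, and simple sum aggregation below are our illustrative assumptions, not the paper's exact formulation:<br />

```python
import math

# RBF (Gaussian) kernel between two 2-D POI locations; sigma controls how
# quickly influence decays with distance. All values here are made up.
def rbf(p, q, sigma=1.0):
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))

def neighbor_influence(unvisited, visited, sigma=1.0):
    # Aggregate (here: sum) the kernel values from all checked-in POIs.
    return sum(rbf(unvisited, v, sigma) for v in visited)

visited = [(0.0, 0.0), (1.0, 0.0)]
near = neighbor_influence((0.5, 0.0), visited)   # close to both check-ins
far = neighbor_influence((10.0, 10.0), visited)  # far from both
# near > far: nearby unvisited POIs receive more geographical influence
```
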
<br />
=== Objective Function ===<br />
<br />
By minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with backpropagation. After that, the training is complete.<br />
<br />
[[File:objective_function.JPG|center]]<br />
<br />
<br />
== Comparative analysis ==<br />
<br />
=== Metrics introduction ===<br />
To obtain a comprehensive evaluation of the effectiveness of the model, the authors performed a thorough comparison between the proposed model and the major existing POI recommendation methods. These methods fall into three categories: traditional matrix factorization methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Three key evaluation metrics were used: Precision@k, Recall@k, and MAP@k. Comparing all models on the three datasets using these metrics, the proposed model achieved the best performance.<br />
<br />
To better understand the comparison results, it is critical to understand the meaning of each evaluation metric. Suppose the model generates k recommended POIs for a user. The first metric, Precision@k, measures the fraction of the k recommended POIs that the user has actually visited. Recall@k is also based on the user's behavior, but instead measures the fraction of all POIs visited by the user that appear among the k recommendations. Lastly, MAP@k is the mean average precision at k, where average precision is the average of the precision values at the ranks within the top k at which relevant POIs are found.<br />
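These three metrics can be sketched directly in code. The implementations below follow the definitions in this paragraph, assuming `recommended` is ranked best-first and using one common convention for the MAP@k denominator (the paper may use a slightly different variant):<br />

```python
def precision_at_k(recommended, visited, k):
    # Fraction of the top-k recommendations that the user actually visited.
    hits = sum(1 for poi in recommended[:k] if poi in visited)
    return hits / k

def recall_at_k(recommended, visited, k):
    # Fraction of the user's visited POIs that appear in the top-k list.
    hits = sum(1 for poi in recommended[:k] if poi in visited)
    return hits / len(visited)

def map_at_k(recommended, visited, k):
    # Average of precision values at the ranks where relevant POIs appear.
    hits, precisions = 0, []
    for rank, poi in enumerate(recommended[:k], start=1):
        if poi in visited:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / min(len(visited), k) if precisions else 0.0

# Toy example: hits at ranks 1 and 3 out of k = 4 recommendations.
rec, vis = ["a", "b", "c", "d"], {"a", "c"}
```
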
<br />
=== Model Comparison ===<br />
Among all models in the comparison group, RankGeoFM, IRNMF, and PACE produced the best results. Nonetheless, these models are still not comparable to the proposed model, for the reasons explained in detail below:<br />
<br />
Both RankGeoFM and IRNMF incorporate geographical influence into their ranking models, which is significant for generating POI recommendations. However, they are not capable of capturing non-linear interactions between users and POIs. In comparison, the proposed model, while incorporating geographical influence, adopts a deep neural structure which enables it to measure non-linear and complex interactions. As a result, it outperforms the two methods in the comparison group.<br />
<br />
Moreover, compared to PACE, which is a deep learning-based method, the proposed model offers a more precise measurement of geographical influence. Though PACE is able to capture complex interactions, it models the geographical influence by a context graph, which fails to incorporate user reachability into the modeling process. In contrast, the proposed model is able to capture geographical influence directly through its neighbor-aware decoder, which allows it to achieve better performance than the PACE model.<br />
<br />
[[File:model_comparison.JPG|center]]<br />
<br />
== Conclusion ==<br />
In summary, the proposed model, namely SAE-NAD, clearly showed its advantages compared to many state-of-the-art baseline methods. Its self-attentive encoder effectively discriminates user preferences on check-in POIs, and its neighbor-aware decoder measures geographical influence precisely through differentiating user reachability on unvisited POIs. By leveraging these two components together, it is able to generate recommendations that are highly relevant to its users.<br />
<br />
== Critiques ==<br />
Besides developing the model and conducting a detailed analysis, the authors also did very well in constructing this paper. The paper is well-written and has a highly logical structure. Definitions, notations, and metrics are introduced and explained clearly, which enables readers to follow through the analysis easily. Last but not least, both the abstract and the conclusion of this paper are strong. The abstract concisely reported the objectives and outcomes of the experiment, whereas the conclusion is succinct and precise.<br />
<br />
This idea would have many applications such as in food delivery service apps to suggest new restaurants to customers.<br />
<br />
It would be nice if the author could show the comparison result in tables vs other methodologies. Both in terms of accuracy and time-efficiency. In addition, the drawbacks of this new methodology are unknown to the readers.<br />
<br />
It would also be nice if the authors provided some more ablation on the various components of the proposed method. Even after reading some of their experiments, we do not have a clear understanding of how important each component is to the recommendation quality.<br />
<br />
== References ==<br />
[1] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. 2014. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In KDD. ACM, 831–840.<br />
<br />
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In KDD. ACM, 1082–1090.<br />
<br />
[3] Yiding Liu, Tuan-Anh Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. PVLDB 10, 10 (2017), 1010–1021.<br />
<br />
[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed F. Mokbel. 2015. Recommendations in location-based social networks: a survey. GeoInformatica 19, 3 (2015), 525–565.<br />
<br />
[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.<br />
<br />
[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263–272.<br />
<br />
[7] Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In KDD. ACM, 659–667.<br />
<br />
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).<br />
<br />
[8] Yong Liu, Wei Wei, Aixin Sun, and Chunyan Miao. 2014. Exploiting Geographical Neighborhood Characteristics for Location Recommendation. In CIKM. ACM, 739–748.<br />
<br />
[9] Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham, and Shonali Krishnaswamy. 2015. Rank-GeoFM: A Ranking based Geographical Factorization Method for Point of Interest Recommendation. In SIGIR. ACM, 433–442.<br />
<br />
[10] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. 2017. Bridging<br />
Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for<br />
POI Recommendation. In KDD. ACM, 1245–1254.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence&diff=47366Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence2020-11-28T20:20:39Z<p>Y2587wan: /* Motivation */</p>
<hr />
<div>== Presented by == <br />
Guanting(Tony) Pan, Zaiwei(Zara) Zhang, Haocheng(Beren) Chang<br />
<br />
== Introduction == <br />
With the development of mobile devices and location-acquisition technologies, accessing real-time location information is being easier and more efficient. Precisely because of this development, Location-based Social Networks (LBSNs) has become an important part of human’s life. People can share their experiences in a location, such as restaurants and parks, on the Internet. These locations can be seen as a Point-of-Interest (POI) in software such as Maps on our phone. These large amounts of user-POI interaction data can provide a service, which is called personalized POI recommendation, to give recommendations to users that the location they might be interested in. These large amounts of data can be used to train a model through Machine Learning methods(i.e. Classification, Clustering, etc.) to predict a POI that users might be interested in. The POI recommendation system still faces some challenging issues: (1) the difficulty of modeling complex user-POI interactions from sparse implicit feedback; (2) it is difficult to incorporate geographic background information. In order to meet these challenges, this paper will introduce a novel autoencoder-based model to learn non-linear user-POI relations, which is called SAE-NAD. SAE stands for self-attentive encoder while NAD stands for the neighbor-aware decoder. Autoencoder is an unsupervised learning technique that we implement in neural network model for representation learning, meaning that our neural network will contain a "bottleneck" layer that produces a compressed knowledge representation of the original input. This method will include machine learning knowledge that we learned in this course.<br />
<br />
== Previous Work == <br />
<br />
In previous works, methods simply treated a user's checked-in POIs equally. The drawback of this equal treatment is that valuable information about the similarity between users is not utilized, which reduces the power of such recommenders. The SAE, in contrast, adaptively differentiates user preference degrees in multiple aspects.<br />
<br />
There are other personalized POI recommendation methods that can be used. Some well-known services (e.g., Netflix) use model-based methods built on matrix factorization (MF). For example, the ranking-based Geographical Factorization Method in [1] adopted weighted regularized MF for POI recommendation. Machine learning is popular in this area, and POI recommendation is an important topic in the domain of recommender systems [4]. This paper also describes related work on personalized location recommendation and on attention mechanisms in recommendation.<br />
<br />
== Motivation == <br />
This paper reviews the encoder and the decoder. A single hidden-layer autoencoder is an unsupervised neural network that consists of two parts: an encoder and a decoder. The encoder applies an activation function that maps the input data to the latent space. The decoder applies another activation function that maps the representations in the latent space back to the reconstruction space. The formulas are as follows:<br />
<br />
[[File: formula.png|center]](Note: a is the activation function)<br />
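The encoder-decoder mapping described above can be sketched as follows. This is a minimal illustration with random, untrained weights, assuming sigmoid activations for both mappings (the paper's activation <math>a</math> may differ); all variable names are ours.<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder: maps input x (dimension n) to latent code z (dimension d);
# decoder maps z back to the reconstruction space (dimension n).
def encode(x, W1, b1):
    return sigmoid(W1 @ x + b1)

def decode(z, W2, b2):
    return sigmoid(W2 @ z + b2)

rng = np.random.default_rng(0)
n, d = 8, 3                      # input and latent ("bottleneck") sizes
x = rng.random(n)
W1, b1 = rng.normal(size=(d, n)), np.zeros(d)
W2, b2 = rng.normal(size=(n, d)), np.zeros(n)

x_hat = decode(encode(x, W1, b1), W2, b2)   # reconstruction of x
assert x_hat.shape == x.shape
```

Training would adjust the weights and biases so that the reconstruction <math>\hat{x}</math> stays close to <math>x</math>.<br />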
<br />
The proposed method uses a two-layer neural network to compute the score matrix in the architecture of the SAE. The NAD adopts the RBF kernel so that checked-in POIs exert more influence on nearby unvisited POIs. To train this model, network training (gradient descent with backpropagation) is required.<br />
<br />
This paper uses real-world datasets from Gowalla [2], Foursquare [3], and Yelp [3]. These datasets are used to train the model introduced in this paper and to compare the performance of SAE-NAD with other POI recommendation methods. Three groups of methods are compared against the proposed method: traditional MF methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Specifically, the deep learning-based methods include DeepAE, a three-hidden-layer autoencoder with a weighted loss function, which we can connect to the material in this course.<br />
<br />
== Methodology == <br />
<br />
=== Notations ===<br />
<br />
Here are the notations used in this paper. It will be helpful when trying to understand the structure and equations in the algorithm.<br />
[[File:notations.JPG|500px|x300px|center]]<br />
<br />
=== Structure ===<br />
<br />
The structure of the network in this paper includes a self-attentive encoder as the input layer(yellow), and a neighbor-aware decoder as the output layer(green).<br />
<br />
[[File:1.JPG|1200px|x600px]]<br />
<br />
=== Self-Attentive Encoder ===<br />
<br />
The self-attentive encoder is the input layer. It transforms the preference vector <math>x_u</math> into the hidden representation <math>A_u</math> using the weight matrix <math>W^1</math> and the activation functions softmax and tanh. The 0's and 1's in <math>x_u</math> indicate whether the user has been to a certain POI. The weight matrix <math>W_a</math> assigns different weights to various features of POIs.<br />
<br />
[[File:encoder.JPG|center]]<br />
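The role of the attention weights can be sketched roughly as follows. This is a simplified, hypothetical single-layer version with made-up dimensions, not the paper's exact equations (which are shown in the image above).<br />

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
num_pois, d, aspects = 10, 4, 3
x_u = (rng.random(num_pois) > 0.5).astype(float)  # binary check-in vector

W1 = rng.normal(size=(d, num_pois))   # maps check-ins to a hidden vector
W_a = rng.normal(size=(aspects, d))   # attention weights over aspects

h = np.tanh(W1 @ x_u)   # hidden representation of the user's check-ins
A_u = softmax(W_a @ h)  # attention distribution over aspects (sums to 1)
```

The softmax output gives each aspect a non-negative weight, so the model can emphasize the aspects of a user's check-in history that matter most.<br />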
<br />
=== Neighbor-Aware Decoder ===<br />
<br />
POI recommendation exploits the geographical clustering phenomenon, which increases the weight of unvisited POIs that surround visited POIs. Also, an aggregation layer is added to the network to aggregate users' representations from different aspects into one. The intuition is that a person who has visited a location is very likely to return to it in the future, so the user is recommended POIs surrounding that area. For example, someone who has been to the UW plaza and bought Lazeez is very likely to return to the plaza, so that person is recommended to try Mr. Panino's Beijing House.<br />
<br />
[[File:decoder.JPG|center]]<br />
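The RBF-kernel weighting behind the neighbor-aware decoder can be illustrated with a small sketch. The 2-D coordinates and kernel bandwidth here are hypothetical; this shows only the distance-based influence idea, not the full decoder.<br />

```python
import numpy as np

def rbf(dist, gamma=0.5):
    # RBF kernel: nearby POIs get weight near 1, distant POIs near 0
    return np.exp(-gamma * dist ** 2)

visited = np.array([[0.0, 0.0], [1.0, 1.0]])     # checked-in POI coordinates
candidates = np.array([[0.1, 0.1], [5.0, 5.0]])  # unvisited candidate POIs

# Influence on each candidate = sum of kernel values to all visited POIs.
dists = np.linalg.norm(candidates[:, None, :] - visited[None, :, :], axis=-1)
influence = rbf(dists).sum(axis=1)
assert influence[0] > influence[1]  # the nearby candidate gets more influence
```

This captures the geographical clustering intuition: candidates close to a user's check-ins receive a larger boost than far-away ones.<br />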
<br />
=== Objective Function ===<br />
<br />
To minimize the objective function, the partial derivatives with respect to all the parameters are computed with backpropagation, and the parameters are updated by gradient descent. Once the objective converges, training is complete.<br />
<br />
[[File:objective_function.JPG|center]]<br />
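Since the objective appears only as an image, the following is a hedged sketch of a typical objective of this general form: a weighted squared reconstruction error plus L2 regularization, as is common for autoencoder-based implicit-feedback recommenders. The weighting scheme and constants here are illustrative assumptions, not the paper's exact objective.<br />

```python
import numpy as np

def objective(x, x_hat, params, pos_weight=5.0, lam=0.01):
    # Up-weight observed check-ins (x > 0) so reconstructing them matters more
    w = np.where(x > 0, pos_weight, 1.0)
    recon = np.sum(w * (x - x_hat) ** 2)            # weighted reconstruction
    reg = lam * sum(np.sum(p ** 2) for p in params) # L2 on all parameters
    return recon + reg

x = np.array([1.0, 0.0, 1.0, 0.0])       # binary check-in vector
x_hat = np.array([0.9, 0.2, 0.7, 0.1])   # model reconstruction
params = [np.array([[0.5, -0.3]])]       # toy parameter list
loss = objective(x, x_hat, params)
```

Minimizing a loss of this shape with gradient descent and backpropagation is the training procedure described above.<br />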
<br />
<br />
== Comparative analysis ==<br />
<br />
=== Metrics introduction ===<br />
To obtain a comprehensive evaluation of the effectiveness of the model, the authors performed a thorough comparison between the proposed model and the existing major POI recommendation methods. These methods fall into three categories: traditional matrix factorization methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Three key evaluation metrics were used: Precision@k, Recall@k, and MAP@k. Comparing all models on three datasets using these metrics, the proposed model achieved the best performance.<br />
<br />
To better understand the comparison results, it is critical to understand the meaning behind each evaluation metric. Suppose the model generates k recommended POIs for a user. The first metric, Precision@k, measures the fraction of the k recommended POIs that the user has actually visited. Recall@k is also associated with the user's behavior; however, it measures the fraction of all POIs visited by the user that appear among the k recommendations. Lastly, MAP@k is the mean average precision at k, where average precision is the average of the precision values at each rank up to k at which a relevant POI is found.<br />
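These three metrics can be implemented directly from their standard definitions; the following is a small reference sketch (the POI identifiers are hypothetical).<br />

```python
def precision_at_k(recommended, visited, k):
    # Fraction of the top-k recommendations the user actually visited
    return sum(1 for p in recommended[:k] if p in visited) / k

def recall_at_k(recommended, visited, k):
    # Fraction of the user's visited POIs found in the top-k recommendations
    return sum(1 for p in recommended[:k] if p in visited) / len(visited)

def map_at_k(recommended, visited, k):
    # Average of precision values at each rank where a relevant POI appears
    hits, total = 0, 0.0
    for i, p in enumerate(recommended[:k], start=1):
        if p in visited:
            hits += 1
            total += hits / i
    return total / min(len(visited), k)

recommended = ["A", "B", "C", "D"]   # ranked recommendations
visited = {"A", "C", "E"}            # ground-truth visited POIs
assert precision_at_k(recommended, visited, 4) == 0.5   # 2 of 4 are hits
assert recall_at_k(recommended, visited, 4) == 2 / 3    # 2 of 3 found
```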
<br />
=== Model Comparison ===<br />
Among all models in the comparison group, RankGeoFM, IRNMF, and PACE produced the best results. Nonetheless, these models still fall short of the proposed model. The reasons are explained in detail as follows:<br />
<br />
Both RankGeoFM and IRNMF incorporate geographical influence into their ranking models, which is significant for generating POI recommendations. However, they are not capable of capturing non-linear interactions between users and POIs. In comparison, the proposed model, while incorporating geographical influence, adopts a deep neural structure which enables it to measure non-linear and complex interactions. As a result, it outperforms the two methods in the comparison group.<br />
<br />
Moreover, compared to PACE, which is a deep learning-based method, the proposed model offers a more precise measurement of geographical influence. Though PACE is able to capture complex interactions, it models the geographical influence by a context graph, which fails to incorporate user reachability into the modeling process. In contrast, the proposed model is able to capture geographical influence directly through its neighbor-aware decoder, which allows it to achieve better performance than the PACE model.<br />
<br />
[[File:model_comparison.JPG|center]]<br />
<br />
== Conclusion ==<br />
In summary, the proposed model, namely SAE-NAD, clearly showed its advantages compared to many state-of-the-art baseline methods. Its self-attentive encoder effectively discriminates user preferences on check-in POIs, and its neighbor-aware decoder measures geographical influence precisely through differentiating user reachability on unvisited POIs. By leveraging these two components together, it is able to generate recommendations that are highly relevant to its users.<br />
<br />
== Critiques ==<br />
Besides developing the model and conducting a detailed analysis, the authors also did very well in constructing this paper. The paper is well-written and has a highly logical structure. Definitions, notations, and metrics are introduced and explained clearly, which enables readers to follow the analysis easily. Last but not least, both the abstract and the conclusion of this paper are strong. The abstract concisely reports the objectives and outcomes of the experiment, whereas the conclusion is succinct and precise.<br />
<br />
This idea would have many applications such as in food delivery service apps to suggest new restaurants to customers.<br />
<br />
It would be nice if the authors could show the comparison results against other methodologies in tables, both in terms of accuracy and time efficiency. In addition, the drawbacks of this new methodology are unknown to the readers.<br />
<br />
It would also be nice if the authors provided some more ablation on the various components of the proposed method. Even after reading some of their experiments, I do not have a clear understanding of how important each component is to the recommendation quality.<br />
<br />
== References ==<br />
[1] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. 2014. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In KDD. ACM, 831–840.<br />
<br />
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In KDD. ACM, 1082–1090.<br />
<br />
[3] Yiding Liu, Tuan-Anh Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. PVLDB 10, 10 (2017), 1010–1021.<br />
<br />
[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed F. Mokbel. 2015. Recommendations in location-based social networks: a survey. GeoInformatica 19, 3 (2015), 525–565.<br />
<br />
[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.<br />
<br />
[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263–272.<br />
<br />
[7] Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In KDD. ACM, 659–667.<br />
<br />
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).<br />
<br />
[8] Yong Liu, Wei Wei, Aixin Sun, and Chunyan Miao. 2014. Exploiting Geographical Neighborhood Characteristics for Location Recommendation. In CIKM. ACM, 739–748.<br />
<br />
[9] Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham, and Shonali Krishnaswamy. 2015. Rank-GeoFM: A Ranking based Geographical Factorization Method for Point of Interest Recommendation. In SIGIR. ACM, 433–442.<br />
<br />
[10] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. 2017. Bridging Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for POI Recommendation. In KDD. ACM, 1245–1254.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=47364Neural Speed Reading via Skim-RNN2020-11-28T20:18:02Z<p>Y2587wan: /* Critiques */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
A recurrent neural network (RNN) is an artificial neural network in which connections between nodes form a directed graph along a temporal sequence, giving the network dynamic temporal behavior. RNNs are derived from feedforward neural networks and can use their internal state (memory) to process variable-length input sequences. This makes them suitable for tasks such as unsegmented, connected handwriting recognition and speech recognition.<br />
<br />
In natural language processing, recurrent neural networks (RNNs) are a common architecture used to sequentially 'read' input tokens and output a distributed representation for each token. Because an RNN recurrently updates its hidden state in the same way at every step, it inherently incurs the same computational cost across time. However, some tokens are less important to the overall representation of a piece of text or a query than others. In particular, in question answering, the network will often encounter parts of a passage that are irrelevant to the query being asked.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'Skim-RNN', which 'skims' less important tokens or pieces of text rather than 'skipping' them entirely. This models the human ability to skim through passages, spending less time on parts that do not affect the reader's main objective. While skimming leads to some loss in comprehension [1], it greatly reduces reading time by not focusing on areas that do not significantly affect the reader's objective.<br />
<br />
Skim-RNN works by rapidly determining the significance of each input and spending less time processing unimportant input tokens, using a smaller RNN to update only a fraction of the hidden state. When the decision is to 'fully read', that is, not to skim, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function ('skim' or 'read') is non-differentiable, the authors use the Gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient) [3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of floating-point operations (Flop reduction, or Flop-R). A high skimming rate often leads to faster inference on CPUs, which makes the model very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Related Works ===<br />
<br />
As the popularity of neural networks has grown, significant attention has been given to making them faster and lighter. In particular, several related works have focused on reducing the computational cost of recurrent neural networks. For example, LSTM-Jump (Yu et al., 2017) [8] aims to speed up inference by skipping certain input tokens entirely, as opposed to skimming them. Choi et al. (2017) [9] proposed a model that uses a CNN-based sentence classifier to determine the sentence(s) most relevant to the question and then applies an RNN-based question-answering model. That model focuses on reducing GPU run-times (as opposed to Skim-RNN, which focuses on minimizing CPU time and Flops), and is also limited to question answering. <br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells: a default (big) RNN cell with hidden state size <math>d</math> and a small RNN cell with hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This reflects the design that the small RNN cell is used when text is to be skimmed and the larger one when the text should be processed as normal.<br />
<br />
Each RNN cell has its own set of weights and biases, and each can be any RNN variant. There is no requirement on how the RNN itself is structured; the core idea is to let the model dynamically decide which cell to use when processing each input token. Note that skipping text can be incorporated by setting <math>d'</math> to 0, meaning that when an input token is deemed irrelevant to the query or classification task, no information from that token is retained in the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has advantage that when the model decides to skim, then the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> due to previously defining <math> d' \ll d</math>.<br />
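The inference procedure above can be sketched as follows, with simple tanh cells standing in for the RNN cells <math>f</math> and <math>f'</math>, a greedy choice replacing sampling from the multinomial, and random, untrained weights (all hyperparameter values here are illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_small = 8, 2                                  # hidden sizes, d' << d
W_big = 0.1 * rng.normal(size=(d, 2 * d))          # full RNN cell weights
W_small = 0.1 * rng.normal(size=(d_small, 2 * d))  # small (skim) cell weights
W_p = 0.1 * rng.normal(size=(2, 2 * d))            # skim/read decision weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skim_rnn_step(x, h):
    """One Skim-RNN step: read (full update) or skim (partial update)."""
    p = softmax(W_p @ np.concatenate([x, h]))  # p[0]: read, p[1]: skim
    if p[0] >= p[1]:                           # greedy decision at inference
        return np.tanh(W_big @ np.concatenate([x, h]))
    h_small = np.tanh(W_small @ np.concatenate([x, h]))
    return np.concatenate([h_small, h[d_small:]])  # rest of state unchanged

h = np.zeros(d)
for _ in range(5):                  # process a short sequence of inputs
    h = skim_rnn_step(rng.normal(size=d), h)
```

When the skim branch is taken, only the small matrix-vector product is computed, which is the source of the <math>O(d'd)</math> cost noted above.<br />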
<br />
==== Training ====<br />
<br />
Since the expected loss/error of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of the variables. Defining the loss to be minimized while conditioning on a particular sequence of decisions<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, then the expected loss over the distribution of the sequence of decisions is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \Pi_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since calculating <math>\delta \mathbb{E}_{Q_t}[L(\theta)]</math> directly is rather infeasible, it is possible to approximate the gradients with a gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, then the back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math> and the approximation can arbitrarily approach <math> Q_t</math> by controlling the parameters. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\text{exp}(\log({\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\text{exp}(\log({\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1))</math> random variable and <math>\tau</math> is a parameter that represents a temperature. Then it can be rewritten that<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t</math> is the previous equation for <math>{\bf h}_t</math>. The temperature parameter gradually decreases with time, and <math>{\bf r}_t^i</math> becomes more discrete as it approaches 0.<br />
<br />
A final addition to the model is to encourage skimming when possible. Therefore an extra term related to the negative log probability of skimming and the sequence length. Therefore the final loss function used for the model is denoted by <br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) =L(\theta) + \gamma \cdot\frac{1}{T} \sum_i -\log({\bf \tilde{p}}^i_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.<br />
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require one’s full attention to every detail of the text, but rather ask for capturing the high-level information (classification) or focusing on a specific portion (QA) of the text, which a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input was a sequence of words and the output was the vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. We initialize the vector with GloVe [4] to form representations of the words and use those as the inputs for a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM and then a softmax function was applied to obtain the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, exp(−rn))</math> where <math>r = 1e-4</math> and <math>n</math> is the global training step, following [2]. We experiment on different sizes of big LSTM (<math>d \in \{100, 200\}</math>) and small LSTM (<math>d' \in \{5, 10, 20\}</math>) and the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for others. For all models, early stopping was used when the validation accuracy did not increase for 3000 global steps.<br />
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model produces a speed-up on the computational complexity of the task while maintaining a high degree of accuracy. Also, it is interesting to know that the accuracy improvement over LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Figure 2 meanwhile demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task over a IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies it with a high skimming rate (92%). The goal was to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant and the model learns to only carefully read the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In Stanford Question Answering Dataset, the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-present QA systems consisting of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN could replace RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pretrained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy scores compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing layers (1 layer) or hidden size (<math>d=5</math>) improved the computational cost but significantly decreases the accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably gives reduced computational cost without losing much accuracy (only 0.2% drop from 77.3% of BiDAF to 77.1% of Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
An explanation for this trend that was given is that the model is more confident about which tokens are important in the second layer. Second, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with its intended functionality.<br />
<br />
Figure 4 shows the F1 score of LSTM+Attention model using standard LSTM and Skim LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim LSTM (Red) outperforms standard LSTM (Blue) with a comparable computational cost. It can also be seen that the computational cost of Skim-LSTM is more stable across different configurations and computational cost. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skipping rate and Flop-R, while it also led to reduced accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The details of the runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speedup of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was assumed to be the default benchmark, which has a direct correlation with the number of float operations that can be performed per second. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), instead of popular frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with the inferences run on a single thread of CPU. The ratio between the reduction of the number of float operations (Flop-R) of LSTM and Skim-LSTM was plotted, with the ratio acting as a theoretical upper bound of the speed gain on CPUs. From here, it can be noticed that there is a gap between the actual gain and the theoretical gain in speed, with the gap being larger with more overhead of the framework or more parallelization. The gap also decreases as the hidden state size increases because the overhead becomes negligible with very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state size. However, combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than GPUs.<br />
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model provides features that are suitable for general reading tasks, which include classification and question answering. While the tables indicate that minor losses in accuracy occasionally did result when parameters were set at specific values, they were minor and were acceptable given the improvement in runtime.<br />
<br />
An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for<br />
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
<br />
Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
A Skim-RNN can offer better latency results on a CPU compared to a standard RNN on a GPU, with lower computational cost, as demonstrated through the results of this study. Future work (as stated by the authors) involves using Skim-RNN for applications that require much higher hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN it can replace RNNs in existing applications.<br />
<br />
== Critiques ==<br />
<br />
1. It seems like Skim-RNN is using the not full RNN of processing words that are not important, thus it can increase speed in some very particular circumstances (ie, only small networks). The extra model complexity did slow down the speed while trying to "optimizing" the efficiency and sacrifice part of accuracy while doing so. It is only trying to target a very specific situation (classification/question-answering) and made comparisons only with the baseline LSTM model. It would be definitely more persuasive if the model can compare with some of the state of art neural network models.<br />
<br />
2. This model of Skim-RNN is pretty good to extract binary classification type of text, thus it would be interesting for this to be applied to stock market news analyzing. For example, a press release from a company can be analyzed quickly using this model and immediately give the trader a positive or negative summary of the news. Would be beneficial in trading since time and speed is an important factor when executing a trade.<br />
<br />
3. An appropriate application for Skim-RNN could be customer service chatbots as they can analyze a customer's message and skim associated company policies to craft a response. In this circumstance, quickly analyzing text is ideal to not waste customers' time.<br />
<br />
4. This could be applied to news apps to improve readability by highlighting important sections.<br />
<br />
5. This summary describes an interesting and useful model which can save readers time for reading an article. I think it will be interesting that discuss more on training a model by Skim-RNN to highlight the important sections in very long textbooks. As a student, having highlights in the textbook is really helpful to study. But highlight the important parts in a time-consuming work for the author, maybe using Skim-RNN can provide a nice model to do this job. <br />
<br />
6. Besides the good training performance of Skim-RNN, it's good to see the algorithm even performs well simply by training with CPU. It would make it possible to perform the result on lite-platforms.<br />
<br />
== Applications ==<br />
<br />
Recurrent architectures are used in many other applications, such as for processing video. Real-time video processing is an exceedingly demanding and resource-constrained task, particularly in edge settings. It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars.<br />
<br />
== References ==<br />
<br />
[1] Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.<br />
<br />
[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=47360Neural Speed Reading via Skim-RNN2020-11-28T20:16:16Z<p>Y2587wan: /* Results */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
Recurrent Neural Network (RNN) is the connection between artificial neural network nodes forming a directed graph along with time series and has time dynamic behavior. RNN is derived from a feedforward neural network and can use its memory to process variable-length input sequences. This makes it suitable for tasks such as unsegmented, connected handwriting recognition, and speech recognition.<br />
<br />
In Natural Language Processing, recurrent neural networks (RNNs) are a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. By recurrently updating the hidden state of the neural network, an RNN can inherently require the same computational cost across time. However, when it comes to processing input tokens, it is usually the case that some tokens are less important to the overall representation of a piece of text or a query when compared to others. In particular, when considering question answering, many times the neural network will encounter parts of a passage that are irrelevant when it comes to answering a query that is being asked.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, or to spend less time reading parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces the amount of time spent reading by not focusing on areas that will not significantly affect efficiency when it comes to the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens by using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient)[3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). When the skimming rate is high, which often leads to faster inference on CPUs, which makes it very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Related Works ===<br />
<br />
As the popularity of neural networks has grown, significant attention has been given to make them faster and lighter. In particular, relevant work focused on reducing the computational cost of recurrent neural networks has been carried out by several other related works. For example, LSTM-Jump (You et al., 2017) [8] models aim to speed up run times by skipping certain input tokens, as opposed to skimming them. Choi et al. (2017)[9] proposed a model which uses a CNN-based sentence classifier to determine the most relevant sentence(s) to the question and then uses an RNN-based question-answering model. This model focuses on reducing GPU run-times (as opposed to Skim-RNN which focuses on minimizing CPU-time or Flop), and is also focused only on question answering. <br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows the fact that there should be a small RNN cell defined for when text is meant to be skimmed and a larger one for when the text should be processed as normal.<br />
<br />
Each RNN cell will have its own set of weights and bias as well as be any variant of an RNN. There is no requirement on how the RNN itself is structured, rather the core concept is to allow the model to dynamically make a decision as to which cell to use when processing input tokens. Note that skipping text can be incorporated by setting <math>d'</math> to 0, which means that when the input token is deemed irrelevant to a query or classification task, nothing about the information in the token is retained within the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has advantage that when the model decides to skim, then the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> due to previously defining <math> d' \ll d</math>.<br />
<br />
==== Training ====<br />
<br />
Since the expected loss/error of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of the variables. Defining the loss to be minimized while conditioning on a particular sequence of decisions<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, then the expected loss over the distribution of the sequence of decisions is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \Pi_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since calculating <math>\delta \mathbb{E}_{Q_t}[L(\theta)]</math> directly is rather infeasible, it is possible to approximate the gradients with a gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, then the back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math> and the approximation can arbitrarily approach <math> Q_t</math> by controlling the parameters. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\text{exp}(\log({\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\text{exp}(\log({\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1))</math> random variable and <math>\tau</math> is a parameter that represents a temperature. Then it can be rewritten that<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t</math> is the previous equation for <math>{\bf h}_t</math>. The temperature parameter gradually decreases with time, and <math>{\bf r}_t^i</math> becomes more discrete as it approaches 0.<br />
<br />
A final addition to the model is to encourage skimming when possible. Therefore an extra term related to the negative log probability of skimming and the sequence length. Therefore the final loss function used for the model is denoted by <br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) =L(\theta) + \gamma \cdot\frac{1}{T} \sum_i -\log({\bf \tilde{p}}^i_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.<br />
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require one’s full attention to every detail of the text, but rather ask for capturing the high-level information (classification) or focusing on a specific portion (QA) of the text, which a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input was a sequence of words and the output was the vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. We initialize the vector with GloVe [4] to form representations of the words and use those as the inputs for a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM and then a softmax function was applied to obtain the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, exp(−rn))</math> where <math>r = 1e-4</math> and <math>n</math> is the global training step, following [2]. We experiment on different sizes of big LSTM (<math>d \in \{100, 200\}</math>) and small LSTM (<math>d' \in \{5, 10, 20\}</math>) and the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for others. For all models, early stopping was used when the validation accuracy did not increase for 3000 global steps.<br />
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model produces a speed-up on the computational complexity of the task while maintaining a high degree of accuracy. Also, it is interesting to know that the accuracy improvement over LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Figure 2 meanwhile demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=47357Neural Speed Reading via Skim-RNN2020-11-28T20:13:49Z<p>Y2587wan: /* Training */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
A recurrent neural network (RNN) is an artificial neural network whose node connections form a directed graph along a temporal sequence, giving it dynamic temporal behaviour. An RNN is derived from a feedforward neural network and can use its internal memory to process variable-length input sequences. This makes it suitable for tasks such as unsegmented, connected handwriting recognition and speech recognition.<br />
<br />
In Natural Language Processing, recurrent neural networks (RNNs) are a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. Because an RNN recurrently updates its hidden state, it inherently incurs the same computational cost at every time step. However, some input tokens are usually less important to the overall representation of a piece of text or a query than others. In particular, in question answering, the network will often encounter parts of a passage that are irrelevant to the query being asked.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'Skim-RNN', which ‘skims’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, spending less time on parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces reading time by not dwelling on passages that are unimportant to the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens, using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is, to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient) [3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). A high skimming rate often leads to faster inference on CPUs, which makes Skim-RNN very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Related Works ===<br />
<br />
As the popularity of neural networks has grown, significant attention has been given to making them faster and lighter. In particular, several related works have focused on reducing the computational cost of recurrent neural networks. For example, the LSTM-Jump model (Yu et al., 2017) [8] aims to speed up run times by skipping certain input tokens, as opposed to skimming them. Choi et al. (2017) [9] proposed a model which uses a CNN-based sentence classifier to determine the sentence(s) most relevant to the question and then uses an RNN-based question-answering model. That model focuses on reducing GPU run-time (as opposed to Skim-RNN, which focuses on minimizing CPU time and Flops), and is also focused only on question answering. <br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells: a default (big) RNN cell of hidden state size <math>d</math> and a small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This reflects the design: a small RNN cell handles text that is meant to be skimmed, and a larger one handles text that should be processed as normal.<br />
<br />
Each RNN cell has its own set of weights and biases and can be any RNN variant. There is no requirement on how the RNN itself is structured; the core concept is to allow the model to dynamically decide which cell to use when processing each input token. Note that skipping text can be incorporated by setting <math>d'</math> to 0, so that when an input token is deemed irrelevant to a query or classification task, no information from the token is retained by the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating-point operations to process the token. Additionally, higher accuracy and computational efficiency are achieved. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has the advantage that when the model decides to skim, the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> since <math> d' \ll d</math>.<br />
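The inference step above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the weights are random rather than learned, plain tanh cells stand in for the LSTM cells <math>f</math> and <math>f'</math>, and the sizes <math>d = 8</math>, <math>d' = 2</math> are made up.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d, dp = 8, 2                    # big and small hidden sizes, with d' << d

# Hypothetical random parameters; a trained Skim-RNN would learn all of these.
W  = rng.normal(0.0, 0.1, (2, 2 * d))    # decision weights (k = 2 choices)
b  = np.zeros(2)                         # decision bias
Wf = rng.normal(0.0, 0.1, (d,  2 * d))   # full 'read' cell f
Ws = rng.normal(0.0, 0.1, (dp, 2 * d))   # small 'skim' cell f'

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skim_rnn_step(x, h):
    """One Skim-RNN step; plain tanh cells stand in for the LSTM cells."""
    xh = np.concatenate([x, h])
    p = softmax(W @ xh + b)          # p_t over {read, skim}
    q = rng.choice(2, p=p)           # Q_t ~ Multinomial(p_t); 0 = read, 1 = skim
    if q == 0:
        h = np.tanh(Wf @ xh)         # read: update the entire d-dimensional state
    else:
        h = h.copy()
        h[:dp] = np.tanh(Ws @ xh)    # skim: update only the first d' dimensions
    return h, q

h, skims = np.zeros(d), 0
for _ in range(20):                  # a toy 20-token sequence
    h, q = skim_rnn_step(rng.normal(size=d), h)
    skims += (q == 1)
print(h.shape, skims)
```

Note that the hidden state keeps its full size <math>d</math> throughout; a skim step simply rewrites only its first <math>d'</math> entries, which is where the <math>O(d'd)</math> per-step cost comes from.<br />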
<br />
==== Training ====<br />
<br />
Since the loss of the model depends on the sequence of random decisions <math> \{Q_t\} </math>, the expected loss is minimized over the distribution of these variables. Let the loss conditioned on a particular sequence of decisions be<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>. The expected loss over the distribution of decision sequences is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \prod_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since directly calculating the gradient <math>\nabla_\theta\, \mathbb{E}[L(\theta)]</math> is infeasible, the gradients can be approximated with a gumbel-softmax distribution [2]. Reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, back-propagation can flow to <math> {\bf p}_t</math> without being blocked by <math> Q_t</math>, and the approximation can be made arbitrarily close to <math> Q_t</math> by controlling the temperature parameter. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\exp((\log{\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\exp((\log{\bf p}_t^j + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1)))</math> distribution and <math>\tau</math> is a temperature parameter. Then it can be rewritten that<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t^i</math> denotes the candidate hidden state produced by choice <math>i</math> (read or skim) in the earlier equation for <math>{\bf h}_t</math>. The temperature parameter gradually decreases with time, and <math>{\bf r}_t^i</math> becomes more discrete as <math>\tau</math> approaches 0.<br />
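The reparameterization can be illustrated with a short NumPy sketch (an illustrative example with made-up probabilities, not the training code): as the temperature <math>\tau</math> shrinks, the sampled vector <math>{\bf r}_t</math> approaches a discrete one-hot draw from <math>{\bf p}_t</math>, while remaining a smooth function of <math>{\bf p}_t</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def gumbel_softmax(p, tau):
    """Draw a relaxed (soft) one-hot sample r from class probabilities p."""
    g = -np.log(-np.log(rng.uniform(size=p.shape)))   # Gumbel(0, 1) noise
    z = (np.log(p) + g) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

p = np.array([0.7, 0.3])            # e.g. p_t over {read, skim}
soft = gumbel_softmax(p, tau=5.0)   # high temperature: smooth, far from one-hot
hard = gumbel_softmax(p, tau=0.05)  # low temperature: close to a discrete draw
print(soft, hard)
```

Because <math>{\bf r}_t</math> is differentiable in <math>{\bf p}_t</math>, gradients can flow through the weighted sum <math>{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i</math> during training.<br />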
<br />
A final addition encourages skimming when possible: an extra term, proportional to the mean negative log probability of skimming over the sequence, is added. The final loss function used for the model is therefore <br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) = L(\theta) + \gamma \cdot \frac{1}{T} \sum_t -\log({\bf p}^2_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> controls the ratio between the main loss and the skim loss, and <math>{\bf p}^2_t</math> is the probability of skimming at time <math>t</math>.<br />
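As a minimal sketch, the combined objective can be computed as follows (the main loss value and the per-token skim probabilities here are made-up numbers):<br />

```python
import numpy as np

def skim_loss(main_loss, p_skim, gamma):
    """L'(θ) = L(θ) + γ · (1/T) Σ_t -log(p_t^skim): penalizes reluctance to skim."""
    return main_loss + gamma * np.mean(-np.log(p_skim))

# Hypothetical per-token probabilities of choosing the small (skim) cell
p_skim = np.array([0.9, 0.8, 0.2, 0.95])
total = skim_loss(main_loss=1.25, p_skim=p_skim, gamma=0.01)
print(total)
```

With larger <math>\gamma</math>, tokens with a low skim probability are penalized more heavily, pushing the model toward a higher skimming rate.<br />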
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float-operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require full attention to every detail of the text, but rather ask for capturing high-level information (classification) or focusing on a specific portion (QA) of the text, which is a common context for speed reading. The tasks are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input was a sequence of words and the output was the vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. The vectors are initialized with GloVe [4] to form word representations, which are used as inputs to a long short-term memory (LSTM) architecture. A linear transformation on the last hidden state of the LSTM followed by a softmax function was applied to obtain the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, \exp(-rn))</math> where <math>r = 10^{-4}</math> and <math>n</math> is the global training step, following [2]. Experiments were run with different sizes of the big LSTM (<math>d \in \{100, 200\}</math>) and small LSTM (<math>d' \in \{5, 10, 20\}</math>) and different ratios between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>). The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for the others. For all models, early stopping was used when the validation accuracy did not increase for 3000 global steps.<br />
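The temperature schedule is easy to trace numerically; with <math>r = 10^{-4}</math> it decays from 1 and hits the floor of 0.5 after about <math>\ln(2)/r \approx 6931</math> global steps (a quick sketch, not the training code):<br />

```python
import numpy as np

r = 1e-4                             # decay rate from the setup above
for n in (0, 1_000, 10_000, 50_000): # global training steps
    tau = max(0.5, np.exp(-r * n))   # tau = max(0.5, exp(-r·n))
    print(n, tau)
```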
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. The Skim-RNN model clearly speeds up the task while maintaining a high degree of accuracy. Interestingly, the accuracy improvement over LSTM could be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Figure 2, meanwhile, demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task on the IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies a review with a high skimming rate (92%). The goal was to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant, and the model learns to carefully read only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In the Stanford Question Answering Dataset (SQuAD), the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most QA systems of the time, consisting of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN can replace the RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pretrained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of the LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. The skimming models achieve higher or similar accuracy compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, decreasing the number of layers (1 layer) or the hidden size (<math>d=5</math>) improves the computational cost but significantly decreases accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably reduces computational cost without losing much accuracy (only a 0.2% drop, from 77.3% for BiDAF to 77.1% for Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
One explanation given for this trend is that the model is more confident about which tokens are important in the second layer. In addition, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with the intended functionality of the skim loss.<br />
<br />
Figure 4 shows the F1 scores of the LSTM+Attention model using standard LSTM and Skim-LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms standard LSTM (blue) at a comparable computational cost. The computational cost of Skim-LSTM is also more stable across different configurations. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skimming rate and Flop-R, while also reducing accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The runtime benchmarks for LSTM and Skim-LSTM, used to estimate the speedup of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was taken as the default, since CPU runtime correlates directly with the number of float operations performed. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), instead of popular frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate, using NumPy with inference run on a single CPU thread. The ratio between the float-operation counts (Flop-R) of LSTM and Skim-LSTM is also plotted, acting as a theoretical upper bound on the speed gain on CPUs. There is a gap between the actual and theoretical speed gains, and the gap is larger with more framework overhead or more parallelization. The gap decreases as the hidden state size increases, because the overhead becomes negligible for very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state sizes. Notably, combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than running on GPUs.<br />
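A back-of-envelope version of this accounting can be sketched as follows. The constants are rough assumptions (about <math>8d^2</math> multiply-adds per full LSTM step and <math>8dd'</math> per skim step, plus a small decision layer), not the paper's exact Flop counting, but they show how the theoretical upper bound on the speed gain grows with the skim rate:<br />

```python
def flop_ratio(d, d_small, skim_rate):
    """Rough theoretical Flop reduction of Skim-LSTM vs. LSTM.

    Assumes ~8·d² multiply-adds for a full LSTM step (four gates, input and
    hidden each of size d), ~8·d·d' for a skim step, and ~4·d for the 2-way
    decision layer in both cases. These constants are illustrative only.
    """
    full = 8 * d * d + 4 * d
    skim = 8 * d * d_small + 4 * d
    avg = (1 - skim_rate) * full + skim_rate * skim
    return full / avg

# e.g. d = 200, d' = 10, 90% skim rate
print(round(flop_ratio(200, 10, 0.9), 2))  # ≈ 6.8
```

As the figure suggests, this bound is only approached on hardware where runtime is dominated by float operations; framework overhead and parallelization shrink the realized gain.<br />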
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model provides features that are suitable for general reading tasks, including classification and question answering. While the tables indicate that small losses in accuracy occasionally resulted at specific parameter settings, these were acceptable given the improvement in runtime.<br />
<br />
An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for<br />
‘skim’ decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
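This inference-time control amounts to replacing the random draw of <math>Q_t</math> with a deterministic threshold on the read probability <math>{\bf p}^1_t</math>. A minimal sketch with hypothetical per-token probabilities:<br />

```python
import numpy as np

def decide(p_read, read_threshold=0.5):
    """Deterministic skim decision at inference: read only if p_t^1 clears the bar."""
    return 1 if p_read >= read_threshold else 2   # 1 = read, 2 = skim

# Hypothetical per-token read probabilities for a short sequence
p_read = np.array([0.9, 0.35, 0.55, 0.1, 0.75])
for th in (0.4, 0.6, 0.8):
    decisions = [decide(p, th) for p in p_read]
    skim_rate = decisions.count(2) / len(decisions)
    print(th, skim_rate)
```

Raising the threshold makes the model skim more tokens, trading accuracy for computational cost without any retraining.<br />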
<br />
Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
As demonstrated through the results of this study, a Skim-RNN can offer lower latency on a CPU compared to a standard RNN on a GPU, with lower computational cost. Future work (as stated by the authors) involves using Skim-RNN for applications that require a much larger hidden state, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN, it can replace RNNs in existing applications.<br />
<br />
== Critiques ==<br />
<br />
1. Skim-RNN uses a reduced RNN to process unimportant words, so it can only increase speed in particular circumstances (i.e., relatively small networks). The extra model complexity can slow things down while "optimizing" efficiency, and some accuracy is sacrificed in the process. The paper targets only specific situations (classification/question answering) and compares only against a baseline LSTM model; it would be more persuasive if the model were compared with state-of-the-art neural network models.<br />
<br />
2. Skim-RNN is well suited to binary text classification, so it would be interesting to apply it to analyzing stock-market news. For example, a company press release could be analyzed quickly using this model to immediately give a trader a positive or negative summary of the news. This would be beneficial in trading, since speed is an important factor when executing a trade.<br />
<br />
3. An appropriate application for Skim-RNN could be customer-service chat bots, which can analyze a customer's message and skim associated company policies to craft a response. In this circumstance, quickly analyzing text is ideal so as not to waste customers' time.<br />
<br />
4. This could be applied to news apps to improve readability by highlighting important sections.<br />
<br />
5. This summary describes an interesting and useful model that can save readers time when reading an article. It would be interesting to discuss training a Skim-RNN to highlight the important sections of very long textbooks. As a student, having highlights in a textbook is really helpful for studying, but highlighting the important parts is time-consuming work for the author; Skim-RNN might provide a nice model for this job. <br />
<br />
6. Besides its good training performance, it is good to see that Skim-RNN performs well even when trained on a CPU, which would make it possible to run it on lightweight platforms.<br />
<br />
== Applications ==<br />
<br />
Recurrent architectures are used in many other applications, such as for processing video. Real-time video processing is an exceedingly demanding and resource-constrained task, particularly in edge settings. It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars.<br />
<br />
== References ==<br />
<br />
[1] Marcel Adam Just and Patricia A. Carpenter. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.<br />
<br />
[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=47355Neural Speed Reading via Skim-RNN2020-11-28T20:13:02Z<p>Y2587wan: /* Inference */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
A recurrent neural network (RNN) is an artificial neural network in which the connections between nodes form a directed graph along a temporal sequence, giving it dynamic temporal behavior. Derived from the feedforward neural network, an RNN can use its internal state (memory) to process variable-length input sequences. This makes it suitable for tasks such as unsegmented, connected handwriting recognition and speech recognition.<br />
<br />
In Natural Language Processing, recurrent neural networks (RNNs) are a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. Because an RNN recurrently updates a fixed-size hidden state, it inherently spends the same computational cost on every token. However, some tokens matter less to the overall representation of a piece of text or a query than others. In question answering in particular, the network often encounters parts of a passage that are irrelevant to the query being asked.<br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, or to spend less time reading parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces the amount of time spent reading by not focusing on areas that will not significantly affect efficiency when it comes to the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens, using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is, not to skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient) [3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). A high skimming rate therefore often leads to faster inference on CPUs, which makes Skim-RNN very useful for large-scale products and small devices.<br />
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Related Works ===<br />
<br />
As the popularity of neural networks has grown, significant attention has been given to making them faster and lighter, and several related works have focused on reducing the computational cost of recurrent neural networks. For example, LSTM-Jump (Yu et al., 2017) [8] aims to speed up run times by skipping certain input tokens entirely, as opposed to skimming them. Choi et al. (2017) [9] proposed a model which uses a CNN-based sentence classifier to determine the sentence(s) most relevant to the question and then applies an RNN-based question-answering model to them. That model focuses on reducing GPU run-time (as opposed to Skim-RNN, which focuses on minimizing CPU time and Flops), and it is restricted to question answering. <br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows the fact that there should be a small RNN cell defined for when text is meant to be skimmed and a larger one for when the text should be processed as normal.<br />
<br />
Each RNN cell has its own set of weights and biases and can be any variant of an RNN. There is no requirement on how the RNN itself is structured; rather, the core idea is to let the model dynamically decide which cell to use when processing each input token. Note that skipping text can be recovered as a special case by setting <math>d'</math> to 0: when an input token is deemed irrelevant to the query or classification task, no information from the token is retained in the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, since the smaller RNN requires fewer floating-point operations per token, while accuracy is maintained or even improved. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case, <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has the advantage that when the model decides to skim, the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> since <math> d' \ll d</math>.<br />
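The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the weights are random, plain tanh cells stand in for the LSTM cells used in the paper, and the read/skim decision is made greedily from <math>{\bf p}_t</math>, as one would at inference time.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_small = 8, 2          # hypothetical big/small hidden sizes (d' << d)

# Decision parameters: W in R^{2 x 2d}, b in R^2 (k = 2 choices: read, skim)
W = rng.normal(scale=0.1, size=(2, 2 * d))
b = np.zeros(2)

# Hypothetical cells; the paper uses LSTM variants, tanh stands in here.
Wf = rng.normal(scale=0.1, size=(d, 2 * d))        # full cell: R^{2d} -> R^d
Ws = rng.normal(scale=0.1, size=(d_small, 2 * d))  # small cell: R^{2d} -> R^{d'}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def skim_rnn_step(x, h):
    """One inference step: decide to read (Q=1) or skim (Q=2), then update h."""
    xh = np.concatenate([x, h])             # [x_t; h_{t-1}] in R^{2d}
    p = softmax(W @ xh + b)                 # p_t over {read, skim}
    q = 1 if p[0] >= p[1] else 2            # greedy decision at inference
    if q == 1:
        h_new = np.tanh(Wf @ xh)            # full update, O(d^2)
    else:
        h_new = h.copy()                    # h_{t-1}(d'+1 : d) is untouched
        h_new[:d_small] = np.tanh(Ws @ xh)  # update first d' dims, O(d'd)
    return h_new, q

h = np.zeros(d)
for _ in range(5):
    h, q = skim_rnn_step(rng.normal(size=d), h)
```

Note how the skim branch leaves the upper <math>d - d'</math> components of the hidden state exactly as they were, which is where the Flop savings come from.<br />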
<br />
==== Training ====<br />
<br />
Since the expected loss/error of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the loss is minimized with respect to the distribution of the variables. Defining the loss to be minimized while conditioning on a particular sequence of decisions<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>, the expected loss over the distribution of the sequence of decisions is<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \prod_j {\bf p}_j^{Q_j}
</math><br />
</div><br />
<br />
Since calculating the gradient <math>\nabla_\theta \mathbb{E}[L(\theta)]</math> directly is infeasible, the gradients are approximated with the gumbel-softmax distribution [2]. By reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, back-propagation can flow to <math> {\bf p}_t</math> without being blocked by the discrete variable <math> Q_t</math>, and the approximation can be made arbitrarily close to <math> Q_t</math> by controlling the temperature parameter. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\exp((\log {\bf p}_t^i + {g_t}^i)/\tau)}{\sum_j\exp((\log {\bf p}_t^j + {g_t}^j)/\tau)}
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1)))</math> distribution and <math>\tau</math> is a temperature parameter. Then it can be rewritten that<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t^i</math> is the candidate hidden state given by the previous equation for <math>{\bf h}_t</math> under decision <math>Q_t = i</math>. The temperature parameter gradually decreases during training, and <math>{\bf r}_t</math> becomes more discrete (closer to one-hot) as <math>\tau</math> approaches 0.<br />
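As a concrete sketch of the reparameterization (a hypothetical NumPy version, not the authors' code), the sampling step looks like this:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(p, tau):
    """Sample a relaxed one-hot vector r from probabilities p at temperature tau."""
    g = -np.log(-np.log(rng.uniform(size=p.shape)))  # Gumbel(0, 1) noise
    z = (np.log(p) + g) / tau
    z = z - z.max()                                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

p = np.array([0.7, 0.3])               # p_t: probabilities of read vs skim
r_warm = gumbel_softmax(p, tau=1.0)    # smooth sample early in training
r_cold = gumbel_softmax(p, tau=1e-3)   # nearly one-hot as tau -> 0
```

At high temperature the sample spreads mass over both choices, so gradients flow to both candidate hidden states; as <math>\tau</math> is annealed toward 0 the sample concentrates on a single choice, recovering the hard decision.<br />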
<br />
A final addition to the model is a term that encourages skimming when possible: the negative log probability of skimming, averaged over the sequence length, is added to the loss. The final loss function used for the model is therefore <br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) = L(\theta) + \gamma \cdot\frac{1}{T} \sum_t -\log({\bf \tilde{p}}^2_t)
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming.<br />
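Concretely, given <math>\gamma</math> and the per-step skim probabilities, the regularized loss is straightforward to compute. The values below are made up for illustration only:<br />

```python
import numpy as np

def total_loss(main_loss, p_skim, gamma):
    """L'(θ) = L(θ) + γ · (1/T) · Σ_t −log(p̃_t^skim). The second term shrinks
    as the model assigns higher probability to skimming, encouraging skims."""
    return main_loss + gamma * np.mean(-np.log(p_skim))

# Hypothetical values: T = 3 timesteps, one skim probability per step.
loss = total_loss(main_loss=1.2, p_skim=np.array([0.9, 0.5, 0.8]), gamma=0.01)
```

Because the probabilities are at most 1, the regularizer is non-negative, and <math>\gamma</math> controls how strongly skimming is rewarded relative to the main objective.<br />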
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float-operation reduction on four classification tasks and a question-answering task. These tasks were chosen because they do not require full attention to every detail of the text, but rather call for capturing high-level information (classification) or focusing on a specific portion of the text (QA), which is a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In each language classification task, the input was a sequence of words and the output was a vector of categorical probabilities. Each word was embedded into a <math>d</math>-dimensional vector, initialized with GloVe [4], and these representations were used as inputs to a long short-term memory (LSTM) architecture. A linear transformation of the last hidden state of the LSTM, followed by a softmax function, produced the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, \exp(-rn))</math> where <math>r = 10^{-4}</math> and <math>n</math> is the global training step, following [2]. The authors experimented with different sizes of the big LSTM (<math>d \in \{100, 200\}</math>) and the small LSTM (<math>d' \in \{5, 10, 20\}</math>), and with the ratio between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch size was 32 for SST and Rotten Tomatoes, and 128 for the other datasets. For all models, early stopping was applied when the validation accuracy did not increase for 3000 global steps.<br />
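The temperature schedule <math>\tau = \max(0.5, \exp(-rn))</math> can be written directly; a trivial sketch:<br />

```python
import math

def temperature(n, r=1e-4, floor=0.5):
    """Gumbel-softmax temperature schedule: tau = max(floor, exp(-r * n)),
    where n is the global training step. Decays from 1.0 toward the floor."""
    return max(floor, math.exp(-r * n))
```

Early in training (`n = 0`) this gives <math>\tau = 1.0</math> and smooth samples; once <math>\exp(-rn)</math> drops below 0.5 the temperature is clamped at the floor, keeping gradients usable rather than letting the relaxation collapse entirely.<br />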
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and computational cost of the Skim-RNN model compared with other standard models. The Skim-RNN model clearly speeds up the task while maintaining a high degree of accuracy. Interestingly, the accuracy improvement over LSTM may be due to the increased stability of the hidden state, as the majority of the hidden state is not updated when skimming. Figure 2 demonstrates the effect of varying the size of the small hidden state, as well as the parameter <math>\gamma</math>, on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task on the IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> classifies the review correctly with a high skimming rate (92%). The goal was to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant, and the model learns to carefully read only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In the Stanford Question Answering Dataset (SQuAD), the task is to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN on SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most contemporary QA systems, consisting of multiple LSTM layers and an attention mechanism; it is complex enough to reach reasonable accuracy on the dataset, yet simple enough to allow well-controlled analyses of the Skim-RNN. The second is an open-source model designed for SQuAD, used primarily to show that Skim-RNN can replace the RNN in an existing complex system.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pretrained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of the LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. The skimming models achieve higher or similar accuracy compared to the non-skimming models while reducing the computational cost by more than 1.4 times. In contrast, decreasing the number of layers (to 1) or the hidden size (to <math>d=5</math>) improves the computational cost but significantly decreases accuracy compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) reliably reduces computational cost without losing much accuracy (only a 0.2% drop, from 77.3% for BiDAF to 77.1% for Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
The authors attribute this trend to the model being more confident about which tokens are important in the second layer. In addition, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with the intended function of the skim regularizer.<br />
<br />
Figure 4 shows the F1 score of the LSTM+Attention model using standard LSTM and Skim-LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms standard LSTM (blue) at comparable computational cost. Skim-LSTM is also more stable across different configurations and computational costs. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skimming rate and Flop-R, at the cost of some accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speedup of the Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was taken as the default, since CPU latency correlates directly with the number of float operations performed. As mentioned previously, the speed-up results in Table 2 (as well as Figure 7) were benchmarked in Python (NumPy), rather than in frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with the inferences run on a single thread of CPU. The ratio between the reduction of the number of float operations (Flop-R) of LSTM and Skim-LSTM was plotted, with the ratio acting as a theoretical upper bound of the speed gain on CPUs. From here, it can be noticed that there is a gap between the actual gain and the theoretical gain in speed, with the gap being larger with more overhead of the framework or more parallelization. The gap also decreases as the hidden state size increases because the overhead becomes negligible with very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with larger hidden state size. However, combining Skim-RNN with a CPU-based framework can lead to substantially lower latency than GPUs.<br />
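A rough back-of-the-envelope model makes the theoretical Flop reduction concrete. This sketch assumes the per-step cost of an LSTM cell is dominated by its four gate matrix multiplications and ignores the small decision network and all framework overhead, so it is the kind of upper bound plotted in Figure 7, not a measured speedup:<br />

```python
def flop_reduction(d, d_small, skim_rate, d_in=None):
    """Theoretical Flop reduction of Skim-LSTM vs. a standard LSTM.

    Assumes a cell with hidden size h and input size d_in costs roughly
    4 * h * (h + d_in) multiply-adds per step (the four gate matmuls).
    """
    d_in = d if d_in is None else d_in
    full = 4 * d * (d + d_in)           # cost of a 'read' step
    small = 4 * d_small * (d + d_in)    # cost of a 'skim' step
    avg = skim_rate * small + (1 - skim_rate) * full
    return full / avg

# E.g. the paper's d = 200, d' = 10 setting at a 90% skim rate:
speedup = flop_reduction(d=200, d_small=10, skim_rate=0.9)
```

As the text notes, the actual CPU speedup falls short of this bound, with the gap shrinking as the hidden state grows and the matmuls dominate the overhead.<br />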
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model is well suited to general reading tasks, including classification and question answering. While the tables show that small losses in accuracy occasionally resulted at specific parameter values, these losses were minor and acceptable given the improvement in runtime.<br />
<br />
An important advantage of Skim-RNN is that the skim rate (and thus the computational cost) can be dynamically controlled at inference time by adjusting the threshold on the decision probability <math>{\bf p}^1_t</math>. Figure 5 shows the trade-off between accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
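This inference-time knob amounts to a one-line decision rule (a hypothetical sketch; `threshold` is the tunable parameter):<br />

```python
def decide(p_read, threshold=0.5):
    """Return 1 ('read') if the read probability clears the threshold,
    else 2 ('skim'). Raising the threshold raises the skim rate, trading
    accuracy for computational cost at inference time, with no retraining."""
    return 1 if p_read >= threshold else 2
```

No retraining is needed: the same trained model can be run cheaply (high threshold) on a small device or accurately (low threshold) on a server.<br />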
<br />
Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, which was as expected by the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
A Skim-RNN can offer lower latency on a CPU than a standard RNN on a GPU, with lower computational cost, as demonstrated by the results of this study. Future work (as stated by the authors) involves applying Skim-RNN to tasks that require a much larger hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN, it can replace RNNs in existing applications.<br />
<br />
== Critiques ==<br />
<br />
1. Skim-RNN uses a reduced RNN to process unimportant words, so it can only increase speed in particular circumstances (i.e., relatively small networks). The extra model complexity slows things down while "optimizing" efficiency, and some accuracy is sacrificed in the process. The paper targets a fairly specific setting (classification/question answering) and compares only against a baseline LSTM model. It would be more persuasive if the model were also compared with some state-of-the-art neural network models.<br />
<br />
2. The Skim-RNN model is well suited to binary classification of text, so it would be interesting to apply it to stock-market news analysis. For example, a press release from a company could be analyzed quickly using this model, immediately giving the trader a positive or negative summary of the news. This would be beneficial in trading, since time and speed are important factors when executing a trade.<br />
<br />
3. An appropriate application for Skim-RNN could be customer-service chatbots, which could analyze a customer's message and skim the associated company policies to craft a response. In this setting, quickly analyzing text is ideal so as not to waste customers' time.<br />
<br />
4. This could be applied to news apps to improve readability by highlighting important sections.<br />
<br />
5. This summary describes an interesting and useful model which can save readers time when reading an article. It would be interesting to discuss training a Skim-RNN model to highlight the important sections of very long textbooks. As a student, having highlights in a textbook is really helpful for studying, but highlighting the important parts is time-consuming work for the author; Skim-RNN might provide a nice model for this job. <br />
<br />
6. Beyond the good training performance of Skim-RNN, it is good to see that the algorithm performs well even when trained on a CPU alone. This would make it possible to run the model on lightweight platforms.<br />
<br />
== Applications ==<br />
<br />
Recurrent architectures are used in many other applications, such as for processing video. Real-time video processing is an exceedingly demanding and resource-constrained task, particularly in edge settings. It would be interesting to see if this method could be applied to those cases for more efficient inference, such as on drones or self-driving cars.<br />
<br />
== References ==<br />
<br />
[1] Patricia Anderson Carpenter Marcel Adam Just. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.<br />
<br />
[8] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. In ACL, 2017.<br />
<br />
[9] Eunsol Choi, Daniel Hewlett, Alexandre Lacoste, Illia Polosukhin, Jakob Uszkoreit, and Jonathan Berant. Coarse-to-fine question answering for long documents. In ACL, 2017.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=42632F21-STAT 441/841 CM 763-Proposal2020-10-08T02:04:40Z<p>Y2587wan: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman,Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will use the Prudential Life Insurance data set that the authors used and have shared with us. We will pre-process the data to replace missing values, perform feature selection using CFS and feature reduction using PCA, and use the processed data for classification via four algorithms: Neural Networks, Random Tree, REPTree, and Multiple Linear Regression. We will compare the performance of these algorithms using MAE and RMSE metrics and produce visualizations that explain the results easily, even to a non-quantitative audience. <br />
<br />
Our goal behind this project is to learn applying the algorithms that we learned in our class to an industry dataset and come up with results that we can aid better, data-driven decision making.<br />
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which given a set of objects on the road (pedestrians, other cars, etc), predict the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on Arxiv. The goal of this paper is to combine CNN and RNN architectures in a way that more flexibly combines the output of both architectures other than simple concatenation through the use of a “neural tensor layer” for the purpose of improving at the task of text classification. In particular, the paper claims that their novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in pairs of 2, one pair to implement the CNN pipeline and the other pair to implement the RNN pipeline. We will be working with Tensorflow 2, Google Collab, and reproducing the paper’s experimental results with training on the same 6 publicly available datasets found in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in the Kaggle research challenge "Mechanisms of Action (MoA) Prediction". The competition is a project of the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS), who present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model in order to recognize each fraud transaction with a higher accuracy and higher speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our data science skills to build motion prediction models for self-driving vehicles. The model will predict the movement of traffic agents around the autonomous vehicle, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of these other traffic participants.<br />
<br />
----<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
----<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature; our goal is to learn and apply various algorithms taught in this course.</div>Y2587wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=42629stat441F212020-10-07T23:49:37Z<p>Y2587wan: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| || || ||<br />
|-<br />
|Week of Nov 16 || || 6|| || || ||<br />
|-<br />
|Week of Nov 16 || || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || || 8|| || || ||<br />
|-<br />
|Week of Nov 16 || || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || || 10|| || || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 13|| || || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || || 15|| || || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| || || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai, Leyan Cheng || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || ||<br />
|-<br />
|Week of Nov 30 || || 21|| || || ||<br />
|-<br />
|Week of Nov 30 || || 22|| || || ||<br />
|-<br />
|Week of Nov 30 || || 23|| || || ||<br />
|-<br />
|Week of Nov 30 || || 24|| || || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi, Will Thibault, Jan Lau, Jiwon Yang || 25|| Loss Function Search for Face Recognition || [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || || 26|| || || ||<br />
|-<br />
|Week of Nov 30 || || 27|| || || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, DanMeng Cui, ZiJie Jiang, MingKang Jiang || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classification systems based on TF-IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-</div>Y2587wan