http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Hhalim&feedformat=atomstatwiki - User contributions [US]2022-08-18T17:59:00ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&diff=48456Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms2020-11-30T16:36:36Z<p>Hhalim: /* Result */ From page 11 second paragraph</p>
<hr />
<div><br />
== Presented by ==<br />
<br />
Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee<br />
<br />
== Introduction ==<br />
<br />
This paper presents an approach to detecting heart disease from ECG signals by fine-tuning the deep learning neural network ConvNetQuake. For context, ConvNetQuake is a convolutional neural network used by Perol, Gharbi, and Denolle [4] for earthquake detection and location from a single waveform. A deep learning approach was chosen because such models can be trained using multiple GPUs and terabyte-sized datasets, which in turn produces a model that is robust against noise. The purpose of this paper is to provide detailed analyses of the contributions of the ECG leads to identifying heart disease, to show that the use of multiple channels in ConvNetQuake enhances prediction accuracy, and to show that feature engineering is not necessary for any of the training, validation, or testing processes. In this area, the combination of data fusion and machine learning techniques shows great promise for healthcare innovation, and the analyses in this paper help further this realization. The benefits of translating knowledge between deep learning and its real-world applications in health are also illustrated.<br />
<br />
== Previous Work and Motivation ==<br />
<br />
Previous works used the Physikalisch-Technische Bundesanstalt (PTB) database, which consists of ECG records, together with techniques such as CNN, SVM, K-nearest neighbors, naïve Bayes classification, and ANN. The paper observes two faults in these previous works. The first is that most papers apply feature selection to the raw ECG data before training the model. Safdarian, Dabanloo, and Attarodi [2] used various techniques such as ANN, K-nearest neighbors, and naïve Bayes; however, they extracted two features, the T-wave integral and the total integral, to aid in localizing and detecting heart disease. Sharma and Sunkaria [3] used SVM and K-nearest neighbors as their classifiers, but extracted various features using stationary wavelet transforms to decompose the ECG signal into sub-bands. The second is that papers that do not use feature selection pick ECG leads for classification arbitrarily, without rationale. For example, Liu et al. [1] used a deep CNN that takes 3 seconds of ECG signal from lead II at a time as input, and the decision to use lead II rather than the other leads was not explained. <br />
<br />
Feature selection can be time-consuming and impractical with large volumes of data. Arbitrary selection of leads offers no insight into why a lead was chosen or into the contribution of each lead to the identification of heart disease. This paper addresses both issues by implementing a deep learning model that does not rely on feature selection of ECG data and by quantifying the contribution of each ECG and Frank lead to identifying heart disease.<br />
<br />
== Model Architecture ==<br />
<br />
The dataset, which was used to train, validate, and test the neural network models, consists of 549 ECG records taken from 290 unique patients. Each ECG record has a mean length of over 100 seconds.<br />
<br />
This Deep Neural Network model was created by modifying the ConvNetQuake model by adding 1D batch normalization layers.<br />
<br />
During the training stage, a 10-second long two-channel input was fed into the neural network. To ensure that the two channels were weighted equally, both channels were normalized. In addition, time invariance was incorporated by selecting the 10-second long segment randomly from the entire signal. <br />
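<br />
The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the paper summary does not say which normalization was used, so zero-mean, unit-variance scaling is an assumption, and the function names and the <code>sampling_rate_hz</code> parameter are hypothetical.<br />
<br />
```python
import random
from statistics import mean, pstdev


def normalize(channel):
    """Scale one channel to zero mean and unit variance (assumed scheme)."""
    mu, sigma = mean(channel), pstdev(channel)
    if sigma == 0:
        return [0.0 for _ in channel]
    return [(x - mu) / sigma for x in channel]


def random_two_channel_input(channels, sampling_rate_hz, seconds=10, rng=random):
    """Cut the same randomly placed window out of every channel, then normalize."""
    window = int(seconds * sampling_rate_hz)
    length = min(len(c) for c in channels)
    if length < window:
        raise ValueError("record is shorter than the requested window")
    # One start index shared by all channels keeps the leads aligned in time.
    start = rng.randrange(length - window + 1)
    return [normalize(c[start:start + window]) for c in channels]
```
<br />
Drawing a new start index every time a record is visited means the network rarely sees the same window twice, which is one way to obtain the time invariance mentioned above.<br />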
<br />
The input layer is a 10-second long ECG signal. There are 8 hidden layers in this model, each of which consists of a 1D convolution layer with the ReLU activation function followed by a batch normalization layer. The output layer is a one-dimensional layer that uses the sigmoid activation function.<br />
<br />
This model is trained using batches of size 10, a learning rate of <math>10^{-4}</math>, and the Adam optimizer. The dataset is split into a training set, a validation set, and a test set with ratios 80-10-10.<br />
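<br />
A record-wise 80-10-10 split can be sketched as below. The function name, the fixed seed, and the choice to give leftover records to the test set are assumptions made for illustration; the summary does not describe how the authors implemented the split.<br />
<br />
```python
import random


def split_records(record_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle record ids and cut them into train / validation / test lists."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    # Whatever remains after the first two cuts goes to the test set.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train, val, test = split_records(range(549))
```
<br />
Because a random split changes with the seed, repeating it with several seeds gives a sense of how much the reported accuracy depends on the particular split.<br />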
<br />
During the training process, the model was trained from scratch numerous times so that the variation introduced by the random initialization of the weights could be accounted for.<br />
<br />
The following image gives a visual representation of the model.<br />
<br />
[[File:architecture.png | thumb | center | 1000px | Model Architecture (Gupta et al., 2019)]]<br />
<br />
==Result== <br />
<br />
The paper first quantifies the accuracy of single channels using 20-fold cross-validation; the channels with the highest individual accuracies were v5, v6, vx, vz, and ii. The researchers then investigated the accuracies of pairs of these top five channels, again using 20-fold cross-validation, and concluded that the best pair to feed into the neural network is lead v6 and lead vz. They then ran 100-fold cross-validation on the v6 and vz pair and compared the top 20, the top 50, and all 100 models, finding that the standard deviation is non-trivial and that a few models performed very poorly. <br />
<br />
Next, they discussed two factors affecting the evaluation of model performance: 1) the random train-val-test split may affect the performance of the model, an effect that could be reduced with access to a larger data set and is left for further discussion; and 2) the random initialization of the weights of the neural network has little effect on the performance of the model, because the average results remain high with a fixed train-val-test split. <br />
<br />
Compared with the models in 12 other papers, the model in this paper has the highest accuracy, specificity, and precision. Because records from the same patient could affect the measured accuracy, they also used a 290-fold patient-wise split, in which the pair v6 and vz again gave the highest accuracy, as it did in the record-wise split. The second-best pair was ii and vz, which also contains the vz channel. Combining the two best pairs into the channels v6, vz, and ii ultimately gave the best results over 10 trials, with an average of 97.83% in the patient-wise split. Although the patient-wise split might result in a lower accuracy estimate, the average remains very high.<br />
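<br />
One way to read the 290-fold patient-wise split is as leave-one-patient-out cross-validation: each fold holds out every record of a single patient, so no patient appears in both the training and test data. That reading is an assumption, and the sketch below (with hypothetical names and toy ids) only illustrates the idea.<br />
<br />
```python
from collections import defaultdict


def leave_one_patient_out(records):
    """Yield (held_out_patient, train_ids, test_ids) for every patient.

    `records` is a list of (record_id, patient_id) pairs.
    """
    by_patient = defaultdict(list)
    for record_id, patient_id in records:
        by_patient[patient_id].append(record_id)
    for patient_id in sorted(by_patient):
        test_ids = by_patient[patient_id]
        # All records of every other patient form the training data.
        train_ids = [r for p, rs in by_patient.items() if p != patient_id for r in rs]
        yield patient_id, train_ids, test_ids


toy_records = [("r1", "p1"), ("r2", "p1"), ("r3", "p2"), ("r4", "p3")]
folds = list(leave_one_patient_out(toy_records))
```
<br />
With the PTB database this would produce one fold per unique patient, in contrast to a record-wise split, where two records from the same patient can land on opposite sides of the split.<br />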
<br />
==Conclusion & Discussion== <br />
<br />
The paper introduced a new architecture for heart condition classification based on raw ECG signals using multiple leads. It outperformed the state-of-the-art model by a margin of 1 percent. This study finds that, out of the 15 ECG channels (12 conventional ECG leads and 3 Frank leads), channels v6, vz, and ii contain the most meaningful information for detecting myocardial infarction. Also, recent advances in machine learning can be leveraged to produce a model capable of classifying myocardial infarction with a cardiologist-level success rate. To further improve the performance of the models, access to a larger labeled data set is needed. The PTB database is small, and it is difficult to test the true robustness of the model with a relatively small test set. If a larger data set can be found to help correctly identify other heart conditions beyond myocardial infarction, the research group plans to share the deep learning models and develop an open-source, computationally efficient app that can be readily used by cardiologists.<br />
<br />
A detailed analysis of the relative importance of each of the standard 15 ECG channels indicates that deep learning can identify myocardial infarction by processing only ten seconds of raw ECG data from the v6, vz, and ii leads, and that it reaches a cardiologist-level success rate. Deep learning algorithms may be readily used as commodity software. The neural network model that was originally designed to identify earthquakes may be re-designed and tuned to identify myocardial infarction. Feature engineering of ECG data is not required to identify myocardial infarction in the PTB database. This model only required ten seconds of raw ECG data to identify this heart condition with cardiologist-level performance. Access to a larger database should be provided to deep learning researchers so they can work on detecting different types of heart conditions. Deep learning researchers and the cardiology community can work together to develop deep learning algorithms that provide trustworthy, real-time information regarding heart conditions with minimal computational resources.<br />
<br />
A Fourier transform (for example, computed with the FFT) can be helpful when dealing with ECG signals. It transforms a signal from the time domain to the frequency domain, so features hidden in the frequency content may be discovered.<br />
<br />
==Critiques==<br />
- The lack of large, labelled data sets is a common problem in most applied deep learning studies. Since the PTB database is as small as described, the robustness of the model may be hard to gauge. There are very likely other physical factors that play a role in the study and that the deep neural network may not be able to adjust for, since health data can be somewhat subjective at times and may be somewhat inaccurate, especially if machines are used for measurement. This might mean that error was propagated forward in the study.<br />
<br />
- Additionally, there is a risk of confirmation bias, which may occur when a model is self-training, especially given the fact that the training set is small.<br />
<br />
- I feel that the results of deep learning models in medical settings, where the consequences of misclassification can be severe, should be evaluated by assigning weights to each type of classification error. If a misclassification can lead to severe consequences, then the network should be trained in such a way that it errs towards safety. For example, in the case of heart disease, the consequences are very serious if the system says that there is no heart disease when in fact there is. So, the evaluation metric must be selected carefully.<br />
<br />
- This is a useful and meaningful application of machine learning. Using deep learning to detect heart disease can be very helpful when the disease is difficult to detect by looking at an ECG with human eyes. The model is also useful for statistics, such as calculating the percentage of people who get heart disease. However, doctors should not trust the model's result 100%, since it is almost impossible for a model to achieve 100% accuracy. Double-checking by human eyes is therefore necessary if the result is unusual. It would also be interesting to discuss more medical applications of this method, such as analyzing brainwave diagrams to predict a person's mood and to diagnose mental diseases.<br />
<br />
- Compared to datasets for other topics such as object recognition, the PTB database is pretty small, with only 549 ECG records. These are also highly imbalanced (Table 1), with 4 records for myocarditis and 148 for myocardial infarction. Medical datasets can only be labeled by specialists, which is why these datasets are relatively small. It would be great to have a larger, more comprehensive dataset.<br />
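<br />
The cost-weighted evaluation suggested in the critiques above could look like the sketch below. The cost values (a missed heart condition counted five times as heavily as a false alarm) and all function names are illustrative assumptions, not figures from the paper.<br />
<br />
```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 means disease present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def weighted_error(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Average misclassification cost, penalizing missed disease more heavily."""
    _, fp, _, fn = confusion_counts(y_true, y_pred)
    return (fn_cost * fn + fp_cost * fp) / len(y_true)


def sensitivity(y_true, y_pred):
    """Fraction of true disease cases that the model detects."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)
```
<br />
Reporting sensitivity alongside accuracy makes it visible when a model reaches a high overall score while still missing diseased patients.<br />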
<br />
== References ==<br />
<br />
[1] Na Liu et al. "A Simple and Effective Method for Detecting Myocardial Infarction Based on Deep Convolutional Neural Network". In: Journal of Medical Imaging and Health Informatics (Sept. 2018). doi: 10.1166/jmihi.2018.2463.<br />
<br />
[2] Naser Safdarian, N.J. Dabanloo, and Gholamreza Attarodi. "A New Pattern Recognition Method for Detection and Localization of Myocardial Infarction Using T-Wave Integral and Total Integral as Extracted Features from One Cycle of ECG Signal". In: J. Biomedical Science and Engineering (Aug. 2014). doi: http://dx.doi.org/10.4236/jbise.2014.710081.<br />
<br />
[3] L.D. Sharma and R.K. Sunkaria. "Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach." In: Signal, Image and Video Processing (July 2017). doi: https://doi.org/10.1007/s11760-017-1146-z.<br />
<br />
[4] Perol Thibaut, Gharbi Michaël, and Denolle Marin. "Convolutional neural network for earthquake detection and location". In: Science Advances (Feb. 2018). doi: 10.1126/sciadv.1700578</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_Voice&diff=48448Speech2Face: Learning the Face Behind a Voice2020-11-30T15:50:10Z<p>Hhalim: /* Discussion and Critiques */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face. This architecture utilizes millions of Internet/YouTube videos of people speaking to learn the correlation between a voice and the respective face. The model learns these correlations through a self-supervised procedure, allowing it to produce facial reconstruction images that capture specific physical attributes, such as a person's age, gender, or ethnicity. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. The evaluation numerically quantifies how closely the reconstructions produced by the Speech2Face model resemble the true face images of the respective speakers.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of settings, including cross-modal retrieval, sound source localization, and sound source separation. This also paved the way for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, learning from cross-modal signals extracted from faces and voices has been posed as a binary or multi-task classification task, with some promising results. Studies have been able to identify active speakers in a video, to predict lip motion from speech, and even to learn the emotion of the agents based on their voices.<br />
<br />
Recently, various methods have been suggested that use audio signals to reconstruct visual information, where the reconstructed subject is typically specified a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. This paper instead aims to recover the dominant and generic facial structure from speech.<br />
<br />
== Motivation ==<br />
It seems to be a common human trait to imagine what people look like when we hear their voices before we have seen what they look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciation, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Concretely, from an input audio segment of a person speaking, the method reconstructs an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors were more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg]]<br />
<br />
Figure 1. '''Speech2Face model and training pipeline''' <br />
<br />
<br />
<br />
The Speech2Face model used to achieve the desired result consists of two parts: a voice encoder, which takes a spectrogram of speech as input and outputs low-dimensional face features, and a face decoder, which takes face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The face decoder itself was taken from previous work by Cole et al. [3] and will not be explored in great detail here; in essence, the FaceNet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image and through a multilayer perceptron to determine the landmark locations. The two results are combined to form an image. This model was trained using VGG-Face features as input. It was trained separately and remained fixed during the voice encoder training. The variability in facial expressions, head positions, and lighting conditions of the face images creates a challenge for both the design and training of the Speech2Face model. To avoid this problem, the model is trained to first regress to a low-dimensional intermediate representation of the face. The VGG-Face model, a face recognition model that is pretrained on a large-scale face database [5], is used to extract a 4096-D face feature from the penultimate layer of the network. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG]]<br />
<br />
Table 1: '''Voice encoder architecture'''<br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
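<br />
The two pooling choices described above can be illustrated with a small sketch. It assumes a feature map stored as <code>feature_map[frequency][time]</code>; the function names are hypothetical and this is not the authors' implementation.<br />
<br />
```python
def temporal_max_pool(feature_map, size=2):
    """Max-pool along the time axis only; the frequency axis is left untouched."""
    return [
        [max(row[i:i + size]) for i in range(0, len(row) - size + 1, size)]
        for row in feature_map
    ]


def temporal_average_pool(feature_map):
    """Average over the whole time axis, giving one value per frequency row."""
    return [sum(row) / len(row) for row in feature_map]


short_clip = [[1, 2, 3, 4], [0, 0, 2, 2]]              # 2 frequency rows, 4 time steps
long_clip = [[1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1]]   # same rows, 6 time steps
```
<br />
Both clips reduce to a vector with one entry per frequency row, which is why inputs of different durations can feed the same fully connected layers.<br />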
<br />
'''Training'''<br />
<br />
The AVSpeech dataset, a large-scale audio-visual dataset, is used for training. It comprises millions of video segments from YouTube featuring over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixel image of the face was passed through the VGG-Face model to compute a facial feature vector. Combined with a spectrogram of the audio, a training set of 1.7 million entries and a test set of 0.15 million entries were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then passed to the VGG-Face model to extract <math>v_f</math>, the 4096-dimensional facial feature vector for that frame. This provides the supervision signal for the voice encoder. The feature <math>v_s</math>, the 4096-dimensional facial feature vector output by the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in practice it leads to unstable training and long training times. Figure 2, below, compares the faces predicted with <math>||v_f - v_s||_1</math> and with the following loss. Based on the work of Castrejon et al. [4], a loss function is used which additionally penalizes differences in the activations of the last layer of the face encoder <math>f_{VGG}</math> and of the first layer of the face decoder <math>f_{dec}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)||_1 + \lambda_1\left|\left|\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}\right|\right|^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes both the squared Euclidean distance between the two normalized facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_i p_{(i)}(a)\log p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\exp(a_i/T)}{\sum_j \exp(a_j/T)}$$ Knowledge distillation is used as an alternative to cross-entropy. By recommendation of Cole et al. [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that the magnitudes of the gradients of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
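<br />
Two of the terms in the loss above can be sketched in plain Python. The sketch reads the distillation term as a cross-entropy between temperature-softened softmax distributions of its two arguments; the function names are hypothetical, and the networks <math>f_{dec}</math> and <math>f_{VGG}</math> are not modelled here.<br />
<br />
```python
import math


def softmax_with_temperature(logits, T=2.0):
    """Softened softmax p_i = exp(a_i / T) / sum_j exp(a_j / T)."""
    m = max(logits)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp((x - m) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(a, b, T=2.0):
    """Cross-entropy between the softened distributions of a and b."""
    p_a = softmax_with_temperature(a, T)
    p_b = softmax_with_temperature(b, T)
    return -sum(pa * math.log(pb) for pa, pb in zip(p_a, p_b))


def normalized_l2_term(v_f, v_s):
    """Squared Euclidean distance between the unit-normalized feature vectors."""
    norm_f = math.sqrt(sum(x * x for x in v_f))
    norm_s = math.sqrt(sum(x * x for x in v_s))
    return sum((x / norm_f - y / norm_s) ** 2 for x, y in zip(v_f, v_s))
```
<br />
For a fixed first argument, the distillation term is smallest when both arguments produce the same softened distribution, and the normalized term ignores the overall scale of the two vectors.<br />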
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
Figure 2: '''Qualitative results on the AVSpeech test set'''<br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
Figure 3. '''Facial attribute evaluation''' <br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
Table 2. '''Feature similarity'''<br />
<br />
<br />
<br />
Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distances between the facial feature vector produced by the model and the true facial feature vector extracted from the face image were computed and are presented, above, in Table 2. A comparison of facial similarity was also done based on the length of the audio input. From the table, it is evident that the 6-second audio produced lower cosine, L1, and L2 distances, resulting in a facial feature vector that is closer to the ground truth. <br />
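<br />
The three distances can be written out as follows. This is a generic sketch: the summary does not define the cosine distance used in Table 2, so taking it as one minus the cosine similarity is an assumption.<br />
<br />
```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of u and v (assumed definition)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


def l2_distance(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```
<br />
For all three measures, a smaller value means that the predicted feature vector lies closer to the feature vector of the true face.<br />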
<br />
'''S2F -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
Table 3. '''S2F -> Face retrieval performance'''<br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could retrieve the original image. The R@K metric, also known as retrieval performance by recall at K, is used: the K images closest in distance to the output of the model are found, and the R@K score is the chance that the original image is among those K images. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
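<br />
The R@K computation described above can be sketched as follows. Euclidean distance and the toy two-dimensional features are assumptions made for illustration; the paper's retrieval operates on face features of a much larger gallery.<br />
<br />
```python
import math


def l2(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def recall_at_k(queries, gallery, true_index, k, distance=l2):
    """Fraction of queries whose true gallery item is among the k nearest items."""
    hits = 0
    for query, truth in zip(queries, true_index):
        ranked = sorted(range(len(gallery)), key=lambda i: distance(query, gallery[i]))
        if truth in ranked[:k]:
            hits += 1
    return hits / len(queries)


gallery = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]   # features of the true face images
queries = [[0.1, 0.0], [0.9, 0.2], [0.6, 0.0]]   # features predicted from speech
true_index = [0, 1, 0]                           # which gallery item each query belongs to
```
<br />
In this toy example the third query is slightly closer to the wrong gallery item, so it counts as a miss at K = 1 and as a hit at K = 2.<br />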
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. <br />
<br />
2. The model was shown to infer different face features based on language. This calls into question how heavily the model depends on the spoken language. The paper mentioned that the quality of face reconstruction may be affected by uncommon languages, since English is the most popular language on YouTube (the source of the training set). Testing on a more controlled sample, where all speech recordings are in the same language, may help determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since the authors compare age, gender, and ethnicity by running the Face++ classifiers on the original images and on the reconstructions, their model can only be judged to be as good as the classifier used to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the misclassification rate. <br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it does not show statistics about the speakers' ethnicity or the language of the video. If the model were trained with a more comprehensive dataset that includes enough Asian/Indian English speakers and native-language speakers, would this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the YouTube videos, is that the resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are lost.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt that this model will be practical for real-world problems, because many factors affect how a person's voice sounds in a real-world environment. Background sounds such as phones, clocks, TVs, and car horns will decrease the accuracy of the model's predictions.<br />
<br />
7. A lot of information can be obtained from someone's voice, which can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common, and images of potential suspects could be reconstructed based on their voices. For this to be achieved, the model has to be thoroughly trained and tested to avoid false positives, as these could have highly destructive outcomes for a falsely convicted suspect.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision (ICCV), 2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giro-i-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNN&diff=48444Music Recommender System Based using CRNN2020-11-30T15:31:57Z<p>Hhalim: /* Data Screening: */</p>
<hr />
<div>==Introduction and Objective:==<br />
<br />
In the digital era of music streaming, companies such as Spotify and Pandora are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data?<br />
<br />
The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.<br />
<br />
This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.<br />
<br />
The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.<br />
<br />
The following two-method approach to building the recommendation system was followed:<br />
#Make recommendations including genre information extracted from classification algorithms.<br />
#Make recommendations without genre information.<br />
<br />
The authors used a convolutional recurrent neural network (CRNN), which is a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN), as their main classification model.<br />
<br />
==Methods and Techniques:==<br />
Generally, a music recommender can be divided into three main parts: (i) users, (ii) items, and (iii) user-item matching algorithms. First, users' music tastes are generated based on their profiles. Second, item profiles, which include editorial, cultural, and acoustic metadata, are collected to support listeners' satisfaction. Finally, the matching algorithm suggests personalized music recommendations to listeners. <br />
<br />
To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models. <br />
=== Mel Scale: === <br />
A perceptual scale of pitches on which equal increments are heard by listeners as equal steps in pitch.<br />
<br />
[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]<br />
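<br />
The mapping between Hz and mels can be sketched in a few lines. The source does not state which Mel formula the authors used, so the constants 2595 and 700 below are an assumption (one common convention), and the function names are hypothetical.<br />
<br />
```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Map a frequency in Hz to mels (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse mapping from mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(f_min: float, f_max: float, n_points: int) -> list:
    """Return frequencies in Hz that are equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_points - 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_points)]


# Equal mel steps correspond to increasingly wide steps in Hz.
centres = mel_spaced_frequencies(0.0, 8000.0, 5)
```
<br />
Under this convention, points that are evenly spaced in mels are packed densely at low frequencies and sparsely at high frequencies, which mirrors how listeners perceive pitch differences.<br />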
<br />
=== Short Time Fourier Transform (STFT): ===<br />
A transformation that determines the sinusoidal frequency content of local sections of the audio signal, using a Hanning (Hann) window as the smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math><br />
<br />
where: <math>w(\tau)</math> is the Hanning smoothing function<br />
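<br />
A discrete analogue of the integral above can be sketched as follows: the signal is cut into overlapping frames, each frame is multiplied by a Hanning window, and a discrete Fourier transform is taken per frame. The frame length, hop size, and sample rate are illustrative assumptions, not values from the paper, and the naive DFT is written out only for clarity (a practical implementation would use an FFT).<br />
<br />
```python
import cmath
import math


def hann(n: int) -> list:
    """Symmetric Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def stft(x: list, frame_len: int = 64, hop: int = 32) -> list:
    """Return a list of frames, each a list of complex DFT bins 0..frame_len/2."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        # Windowed section of the signal, i.e. x(t) * w(t - tau).
        section = [x[start + k] * window[k] for k in range(frame_len)]
        # Naive DFT of the windowed section.
        frames.append([
            sum(section[k] * cmath.exp(-2j * math.pi * b * k / frame_len)
                for k in range(frame_len))
            for b in range(n_bins)
        ])
    return frames


# Hypothetical test signal: an 80 Hz tone sampled at 640 Hz (10 Hz per bin).
sample_rate = 640
tone = [math.sin(2.0 * math.pi * 80.0 * n / sample_rate) for n in range(256)]
spectrogram = stft(tone)
```
<br />
Taking the magnitude of each bin gives the spectrogram image; for the tone above, the energy concentrates in the bin corresponding to 80 Hz in every frame.<br />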
<br />
=== Convolutional Neural Network (CNN): ===<br />
A neural network that uses convolution in place of matrix multiplication for some layer calculations. During training, the weights are updated so that the network picks out the parts of the input most relevant to classification. The convolutional layers slide kernels over small groups of data to find local patterns that help identify features in the overall data. These features are then used for classification. Padding is also used to preserve the data on the edges. The image on the left represents the mathematical expression of a convolution operation, while the right image demonstrates an application of a kernel on the data.<br />
<br />
[[File:Convolution.png|thumb|400px|left|Convolution Operation]]<br />
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]<br />
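<br />
The kernel-and-padding operation described above can be illustrated with a minimal sketch. It assumes an odd-sized kernel and zero padding so that the output has the same size as the input; as in most deep learning libraries, the kernel is not flipped (strictly speaking, this is cross-correlation). The function name is hypothetical.<br />
<br />
```python
def conv2d_same(image: list, kernel: list) -> list:
    """Slide an odd-sized kernel over a zero-padded image; output matches input size."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    pad_r, pad_c = k_rows // 2, k_cols // 2
    rows, cols = len(image), len(image[0])

    # Surround the image with a border of zeros so edge pixels are kept.
    padded = [[0.0] * (cols + 2 * pad_c) for _ in range(rows + 2 * pad_r)]
    for r in range(rows):
        for c in range(cols):
            padded[r + pad_r][c + pad_c] = image[r][c]

    # Each output value is the weighted sum of the patch under the kernel.
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = sum(
                padded[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
    return out


ones_image = [[1.0, 1.0, 1.0] for _ in range(3)]
ones_kernel = [[1.0, 1.0, 1.0] for _ in range(3)]
summed = conv2d_same(ones_image, ones_kernel)
```
<br />
With an all-ones image and kernel, corner outputs are smaller than the centre output because part of the kernel falls on the zero padding.<br />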
<br />
=== Convolutional Recurrent Neural Network (CRNN): === <br />
A neural network similar to the CNN, with the addition of a Gated Recurrent Unit (GRU), which is a type of Recurrent Neural Network (RNN). An RNN handles sequential data by feeding the hidden state from previous time steps back into the network to update the output. The gates of the GRU allow it to retain longer-term memory, which also helps when training the early hidden layers.<br />
<br />
[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]<br />
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]<br />
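<br />
A single GRU update can be written out as a minimal sketch. The equations below follow one standard formulation; the parameter names (<code>Wz</code>, <code>Uz</code>, and so on) are hypothetical, and the sign convention for the update gate varies between references, so this should not be read as the exact layer used in the paper.<br />
<br />
```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix: list, vector: list) -> list:
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gru_step(x: list, h: list, p: dict) -> list:
    """One GRU time step: returns the new hidden state given input x and state h."""
    def affine(w_key, u_key, b_key, state):
        wx, uh = matvec(p[w_key], x), matvec(p[u_key], state)
        return [a + b + c for a, b, c in zip(wx, uh, p[b_key])]

    z = [sigmoid(v) for v in affine("Wz", "Uz", "bz", h)]      # update gate
    r = [sigmoid(v) for v in affine("Wr", "Ur", "br", h)]      # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = [math.tanh(v) for v in affine("Wh", "Uh", "bh", reset_h)]
    # Blend the old state and the candidate state using the update gate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]


def zero_params(n_in: int, n_hidden: int) -> dict:
    """All-zero parameters, used only to make the example easy to check by hand."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = [[0.0] * n_in for _ in range(n_hidden)]
        params["U" + gate] = [[0.0] * n_hidden for _ in range(n_hidden)]
        params["b" + gate] = [0.0] * n_hidden
    return params


new_state = gru_step([1.0, 2.0, 3.0], [2.0, -4.0], zero_params(3, 2))
```
<br />
With all-zero parameters both gates equal 0.5 and the candidate state is zero, so the hidden state is simply halved; trained weights let the gates decide how much of the past to keep at each step.<br />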
<br />
==Data Screening:==<br />
<br />
The authors of this paper used a publicly available music dataset made up of 25,000 30-second song clips from the Free Music Archive, covering 16 different genres. The data was cleaned by removing songs with low audio quality, songs with wrongly labelled genres, and songs with multiple genres. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model. <br />
<br />
[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]<br />
<br />
==Implementation:==<br />
<br />
=== Modeling Neural Networks ===<br />
<br />
As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy. <br />
<br />
In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use Adam optimization to update weights in the training phase. Binary cross-entropy was used as the loss function. <br />
<br />
Both the CNN and CRNN models were trained for 100 epochs, with a sigmoid activation function in the output layer. <br />
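<br />
The data preparation and loss described above can be sketched as follows. The 8:1:1 ratio, one-hot labels, and binary cross-entropy come from the text; the function names, the random seed, and the use of song indices as stand-ins for Mel-spectrograms are assumptions for illustration only.<br />
<br />
```python
import math
import random

GENRES = ["classical", "electronic", "folk", "hip-hop",
          "instrumental", "jazz", "rock"]


def split_8_1_1(items: list, seed: int = 0) -> tuple:
    """Shuffle and split into training, validation, and test sets at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def one_hot(label: str, classes: list = GENRES) -> list:
    """Encode a genre label as a 0/1 vector with a single 1."""
    return [1.0 if c == label else 0.0 for c in classes]


def binary_cross_entropy(y_true: list, y_prob: list, eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over the sigmoid outputs of one example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(y_true)


# 7 genres x 1000 songs, represented here only by their indices.
train, val, test = split_8_1_1(list(range(7000)))
```
<br />
Each sigmoid output is scored independently against its 0/1 target, which is why a binary loss can be applied to one-hot genre labels.<br />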
<br />
<br />
An overview of the CNN and CRNN architecture can be found in the charts below.<br />
<br />
[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]<br />
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]<br />
<br />
=== Music Recommendation System ===<br />
<br />
Recommendations are computed from the cosine similarity of the features extracted by the neural network. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. These feature variables are used in the cosine similarity to find the best recommendations. <br />
<br />
Cosine similarity values lie in [-1,1], where larger values indicate songs with more similar features. When the user inputs five songs, those songs become the new inputs to the neural networks, and their extracted features are compared with those of other music using cosine similarity. The songs with the five largest cosine similarities are used as recommendations.<br />
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]<br />
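<br />
The ranking step can be sketched as below. The source does not specify how the five input songs are combined, so this sketch ranks a catalogue against a single query feature vector; the catalogue, song names, and feature values are hypothetical.<br />
<br />
```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two feature vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def recommend(query: list, catalogue: dict, k: int = 5) -> list:
    """Return the k catalogue songs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, features))
              for name, features in catalogue.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical two-dimensional feature vectors.
catalogue = {
    "song_a": [1.0, 0.0],
    "song_b": [0.9, 0.1],
    "song_c": [0.0, 1.0],
    "song_d": [-1.0, 0.0],
}
top_two = recommend([1.0, 0.0], catalogue, k=2)
```
<br />
Because cosine similarity depends only on direction, two songs whose feature vectors differ in overall magnitude but not in shape are treated as identical.<br />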
<br />
== Evaluation Metrics ==<br />
=== Precision: ===<br />
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)<br />
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre<br />
<br />
=== Recall: ===<br />
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)<br />
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre<br />
<br />
=== F1-Score: ===<br />
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:<br />
<br />
[[File:F1441.png|frame|100px|none|F1-Score]]<br />
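<br />
The three metrics above can be computed for a single genre as in the following sketch. The labels are made up for illustration and the function name is hypothetical.<br />
<br />
```python
def precision_recall_f1(y_true: list, y_pred: list, positive: str) -> tuple:
    """Precision, recall, and F1-score for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2.0 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 4 Classical songs and 4 Rock songs.
actual = ["classical"] * 4 + ["rock"] * 4
predicted = ["classical", "classical", "rock", "rock",
             "classical", "rock", "rock", "rock"]
precision, recall, f1 = precision_recall_f1(actual, predicted, "classical")
```
<br />
Here two of the three songs predicted as Classical are correct, and two of the four actual Classical songs are found, so precision and recall differ even though they share the same count of true positives.<br />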
<br />
=== Receiver operating characteristics (ROC): ===<br />
* A graphical metric that is used to assess a classification model at different classification thresholds <br />
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k<br />
* Plots the true positive rate versus false positive rate as the classification threshold is varied<br />
<br />
[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]<br />
<br />
=== Area Under the Curve (AUC) ===<br />
AUC is the area under the ROC curve; it provides an aggregate measure of performance across all possible classification thresholds.<br />
<br />
In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.<br />
<br />
[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]<br />
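<br />
The ranking interpretation of AUC can be checked directly with a small sketch: compare every (positive, negative) pair of scores and count how often the positive is ranked higher. The scores are made up, and counting ties as one half is a common convention assumed here.<br />
<br />
```python
def auc_by_ranking(pos_scores: list, neg_scores: list) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores Prob(Classical | X = x).
classical_scores = [0.9, 0.8, 0.4]
other_scores = [0.7, 0.3]
auc = auc_by_ranking(classical_scores, other_scores)
```
<br />
In this example five of the six pairs are ordered correctly; a value of 0.5 would correspond to random ranking and 1.0 to perfect separation.<br />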
<br />
== Results ==<br />
=== Accuracy Metrics ===<br />
The table below shows the accuracy metrics at a classification threshold of 0.5.<br />
<br />
[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]<br />
On average, CRNN outperforms CNN in true positive and false positive cases.<br />
<br />
<br />
[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]<br />
On average, CRNN outperforms CNN in F1-score. <br />
<br />
<br />
[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]<br />
On average, CRNN also outperforms CNN in AUC metric.<br />
<br />
<br />
CRNN models, which consider both the frequency features and the time sequence patterns of songs, achieve better classification performance than the CNN classifier on metrics such as F1-score and AUC.<br />
<br />
=== Evaluation of Music Recommendation System: ===<br />
<br />
* A listening experiment was performed with 30 participants to assess user responses to given music recommendations.<br />
* Participants chose 5 pieces of music they enjoyed, and the recommender system generated 5 new recommendations. The participants then evaluated the recommendations by recording whether each song was liked or disliked.<br />
* The recommendation system takes two approaches to the recommendation:<br />
** Method one uses only the value of cosine similarity.<br />
** Method two uses the value of cosine similarity and information on music genre.<br />
* A test of the significance of the difference in average user likes between the two methods was performed using a t-statistic:<br />
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]<br />
<br />
Comparing the two methods under <math> H_0: \mu_1 - \mu_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math>, so we reject <math>H_0</math>. This demonstrates that the increase in average user likes with the addition of music genre information is statistically significant.<br />
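<br />
A t-statistic of this kind can be sketched as follows. The source does not say whether a paired or independent-samples test was used, so the pooled two-sample form below is an assumption, and the like counts are hypothetical values, not the study's data.<br />
<br />
```python
import math
import statistics


def pooled_t_statistic(sample_1: list, sample_2: list) -> float:
    """Two-sample t-statistic for H0: mu_1 - mu_2 = 0, assuming equal variances."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    # Pooled estimate of the common variance.
    pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean_1 - mean_2) / std_error


# Hypothetical likes (out of 5 recommendations) per participant.
likes_method_1 = [1, 2, 3]   # cosine similarity only
likes_method_2 = [2, 3, 4]   # cosine similarity plus genre information
t_stat = pooled_t_statistic(likes_method_1, likes_method_2)
```
<br />
A negative statistic means the second method received more likes on average; it is then compared with the critical value for the chosen significance level and degrees of freedom.<br />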
<br />
== Conclusion: ==<br />
<br />
Here are two main conclusions obtained from this paper:<br />
<br />
- To increase the predictive capabilities of the music recommendation system, the music genre should be a key feature to analyze.<br />
<br />
- To extract the song genre from a song’s audio signals and get overall better performance, CRNNs are superior to CNNs as they consider both frequency features and time sequence patterns of audio signals. <br />
<br />
Based on the analyses in the paper, the authors also suggested adding other music features, such as a tempogram for capturing local tempo, to improve the accuracy of the recommender system.<br />
<br />
== Critiques/ Insights: ==<br />
# The authors do not reference the performance of recommendation algorithms currently used in industry; they should benchmark their novel approach against other recommendation algorithms, such as collaborative filtering, to see whether there is a lift in predictive capabilities.<br />
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities. <br />
# They could go into more detail about why the CRNN performs better than the CNN, in terms of the attributes of each network.<br />
# The methodology introduced in this paper is probably also suitable for movie recommendations, since music is presented as spectrograms (images) in a time sequence, which is very similar to a movie. <br />
# The evaluation approach is very interesting, since it is usually not easy to evaluate test results when they are subjective. Listing the performance under all of these evaluations makes the result more comprehensive. A practice that might reduce bias is to come back to the participants after a couple of days and ask whether they liked the music that was recommended. Oftentimes music "grows" on people, and their opinion of a new song may change after some time has passed. <br />
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.<br />
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.<br />
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.<br />
# According to the authors, the recommender system works by calculating the cosine similarity between the features extracted from one piece of music and another. Is it possible to use Euclidean distance or other p-norm distances instead?<br />
# In real-life applications, most music software can recommend music to the listener and ask whether they liked the recommendation. It would be a nice extension to incorporate this new information from the listener.<br />
# This paper is very similar to another [https://link.springer.com/chapter/10.1007/978-3-319-46131-1_29 paper], written by Bruce Ferwerda and Markus Schedl. Both papers suggest methods of building music recommendation systems. However, this paper recommends music based on genre, whereas the paper by Ferwerda and Schedl suggests personality-based user modeling for music recommender systems.<br />
# Actual music listeners do not listen to only one genre of music, and in fact listening to the same track or the same genre exclusively would be somewhat unusual. Could this method be used to make recommendations based not on genre but on other categories (such as the theme of the lyrics, the pitch of the singer, or the date published)? Would this model be able to differentiate between tracks of varying "lyric vocabulary difficulty"? Or would NLP algorithms be needed to consider lyrics?<br />
# This model can be applied to many other fields, such as recommending news in a news app, products to buy on Amazon, or videos to watch on YouTube, based on user information.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=47994F21-STAT 441/841 CM 763-Proposal2020-11-30T00:31:27Z<p>Hhalim: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman, Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will be using the Prudential Life Insurance Data Set that the authors used and have shared with us. We will pre-process the data to replace missing values, apply feature selection using CFS and feature reduction using PCA, and use the processed data to perform classification via four algorithms: Neural Networks, Random Tree, REPTree and Multiple Linear Regression. We will compare the performance of these algorithms using MAE and RMSE metrics and come up with visualizations that can explain the results easily even to a non-quantitative audience. <br />
<br />
Our goal behind this project is to learn to apply the algorithms that we learned in class to an industry dataset and come up with results that can aid better, data-driven decision making.<br />
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which, given a set of objects on the road (pedestrians, other cars, etc.), predicts the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on arXiv. The goal of this paper is to combine CNN and RNN architectures more flexibly than simple concatenation, through the use of a “neural tensor layer”, in order to improve performance on text classification. In particular, the paper claims that their novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in pairs, one pair to implement the CNN pipeline and the other pair to implement the RNN pipeline. We will be working with TensorFlow 2 and Google Colab, and reproducing the paper’s experimental results by training on the same 6 publicly available datasets used in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in a Kaggle research challenge, "Mechanisms of Action (MoA) Prediction". This competition is presented by a project within the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Funds Library of Integrated Network-Based Cellular Signatures (LINCS), with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that identifies fraudulent transactions with higher accuracy and speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature, and our goal is to learn various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plates data. First, we plan to obtain data from 4 to 6 participants performing squats or squats with weights and rate them on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. When the participants are squatting, we will ask them about their fatigue level, and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being cannot continue anymore). Once data is collected, we will classify the motion capture and force plates data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities; that is, we cannot tell which galaxy in a group of possible candidates is actually responsible for the submillimetre emission. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on other available data. With this newly labelled dataset from ALMA, it is possible to develop and test new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), which applied a set of standard classification algorithms to the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centred perspective. Next, we hope to extend their work in one or more of the following directions: (1) Incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate. (2) Taking measurement error into account in the classification algorithms, possibly through a Bayesian approach. (All features in astronomy datasets come from actual physical measurements, which come with an error bar. However, it is not clear how to incorporate this error into the classification task.) (3) Combining some traditional astronomy approaches with algorithms from ML.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States is one of the most frequently traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that, using variables such as transaction date, security type, and transaction amount, we could predict the role code for a new transaction. The reason for choosing this prediction is that the role of the insider gives investors signals of potential internal activities and private information. It is crucial for investors to detect important market signals from insider trading activities so that they can benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to help to develop an algorithm to predict a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our execution plan is to apply concepts and algorithms learned in STAT441 and apply multi-label classification. Through the process, our team will learn biological knowledge necessary to complete and enhance our classification thought-process. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition that revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions produced by gas turbine engines. The implementation of this system is reliant on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hopes of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we'll use a number of machine learning algorithms to discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in this paper in order to measure the effectiveness of our problem solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge, Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies each whale's class based on its tail. There are over 3000 classes and 25361 training images in total. The challenge is that, on average, there are only 8 training images per class. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
Our team's final project is the ongoing Kaggle competition Mechanisms of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA prediction helps scientists find more targeted drug molecules based on the biological mechanism of a disease, which could substantially shorten the drug development cycle. Here, MoA is studied by applying different drugs to human cells and analyzing the corresponding reactions; the dataset provides simultaneous measurements of 100 types of human cells and 5000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, SVM, etc., and find the one that best completes the task. Depending on how we perform, we might use other techniques such as model ensembling or stacking.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to the description of the biological activity of a given molecule, and scientists have a specific interest in the MoA of molecules as it pertains to the advancement of drugs. This is because, under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoA.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.<br />
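<br />
As a rough illustration of the logarithmic loss mentioned above, the sketch below averages the binary log loss over every sample and every label. The function name, the clipping constant <code>eps</code>, and the averaging over all labels are assumptions made for illustration; the competition page should be consulted for the exact scoring rule.<br />

```python
import math


def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over all samples and all labels.

    y_true: list of rows of 0/1 labels (one column per MoA label).
    y_pred: list of rows of predicted probabilities, same shape.
    eps:    clipping constant (an assumed value) that avoids log(0).
    """
    total = 0.0
    count = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count
```

A model that predicts 0.5 for every label scores log(2) under this definition, which gives a simple baseline to compare candidate models against.<br />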
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
Budaraju, Sai Arvind<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
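<br />
One possible encoding of aligned sequences as vectors, sketched below, is a one-hot encoding of each alignment column followed by column centering, which is the usual preprocessing step before PCA. The alphabet <code>ACGT-</code>, the function names, and the choice to map unknown symbols to all zeros are assumptions for illustration, not the encoding the group has settled on.<br />

```python
ALPHABET = "ACGT-"  # assumed alphabet: four bases plus the alignment gap


def one_hot_encode(aligned_seq, alphabet=ALPHABET):
    """Map an aligned sequence to a flat 0/1 vector.

    Each position contributes len(alphabet) entries; symbols outside the
    alphabet (for example ambiguous bases) become all zeros.
    """
    vec = []
    for ch in aligned_seq.upper():
        vec.extend(1.0 if ch == sym else 0.0 for sym in alphabet)
    return vec


def center_columns(rows):
    """Subtract the column mean from every entry, as PCA expects."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


# Usage: encode a tiny toy alignment and center it.
encoded = [one_hot_encode(s) for s in ["ACG-", "ACGT", "AAGT"]]
centered = center_columns(encoded)
```

Because all sequences in an MSA have the same length, every encoded vector has the same dimension, so the centered matrix can be passed directly to a PCA routine.<br />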
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model and use area under ROC curve to evaluate our model.<br />
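<br />
For reference, the area under the ROC curve can be computed directly from its pairwise interpretation: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses this definition; the function name is hypothetical and the quadratic loop is only suitable for small examples.<br />

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: list of 0/1 ground-truth values.
    scores: list of predicted scores, higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes the metric insensitive to the choice of classification threshold.<br />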
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for the self-driving car by using our machine learning knowledge together with the provided training and testing data sets. The motion prediction model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not yet sure whether we have to classify the agents into three categories (cars, cyclists, pedestrians) ourselves. If so, we will start with the single-shot detector algorithm and improve upon it.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing result in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
As our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu to reduce the iteration complexity of power iteration. By doing so, we aim to achieve a final rate of 𝒪(1/sqrt(Δ)) in our reproduction. If there is additional time, we also hope to explore and discuss the potential of applying such an acceleration method to other non-convex optimization problems, as mentioned in the original paper. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
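<br />
To make the idea of accelerating power iteration concrete, the sketch below adds a momentum term to the deterministic power method, using the update w_next = A w - beta * w_prev. This is only a sketch under stated assumptions: the deterministic setting, the fixed starting vector, the function names, and the value of <code>beta</code> are chosen for illustration, and the stochastic variants studied in the paper are not shown. Setting <code>beta=0</code> recovers plain power iteration.<br />

```python
import math


def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def power_iteration_momentum(A, beta=0.0, iters=200):
    """Estimate the top eigenpair of a symmetric matrix A.

    beta is the momentum coefficient (hypothetical default of 0.0,
    which gives the classical power method).
    """
    n = len(A)
    w_prev = [0.0] * n
    w = [1.0 / math.sqrt(n)] * n  # assumed start; must not be orthogonal to the top eigenvector
    for _ in range(iters):
        Aw = matvec(A, w)
        w_next = [a - beta * p for a, p in zip(Aw, w_prev)]
        scale = norm(w_next)
        # Rescale both iterates by the same factor so the recurrence is preserved.
        w_prev = [x / scale for x in w]
        w = [x / scale for x in w_next]
    eigval = sum(x * y for x, y in zip(w, matvec(A, w)))  # Rayleigh quotient, w has unit norm
    return eigval, w


# Usage: top eigenvalue of a small diagonal matrix.
value, vector = power_iteration_momentum([[2.0, 0.0], [0.0, 1.0]], beta=0.25)
```

Both iterates are divided by the same scale at each step because the momentum recurrence is linear, so a common rescaling changes the length of the iterate but not its direction.<br />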
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to be involved in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for the self-driving car by machine learning with the datasets they provided.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. Mechanisms of Action, MOA for short, describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response of a drug based on its similarities to other known drugs. <br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional<br />
neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.’s 2020 paper “Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss”. In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested on two ECG datasets, MIT-BIH and INCART, and returned a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate. <br />
Interestingly, focal loss was implemented to prevent bias towards larger classes (normal heartbeats) without needing to augment the smaller class data (diseased heartbeats); however, the authors did not report which of the two approaches, focal loss or data augmentation, actually performs better. Therefore, we hope to extend their work by answering this question in this project.
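<br />
As a minimal sketch of how focal loss down-weights easy examples, the function below computes the loss for a single sample as -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The default values of <code>gamma</code> and <code>alpha</code> and the function name are assumptions for illustration and are not taken from the paper being replicated.<br />

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0, eps=1e-12):
    """Focal loss for one sample.

    probs:  predicted class probabilities (should sum to 1).
    target: index of the true class.
    gamma:  focusing parameter; gamma = 0 gives plain cross-entropy.
    alpha:  class weight (assumed constant here for simplicity).
    """
    p_t = min(max(probs[target], eps), 1.0)  # clip to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Usage: a confidently correct prediction contributes far less than an uncertain one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.5, 0.5], target=0)
```

With gamma greater than zero, well-classified samples (typically the abundant normal heartbeats) contribute little to the total, so the gradient is dominated by harder, rarer classes.<br />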
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to study single-hidden-layer neural networks in more depth, building on what we have learned in class. We will focus on data and weights that follow the Gaussian distribution, so that we can provide an expression for the spectrum in the limit of infinite width. We believe that knowledge of the spectrum can improve the efficiency of first-order optimization methods. <br />
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will take part in the "Riiid! Answer Correctness Prediction" Kaggle competition. We will predict students' performance on a particular question based on their historic performance, the performance of other students on this question, and information about the question itself (such as its difficulty, length, etc.). https://www.kaggle.com/c/riiid-test-answer-prediction/overview<br />
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 31 Group members:''' <br />
<br />
Tompkins, Grace<br />
<br />
Krikella, Tatiana<br />
<br />
'''Title:''' A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting (2018) <br />
'''Description:'''<br />
We will be reproducing the results of "A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting" by Cannas and Arpino (2018) and applying the results to a new dataset, Right Heart Catheterization (RHC) which includes data from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT), for comparison. This paper uses simulated data and several machine learning algorithms to estimate causal effects in observational studies. The machine learning methods used include CART, Bagging, Boosting, Random Forest, Neural Networks, and Naive Bayes. There are also several variations of measures of covariate balancing used in the study. The importance of tuning the machine learning algorithms' hyperparameters is also investigated with respect to propensity score estimation. <br />
<br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 32 Group members:''' <br />
<br />
Taohao Wang<br />
Zeren Shen<br />
Zihao Guo<br />
Rui Chen<br />
<br />
'''Title:''' Google Landmark Recognition 2020<br />
<br />
'''Description:'''<br />
Our team decided to try the "Google Landmark Recognition 2020" Kaggle competition, in which competitors are asked to build a model that detects any existing landmarks within the provided test images. This competition is challenging in its own way: it has more than 81K classes in its data, where a traditional CNN would very likely fail (too many parameters to train, especially when taking convolutional layers into account). We would like to implement several algorithms/frameworks that can utilize a large amount of data with noisy labels, apply them to the provided dataset, and compare their performance (training time, number of parameters trained, multiple metrics for accuracy/loss evaluation, etc.) for our report.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 33 Group members:''' <br />
<br />
Hansa Halim<br />
<br />
Sanjana Rajendra Naik<br />
<br />
Samka Marfua<br />
<br />
Shawrupa Proshasty<br />
<br />
'''Title:''' Jane Street Market Prediction Kaggle Competition<br />
<br />
'''Description:'''<br />
Our team will participate in the Jane Street Market Prediction Competition on Kaggle. We will create a model that involves time series to give a prediction to either execute a trade (1) or not (0) on real-time market prices during live trading hours. The model we create will be submitted through an API and will be tested and scored by Kaggle using real-time market data so that means we cannot submit predictions on past market data and that our model is evaluated on future data. <br />
<br />
Link to Kaggle Competition: [https://www.kaggle.com/c/jane-street-market-prediction/ Jane Street Market Prediction]</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=47993F21-STAT 441/841 CM 763-Proposal2020-11-30T00:29:59Z<p>Hhalim: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman,Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will be using the Prudential Life Insurance Data Set that the authors used and have shared with us. We will pre-process the data to replace missing values, apply feature selection using CFS and feature reduction using PCA, and use the processed data to perform classification via four algorithms – Neural Networks, Random Tree, REPTree and Multiple Linear Regression. We will compare the performance of these algorithms using MAE and RMSE metrics and come up with visualizations that can explain the results easily even to a non-quantitative audience. <br />
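<br />
The two comparison metrics named above can be sketched in a few lines. MAE is the mean absolute difference between predictions and targets, and RMSE is the square root of the mean squared difference; the function names below are chosen for illustration.<br />

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Usage with toy risk levels: RMSE is never smaller than MAE on the same data.
truth = [1, 2, 3]
preds = [2, 2, 5]
errors = (mae(truth, preds), rmse(truth, preds))
```

Reporting both is informative because a large gap between RMSE and MAE indicates that a few predictions are far off rather than all predictions being slightly off.<br />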
<br />
Our goal for this project is to learn to apply the algorithms that we learned in class to an industry dataset and come up with results that can aid better, data-driven decision making.<br />
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model which, given a set of objects on the road (pedestrians, other cars, etc.), predicts the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
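<br />
To illustrate what training with an abstention option can look like, the sketch below gives the model one extra output for "abstain" and computes a loss of the form -log(p_target + p_abstain / payoff) for a single sample. This form reflects our reading of the approach and should be checked against the paper; the function name, the argument layout, and the example payoff value are assumptions.<br />

```python
import math


def abstention_loss(probs, target, payoff):
    """Loss for one sample when the model may abstain.

    probs:  m class probabilities followed by one abstention probability
            (the m + 1 values should sum to 1).
    target: index of the true class, in range(m).
    payoff: value greater than 1; larger values make abstaining
            less rewarding (hypothetical tuning parameter).
    """
    p_abstain = probs[-1]
    return -math.log(probs[target] + p_abstain / payoff)


# Usage: two classes plus abstention, with part of the mass held back.
loss = abstention_loss([0.5, 0.3, 0.2], target=0, payoff=2.0)
```

Probability mass placed on the abstention output is only partly credited, so the model is rewarded for abstaining on uncertain inputs but still does better by predicting the correct class confidently.<br />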
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on arXiv. The goal of this paper is to combine CNN and RNN architectures through a “neural tensor layer” that merges the outputs of both architectures more flexibly than simple concatenation, in order to improve text classification. In particular, the paper claims that its novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in two pairs, one pair to implement the CNN pipeline and the other pair to implement the RNN pipeline. We will be working with TensorFlow 2 and Google Colab, and reproducing the paper’s experimental results by training on the same 6 publicly available datasets used in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in a Kaggle research challenge, "Mechanisms of Action (MoA) Prediction". This competition is presented by a project within the Broad Institute of MIT and Harvard, together with the Laboratory for Innovation Science at Harvard (LISH) and the NIH Common Funds Library of Integrated Network-Based Cellular Signatures (LINCS), with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from the IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that recognizes fraudulent transactions with higher accuracy and speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature, and our goal is to learn to apply the various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plates data. First, we plan to obtain data from 4 to 6 participants performing squats or squats with weights and rate them on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. When the participants are squatting, we will ask them about their fatigue level, and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being cannot continue anymore). Once data is collected, we will classify the motion capture and force plates data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities; that is, we cannot tell which galaxy in a group of possible candidates is actually responsible for the submillimetre emission. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on other available data. With this newly labelled dataset from ALMA, it is possible to develop and test new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), who applied a set of standard classification algorithms to the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centered perspective. Next, we hope to extend their work in one or more of the following directions: (1) Incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate. (2) Taking measurement error into account in the classification algorithms, possibly through a Bayesian approach. (All features in astronomy datasets come from actual physical measurements, which come with an error bar. However, it is not clear how to incorporate this error into the classification task.) (3) Combining some traditional astronomy approaches with algorithms from ML.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States has one of the most actively traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that, using variables such as transaction date, security type, and transaction amount, we can predict the role code for a new transaction. We chose this prediction target because the role of the insider gives investors signals of potential internal activities and private information. Detecting these market signals from insider trading activities is crucial for investors who wish to benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to help to develop an algorithm to predict a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our execution plan is to apply concepts and algorithms learned in STAT441 and apply multi-label classification. Through the process, our team will learn biological knowledge necessary to complete and enhance our classification thought-process. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition that revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions exuded from Gas turbine engines. The implementation of this system is reliant on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hopes of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we'll use a number of machine learning algorithms to discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in this paper in order to measure the effectiveness of our problem solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge, Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies each whale's class based on its tail. There are over 3000 classes and 25361 training images in total. The challenge is that each class has only about 8 training images on average. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
The final project of our team is the ongoing Kaggle competition Mechanisms of Action (MoA) Prediction. The goal is to improve the MoA prediction algorithm to assist and advance drug development. MoA algorithms help scientists find more targeted medicine molecules based on the biological mechanism of a disease, which could substantially shorten the medicine development cycle. Here, the MoA task applies different drugs to human cells and analyzes the corresponding reactions; the dataset provides simultaneous measurements of 100 types of human cells and 5000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, SVM, etc., and find the one that best completes the task. Depending on how we perform, we might use other techniques such as model ensembling or stacking.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to the description of the biological activity of a given molecule, and scientists have a specific interest in the MoA of molecules as it pertains to the advancement of drugs. This is because, under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoA.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.<br />
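<br />
As a rough illustration of the evaluation metric, the sketch below averages the binary logarithmic loss over every (sample, label) pair. The function name, the clipping constant, and the list-of-lists input layout are assumptions made for this example; Kaggle's exact scoring code is not reproduced here.<br />

```python
import math

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over all (sample, label) pairs.

    y_true: rows of 0/1 labels; y_pred: rows of predicted probabilities.
    eps is an assumed clipping constant that keeps log() finite.
    """
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # clip away from 0 and 1
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count
```

A model that always predicts 0.5 scores log(2) on any labels, which gives a simple baseline to beat.<br />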
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
Budaraju, Sai Arvind<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
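<br />
One possible encoding of aligned sequences into vectors is a per-position one-hot encoding, sketched below. The alphabet (four bases plus the alignment gap) and the choice to leave unknown symbols as all-zero slots are assumptions for illustration only; the encoding we finally use may differ.<br />

```python
ALPHABET = "ACGT-"  # assumed alphabet: four bases plus the alignment gap

def one_hot_encode(aligned_seq, alphabet=ALPHABET):
    """Map one aligned sequence to a flat 0/1 vector.

    Each position contributes len(alphabet) entries. Symbols outside the
    alphabet (for example the ambiguity code N) are left as all zeros.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    vec = []
    for ch in aligned_seq.upper():
        slot = [0] * len(alphabet)
        if ch in index:
            slot[index[ch]] = 1
        vec.extend(slot)
    return vec
```

Because all sequences in an MSA have the same length, every encoded vector has the same dimension, which is what PCA and LDA/QDA require.<br />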
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model and use area under ROC curve to evaluate our model.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for the self-driving car by using our machine learning knowledge together with the provided training and testing data sets. The motion prediction model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not sure if we have to classify the agents into three categories (cars, cyclists, pedestrians) ourselves. If so, we will start with the single-shot detector algorithm and improve on it.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing result in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
As our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu to accelerate the iteration complexity of power iteration. By doing so, we aim to achieve a final rate of 𝒪(1/sqrt(Δ)) in our reproduction. If there is additional time, we also hope to explore and discuss the potential of applying such an acceleration method to other non-convex optimization problems, as mentioned in the original paper. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
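<br />
To make the idea of accelerating power iteration concrete, the sketch below adds a momentum (heavy-ball) term to the deterministic power method. This is an illustrative assumption about the form of the acceleration, not a reproduction of the paper's stochastic algorithm; the function names and the choice of the momentum parameter beta are hypothetical.<br />

```python
import math

def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def power_iteration_momentum(A, beta, steps=200):
    """Power iteration with a momentum term: w_next = A w - beta * w_prev.

    beta is a tuning parameter (assumed to be given). Both iterates are
    rescaled by the same factor so the recurrence stays consistent.
    """
    n = len(A)
    w_prev = [0.0] * n
    w = [1.0 / math.sqrt(n)] * n
    for _ in range(steps):
        Aw = matvec(A, w)
        w_next = [a - beta * p for a, p in zip(Aw, w_prev)]
        scale = norm(w_next)
        w_prev = [x / scale for x in w]
        w = [x / scale for x in w_next]
    return w
```

For a toy diagonal matrix with eigenvalues 2 and 1, calling the function with beta = 0.25 returns a vector aligned with the leading eigenvector.<br />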
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to be involved in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for the self-driving car by machine learning with the datasets they provided.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. Mechanisms of Action, MOA for short, describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response of a drug based on its similarities to other known drugs. <br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional<br />
neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.’s 2020 paper “Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss”. In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested against two ECG datasets, MIT-BIH and INCART, and returned a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate. <br />
Interestingly, focal loss was implemented to prevent bias towards larger classes (normal heart beats) without needing to augment the smaller class data (diseased heart beats); however, the authors did not report which method actually performs better. Therefore, we hope to extend their work by answering this question in this project.<br />
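<br />
For reference, a minimal sketch of focal loss for a single sample is given below, assuming the usual form -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The default values of gamma and alpha are placeholders, not the settings used by Romdhane et al.<br />

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample.

    probs: predicted class probabilities; target: index of the true class.
    gamma and alpha are assumed defaults for illustration. With gamma = 0
    and alpha = 1 this reduces to ordinary cross-entropy.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

Well-classified samples (p_t close to 1) are down-weighted by the factor (1 - p_t)^gamma, so the abundant normal beats contribute less to the total loss than the hard, rare beats.<br />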
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to explore single-hidden-layer neural networks in more depth, building on what we have learned in class. We'll focus on Gaussian-distributed data and weights so that we can provide an expression for the spectrum in the limit of infinite width. We believe that we can improve the efficiency of first-order optimization by using the spectrum. <br />
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will take part in the "Riiid! Answer Correctness Prediction" Kaggle competition. We will predict students' performance on a particular question based on their historic performance, the performance of other students on this question, and information about the question itself (such as its difficulty and length). https://www.kaggle.com/c/riiid-test-answer-prediction/overview<br />
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 31 Group members:''' <br />
<br />
Tompkins, Grace<br />
<br />
Krikella, Tatiana<br />
<br />
'''Title:''' A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting (2018) <br />
'''Description:'''<br />
We will be reproducing the results of "A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting" by Cannas and Arpino (2018) and applying the results to a new dataset, Right Heart Catheterization (RHC) which includes data from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT), for comparison. This paper uses simulated data and several machine learning algorithms to estimate causal effects in observational studies. The machine learning methods used include CART, Bagging, Boosting, Random Forest, Neural Networks, and Naive Bayes. There are also several variations of measures of covariate balancing used in the study. The importance of tuning the machine learning algorithms' hyperparameters is also investigated with respect to propensity score estimation. <br />
<br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 32 Group members:''' <br />
<br />
Taohao Wang<br />
Zeren Shen<br />
Zihao Guo<br />
Rui Chen<br />
<br />
'''Title:''' Google Landmark Recognition 2020<br />
<br />
'''Description:'''<br />
Our team decided to try the "Google Landmark Recognition 2020" Kaggle competition,<br />
in which the competitors are asked to build a model to detect any existing landmarks within provided test images.<br />
This competition is challenging in its own way: it has more than 81K classes within its data, where a traditional CNN would very<br />
likely fail (too many parameters to train, especially when taking convolutional layers into account). We would like to implement several <br />
algorithms/frameworks which can utilize a large amount of data with noisy labels, apply them to the provided dataset, and compare their performance (training time, <br />
number of parameters trained, multiple metrics for accuracy/loss evaluation, etc.) for our report.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 33 Group members:''' <br />
<br />
Hansa Halim<br />
<br />
Sanjana Rajendra Naik<br />
<br />
Samka Marfua<br />
<br />
Shawrupa Proshasty<br />
<br />
'''Title:''' Jane Street Market Prediction Kaggle Competition<br />
<br />
'''Description:'''<br />
Our team will participate in the Jane Street Market Prediction Competition on Kaggle. We will create a time-series model that predicts whether to execute a trade (1) or not (0) in a real-time market. The model we create will be submitted through an API and will be tested and scored by Kaggle using real-time market data, which means we cannot submit predictions on past market data. <br />
<br />
Link to Kaggle Competition: [https://www.kaggle.com/c/jane-street-market-prediction/ Jane Street Market Prediction]</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:T358wang&diff=46217User:T358wang2020-11-24T16:47:14Z<p>Hhalim: /* Introduction */</p>
<hr />
<div><br />
== Group ==<br />
Rui Chen, Zeren Shen, Zihao Guo, Taohao Wang<br />
<br />
== Introduction ==<br />
<br />
Landmark recognition is an image retrieval task with its own specific challenges. This paper provides a new and effective method to recognize landmark images, which has been successfully applied to actual images. In this way, statues, buildings, and characteristic objects can be effectively identified.<br />
<br />
There are many difficulties encountered in the development process:<br />
<br />
1. The first problem is that the concept of landmarks cannot be strictly defined, because landmarks can be any object and building.<br />
<br />
2. The second problem is that the same landmark can be photographed from different angles. Multi-angle shooting results in very different picture characteristics, yet the system still needs to identify the landmark accurately. We may also need to consider shots of the interior of a building versus its exterior; a good model will be able to recognize both.<br />
<br />
3. The third problem is that the landmark recognition system must recognize a large number of landmarks, and the recognition must achieve high accuracy. The challenge here is that there are significantly more objects in the world that are not landmarks.<br />
<br />
These problems require that the system should have a very low false alarm rate and high recognition accuracy. <br />
There are also three potential problems:<br />
<br />
1. The processed data set contains some erroneous content: the images are not clean and the quantity is huge.<br />
<br />
2. The algorithm for learning the training set must be fast and scalable.<br />
<br />
3. The system must make high-quality landmark judgments without relying on any geographic information attached to the image.<br />
<br />
The article describes the deep convolutional neural network (CNN) architecture, loss function, training method, and inference aspects. Using this model, similar metrics to the state of the art model in the test were obtained and the inference time was found to be 15 times faster. Further, because of the efficient architecture, the system can serve in an online fashion. The results of quantitative experiments will be displayed through testing and deployment effect analysis to prove the effectiveness of the model.<br />
<br />
== Related Work ==<br />
<br />
Landmark recognition can be regarded as one of the tasks of image retrieval, and a large body of literature focuses on image retrieval tasks. In the past two decades, the field of image retrieval has made significant progress, and the main methods can be divided into two categories. <br />
The first is the classic retrieval approach using local features, based on local feature descriptors organized in bag-of-words, spatial verification, Hamming embedding, and query expansion. These methods were dominant in image retrieval until the rise of deep convolutional neural networks (CNNs), which were then used to generate global descriptors of input images.<br />
Another method is the selective match kernel, an extension of the Hamming embedding method. With the advent of deep convolutional neural networks, the most effective image retrieval methods are based on training CNNs for specific tasks. Deep networks are very powerful for semantic feature representation, which allows us to effectively use them for landmark recognition. This method shows good results but brings additional memory and complexity costs. <br />
DELF (DEep Local Feature) by Noh et al. showed promising results. This method combines the classic local feature method with deep learning. This allows us to extract local features from the input image and then use RANSAC for geometric verification. Random Sample Consensus (RANSAC) is a method to smooth data containing a significant percentage of errors, which is ideally suited for applications in automated image analysis where interpretation is based on the data generated by error-prone feature detectors. The goal of the project is to describe a method for accurate and fast large-scale landmark recognition using the advantages of deep convolutional neural networks.<br />
<br />
== Methodology ==<br />
<br />
This section will describe in detail the CNN architecture, loss function, training procedure, and inference implementation of the landmark recognition system. The figure below is an overview of the landmark recognition system.<br />
<br />
[[File:t358wang_landmark_recog_system.png |center|800px]]<br />
<br />
The landmark CNN consists of three parts including the main network, the embedding layer, and the classification layer. To obtain a CNN main network suitable for training landmark recognition model, fine-tuning is applied and several pre-trained backbones (Residual Networks) based on other similar datasets, including ResNet-50, ResNet-200, SE-ResNext-101, and Wide Residual Network (WRN-50-2), are evaluated based on inference quality and efficiency. Based on the evaluation results, WRN-50-2 is selected as the optimal backbone architecture.<br />
<br />
[[File:t358wang_backbones.png |center|600px]]<br />
<br />
For the embedding layer, as shown in the below figure, the last fully-connected layer after the averaging pool is removed. Instead, a fully-connected 2048 <math>\times</math> 512 layer and a batch normalization are added as the embedding layer. After the batch norm, a fully-connected 512 <math>\times</math> n layer is added as the classification layer. The below figure shows the overview of the CNN architecture of the landmark recognition system.<br />
<br />
[[File:t358wang_network_arch.png |center|800px]]<br />
<br />
To effectively determine the embedding vectors for each landmark class (centroids), the network needs to be trained to have the members of each class to be as close as possible to the centroids. Several suitable loss functions are evaluated including Contrastive Loss, Arcface, and Center loss. The center loss is selected since it achieves the optimal test results and it trains a center of embeddings of each class and penalizes distances between image embeddings as well as their class centers. In addition, the center loss is a simple addition to softmax loss and is trivial to implement.<br />
<br />
When implementing the loss function, a new additional class that includes all non-landmark instances needs to be added and the center loss function needs to be modified as follows: Let n be the number of landmark classes, m be the mini-batch size, <math>x_i \in R^d</math> is the i-th embedding and <math>y_i</math> is the corresponding label where <math>y_i \in</math> {1,...,n,n+1}, n+1 is the label of the non-landmark class. Denote <math>W \in R^{d \times n}</math> as the weights of the classifier layer, <math>W_j</math> as its j-th column. Let <math>c_{y_i}</math> be the <math>y_i</math> th embeddings center from Center loss and <math>\lambda</math> be the balancing parameter of Center loss. Then the final loss function will be: <br />
<br />
[[File:t358wang_loss_function.png |center|600px]]<br />
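<br />
As a rough sketch of how softmax loss and center loss combine, the code below assumes the standard form of center loss: softmax cross-entropy plus <math>\lambda/2</math> times the squared distance between an embedding and its class center, averaged over the mini-batch. The paper's special handling of the non-landmark class is not reproduced, and the function names and the averaging convention are assumptions.<br />

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def center_term(embedding, center):
    """Half the squared Euclidean distance to the class center."""
    return 0.5 * sum((x - c) ** 2 for x, c in zip(embedding, center))

def softmax_center_loss(logits_batch, embeddings, labels, centers, lam=5e-5):
    """Mean over the batch of softmax loss + lam * center term.

    centers[y] is the current center of class y (assumed to be maintained
    elsewhere); lam plays the role of the balancing parameter lambda.
    """
    total = 0.0
    for logits, x, y in zip(logits_batch, embeddings, labels):
        total += softmax_cross_entropy(logits, y) + lam * center_term(x, centers[y])
    return total / len(labels)
```

With a small <math>\lambda</math> the softmax term dominates, and the center term acts as a mild pull of each embedding toward its class center.<br />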
<br />
In the training procedure, the stochastic gradient descent(SGD) will be used as the optimizer with momentum=0.9 and weight decay = 5e-3. For the center loss function, the parameter <math>\lambda</math> is set to 5e-5. Each image is resized to 256 <math>\times</math> 256 and several data augmentations are applied to the dataset including random resized crop, color jitter and random flip. The training dataset is divided into four parts based on the geographical affiliation of cities where landmarks are located: Europe/Russia, North America/Australia/Oceania, Middle East/North Africa, and the Far East. <br />
<br />
The paper introduces curriculum learning for landmark recognition, which is shown in the below figure. The algorithm is trained for 30 epochs and the learning rate <math>\alpha_1, \alpha_2, \alpha_3</math> will be reduced by a factor of 10 at the 12th epoch and 24th epoch.<br />
<br />
[[File:t358wang_algorithm1.png |center|600px]]<br />
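<br />
The step-wise learning rate decay described above can be sketched as follows. The base learning rate passed in is a placeholder (the summary does not state the initial values of the learning rates), and treating epochs as 0-indexed is an assumption.<br />

```python
def step_lr(base_lr, epoch, milestones=(12, 24), factor=0.1):
    """Learning rate at a given epoch under step decay.

    The rate is multiplied by `factor` once for every milestone that has
    been reached, so it drops by 10x at epoch 12 and again at epoch 24.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```

Over 30 epochs this gives three plateaus: the base rate, one tenth of it, and one hundredth of it.<br />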
<br />
In the inference phase, the paper introduces the term “centroids”, which are embedding vectors that are calculated by averaging embeddings and are used to describe landmark classes. The calculation of centroids is significant for effectively determining whether a query image contains a landmark. The paper proposes two approaches to help the inference algorithm calculate the centroids. First, instead of using the entire training data for each landmark, data cleaning is done to remove most of the redundant and irrelevant elements in the image. Second, since each landmark can have different shooting angles, it is more efficient to calculate a separate centroid for each shooting angle. Hence, a hierarchical agglomerative clustering algorithm is proposed to partition training data into several valid clusters for each landmark, and the set of centroids for a landmark L can be represented by <math>\mu_{l_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i, \; j \in \{1,...,v\}</math>, where <math>C_j</math> is the j-th valid cluster and v is the number of valid clusters for landmark L. <br />
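<br />
A minimal sketch of the centroid computation is shown below. It assumes the clustering has already been done and that each training embedding comes with a cluster id; the function name and the input layout are hypothetical.<br />

```python
def cluster_centroids(embeddings, cluster_ids):
    """Average the embeddings of each cluster to obtain its centroid.

    embeddings: list of equal-length vectors for one landmark.
    cluster_ids: cluster id of each embedding (assumed to come from a
    separate hierarchical agglomerative clustering step).
    """
    sums, counts = {}, {}
    for x, cid in zip(embeddings, cluster_ids):
        if cid not in sums:
            sums[cid] = [0.0] * len(x)
            counts[cid] = 0
        sums[cid] = [s + v for s, v in zip(sums[cid], x)]
        counts[cid] += 1
    return {cid: [s / counts[cid] for s in sums[cid]] for cid in sums}
```

A landmark photographed from two distinct angles would then be described by two centroids rather than by one blurred average.<br />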
<br />
Once the centroids are calculated for each landmark class, the system can make decisions whether there is any landmark in an image. The query image is passed through the landmark CNN and the resulting embedding vector is compared with all centroids by dot product similarity using approximate k-nearest neighbors (AKNN). To distinguish landmark classes from non-landmark, a threshold <math>\eta</math> is set and it will be compared with the maximum similarity to determine if the image contains any landmarks.<br />
<br />
The full inference algorithm is described in the below figure.<br />
<br />
[[File:t358wang_algorithm2.png |center|600px]]<br />
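<br />
The decision rule at the end of inference can be sketched as below. For simplicity the sketch uses an exhaustive dot-product search in place of approximate k-nearest neighbors, and it treats a similarity equal to the threshold as a match; both choices, as well as the function name, are assumptions.<br />

```python
def recognize(query, centroids, eta):
    """Return (label, similarity) of the best centroid, or (None, sim).

    query: embedding of the query image.
    centroids: list of (label, vector) pairs.
    eta: similarity threshold separating landmarks from non-landmarks.
    Brute-force search stands in for the AKNN lookup.
    """
    best_label, best_sim = None, float("-inf")
    for label, c in centroids:
        sim = sum(q * v for q, v in zip(query, c))  # dot-product similarity
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim >= eta:
        return best_label, best_sim
    return None, best_sim
```

Raising <math>\eta</math> trades sensitivity for specificity: fewer non-landmark images are mislabeled, but more true landmarks are rejected.<br />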
<br />
== Experiments and Analysis ==<br />
<br />
'''Offline test'''<br />
<br />
In order to measure the quality of the model, an offline test set was collected and manually labeled. According to the calculations, photos containing landmarks make up 1 − 3% of the total number of photos on average. This distribution was emulated in an offline test, and the geo-information and landmark references weren’t used. <br />
The results of this test are presented in the table below. Two metrics were used to measure the results of experiments: Sensitivity, the accuracy of a model on images with landmarks (also called Recall), and Specificity, the accuracy of a model on images without landmarks. Several types of DELF were evaluated, and the best results in terms of sensitivity and specificity were included in the table below. The table also contains the results of the model trained with Softmax loss only and with Softmax plus Center loss. Thus, the table below reflects the improvements in our approach as new elements are added to it.<br />
<br />
[[File:t358wang_models_eval.png |center|600px]]<br />
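<br />
The two metrics can be computed as in the sketch below, which assumes that each image has a ground-truth label (None for images without a landmark) and a predicted label (None when the system reports no landmark). Counting a landmark image as correct only when the predicted landmark matches the true one is an assumption about how accuracy on landmark images is defined.<br />

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for landmark recognition.

    truth / predicted: per-image labels, with None meaning "no landmark".
    Sensitivity: share of landmark images whose landmark is predicted correctly.
    Specificity: share of non-landmark images predicted as having no landmark.
    """
    pos = pos_ok = neg = neg_ok = 0
    for t, p in zip(truth, predicted):
        if t is None:
            neg += 1
            neg_ok += p is None
        else:
            pos += 1
            pos_ok += p == t
    sens = pos_ok / pos if pos else float("nan")
    spec = neg_ok / neg if neg else float("nan")
    return sens, spec
```

Since only a small fraction of photos contain landmarks, even a small drop in specificity produces many false alarms, which is why both numbers are reported.<br />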
<br />
It’s very important to understand how a model works on “rare” landmarks due to the small amount of data for them. Therefore, the behavior of the model was examined separately on “rare” and “frequent” landmarks in the table below. The column “Part from total number” shows what percentage of landmark examples in the offline test belongs to the corresponding type of landmark. We find that the sensitivity on “frequent” landmarks is much higher than on “rare” landmarks.<br />
<br />
[[File:t358wang_rare_freq.png |center|600px]]<br />
<br />
An analysis of the behavior of the model on different categories of landmarks in the offline test is presented in the table below. These results show that the model works successfully across various categories of landmarks. As expected, better results (92% sensitivity and 99.5% specificity) were obtained when the offline test was run with geo-information.<br />
<br />
[[File:t358wang_landmark_category.png |center|600px]]<br />
<br />
'''Revisited Paris dataset'''<br />
<br />
The Revisited Paris dataset (RPar) [2] was also used to measure the quality of the landmark recognition approach. Together with Revisited Oxford (ROxf), it is a standard benchmark for comparing image retrieval algorithms. In recognition, the important thing is to determine which landmark the query image contains. Images of the same landmark can be taken from different shooting angles, or from inside or outside the building. Thus, it is reasonable to measure the quality of the model on the standard benchmark while adapting it to the recognition setting; in particular, not all classes from the queries are present in the landmark dataset. Images that contain the correct landmark but were taken from different shooting angles within the building were moved to the “junk” category, which does not influence the final score and brings the test markup closer to the model’s goal. Results on RPar with and without distractors in medium and hard modes are presented in the tables below. <br />
<br />
[[File:t358wang_methods_eval1.png |center|600px]]<br />
<br />
[[File:t358wang_methods_eval2.png |center|600px]]<br />
<br />
== Comparison ==<br />
<br />
The most efficient recent approaches to landmark recognition are built on fine-tuned CNNs. We chose to compare our method to DELF on how well each performs on recognition tasks. A brief summary is given below:<br />
<br />
[[File:t358wang_comparison.png |center|600px]]<br />
<br />
''' Offline test and timing '''<br />
<br />
Both approaches obtained similar results for image retrieval in the offline test (shown in the sensitivity and specificity table), but the proposed approach is much faster at the inference stage and more memory efficient.<br />
<br />
In more detail, during the inference stage DELF needs more forward passes through the CNN, has to search the entire database, and performs RANSAC for geometric verification. All of these make it much more time-consuming than the proposed approach. Because the proposed approach mainly compares against centroids, it takes less time and needs to store fewer elements.<br />
<br />
== Conclusion ==<br />
<br />
This paper set out to address several difficulties that arise when applying landmark recognition at production level: a clean and sufficiently large database may not exist for the task of interest, and algorithms should be fast and scalable and should aim for few false positives and high accuracy.<br />
<br />
While aiming for these goals, we presented a way of cleaning landmark data. Most importantly, we introduced the use of deep CNN embeddings, trained by curriculum learning techniques with modified versions of Center loss, to make recognition fast and scalable. Compared to the state-of-the-art methods, this approach shows similar results but is much faster and suitable for implementation on a large scale.<br />
<br />
== Critique ==<br />
The paper selected 5 images per landmark and checked them manually. This means that data cleaning takes a long time, so the proposed algorithm lacks reusability. Also, since only the largest and most popular landmarks were used to train the CNN, the trained model will probably be useful in big cities rather than in smaller cities with less popular landmarks.<br />
<br />
== References ==<br />
[1] Andrei Boiarov and Eduard Tyantov. 2019. Large Scale Landmark Recognition via Deep Metric Learning. In The 28th ACM International Conference on Information and Knowledge Management (CIKM ’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://arxiv.org/pdf/1908.10192.pdf 3357384.3357956<br />
<br />
[2] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum.<br />
2018. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking.<br />
arXiv preprint arXiv:1803.11285 (2018).</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou&diff=46131User:J46hou2020-11-23T21:33:54Z<p>Hhalim: /* Critiques/Insights */</p>
<hr />
<div>DROCC: Deep Robust One-Class Classification<br />
== Presented by == <br />
Jinjiang Lian, Yisheng Zhu, Jiawen Hou, Mingzhe Huang<br />
== Introduction ==<br />
In this work, we study “one-class” classification, where the goal is to obtain accurate discriminators for a special class. Popular uses of this technique include anomaly detection, where we are interested in detecting outliers. Anomaly detection is a well-studied area of research; however, the conventional approach of modelling typical data with a simple function falls short in complex domains such as vision or speech. Another use case is recognizing a “wake word” that wakes up an AI system such as Alexa. In this work, we present a new approach called Deep Robust One-Class Classification (DROCC). DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear, low-dimensional manifold. We also present DROCC-LF, an outlier-exposure style extension of DROCC that combines DROCC's anomaly detection loss with a standard classification loss over the negative data.<br />
<br />
== Previous Work ==<br />
The current state-of-the-art methods for this kind of problem are: <br />
1. Approaches based on predicting transformations (Golan & El-Yaniv, 2018; Hendrycks et al., 2019a) [1]. Their shortcoming is a heavy dependence on an appropriate domain-specific set of transformations, which is in general hard to obtain. <br />
2. Approaches that minimize a classical one-class loss on the learned final-layer representations, such as DeepSVDD (Ruff et al., 2018) [2]. This method suffers from the fundamental drawback of representation collapse, where the model maps all points to nearly the same representation and can no longer tell them apart. <br />
== Motivation ==<br />
Anomaly detection is a well-studied problem with a large body of research (Aggarwal, 2016; Chandola et al., 2009) [3]. Classical approaches for anomaly detection are based on modeling the typical data using simple functions over the inputs (Schölkopf et al., 1999; Liu et al., 2008; Lakhina et al., 2004) [4], such as constructing a minimum-enclosing ball around the typical data points (Tax & Duin, 2004) [5]. While these techniques are well-suited when the input is featurized appropriately, they struggle on complex domains like vision and speech, where hand-designing features is difficult.<br />
DROCC is robust to representation collapse by involving a discriminative component that is general and empirically accurate on most standard domains like tabular, time-series and vision without requiring any additional side information. DROCC is motivated by the key observation that generally, the typical data lies on a low-dimensional manifold, which is well-sampled in the training data. This is believed to be true even in complex domains such as vision, speech, and natural language (Pless & Souvenir, 2009). [6]<br />
== Model Explanation ==<br />
[[File:drocc_f1.jpg | center]]<br />
<div align="center">Figure 1</div><br />
(a) A normal data manifold with red dots representing generated anomalous points in <math>N_i(r)</math>. <br />
<br />
(b) Decision boundary learned by DROCC when applied to the data from (a). Blue represents points classified as normal and red points are classified as abnormal. <br />
<br />
(c), (d): First two dimensions of the decision boundary of DROCC and DROCC–LF when applied to noisy data (Section 5.2). DROCC–LF is nearly optimal, while DROCC’s decision boundary is inaccurate. The yellow sine wave depicts the training data.<br />
<br />
== DROCC ==<br />
The model is based on the assumption that the true data lies on a manifold. As manifolds resemble Euclidean space locally, our discriminative component is based on classifying a point as anomalous if it is outside the union of small L2 norm balls around the typical training points (see Figure 1a, 1b for an illustration). Importantly, this definition allows us to synthetically generate anomalous points, and we adaptively generate the most effective anomalous points while training via a gradient ascent phase reminiscent of adversarial training. In other words, DROCC has a gradient ascent phase that adaptively adds anomalous points to the training set and a gradient descent phase that minimizes the classification loss by learning a representation, and a classifier on top of the representation, to separate typical points from the generated anomalous points. In this way, DROCC automatically learns an appropriate representation (like DeepSVDD) but is robust to representation collapse, as mapping all points to the same value would lead to poor discrimination between normal points and the generated anomalous points.<br />
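<br />
The geometric rule underlying this assumption (a point is anomalous if it lies outside every small L2 ball around a typical training point) can be sketched directly. This illustrates only the rule, not DROCC itself, which learns a neural classifier with gradient ascent and descent; the function name, the radius r, and the toy points are hypothetical.<br />

```python
import math


def outside_all_balls(point, typical_points, r):
    """True if `point` lies outside the union of radius-r L2 balls
    centred on the typical (training) points."""
    return all(math.dist(point, x) > r for x in typical_points)


# Toy "manifold": a few points along a line segment.
typical = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
print(outside_all_balls((0.6, 0.1), typical, r=0.3))  # close to the data
print(outside_all_balls((0.5, 2.0), typical, r=0.3))  # far from the data
```

In DROCC this rule is never evaluated explicitly at test time; it motivates where synthetic anomalous points are generated during training.<br />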
== DROCC-LF ==<br />
To tackle problems such as anomaly detection with outlier exposure (Hendrycks et al., 2019a) [7], we propose DROCC–LF, an outlier-exposure style extension of DROCC. Intuitively, DROCC–LF combines DROCC’s anomaly detection loss (which is over only the positive data points) with a standard classification loss over the negative data. In addition, DROCC–LF exploits the negative examples to learn a Mahalanobis distance for comparing points over the manifold instead of using the standard Euclidean distance, which can be inaccurate for high-dimensional data with relatively few samples (see Figure 1c, 1d for an illustration).<br />
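<br />
To make the difference between the two distances concrete, the sketch below compares Euclidean distance with a Mahalanobis distance that uses diagonal per-feature weights. The diagonal form and the example weights are assumptions for illustration; how DROCC–LF learns its metric from the negative examples is not reproduced here.<br />

```python
import math


def mahalanobis_diag(x, y, weights):
    """Distance sqrt(sum_j w_j * (x_j - y_j)^2) with non-negative weights.

    With all weights equal to 1 this reduces to the Euclidean distance.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


x, y = [0.0, 0.0], [3.0, 4.0]
print(mahalanobis_diag(x, y, [1.0, 1.0]))  # Euclidean distance
print(mahalanobis_diag(x, y, [1.0, 0.0]))  # second feature ignored
```

Down-weighting uninformative features in this way keeps noisy dimensions from dominating the distance when samples are scarce relative to the dimension.<br />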
<br />
== Popular Dataset Benchmark Result ==<br />
<br />
[[File:drocc_auc.jpg | center]]<br />
<div align="center">Figure 2: AUC result</div><br />
<br />
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class. The average AUC (with standard deviation) for one-vs-all anomaly detection on CIFAR-10 is shown in Table 1 (Figure 2 above). DROCC outperforms the baselines on most classes, with gains as high as 20%; notably, nearest neighbors (NN) beats all the baselines on 2 classes.<br />
<br />
[[File:drocc_f1score.jpg | center]]<br />
<div align="center">Figure 3: F1-Score</div><br />
<br />
Table 2 (Figure 3 above) shows the F1-score (with standard deviation) for one-vs-all anomaly detection on the Thyroid, Arrhythmia, and Abalone datasets from the UCI Machine Learning Repository. DROCC outperforms the baselines on all three datasets by a minimum of 0.07, which is about an 11.5% increase in performance.<br />
Results on One-class Classification with Limited Negatives (OCLN): <br />
[[File:ocln.jpg | center]]<br />
<div align="center">Figure 4: Sample positives, negatives and close negatives for MNIST digit 0 vs 1 experiment (OCLN).</div><br />
MNIST 0 vs. 1 Classification: <br />
We consider an experimental setup on the MNIST dataset, where the training data consists of the digit 0 as the normal class and the digit 1 as the anomaly. During evaluation, in addition to samples from the training distribution, we also have half zeros, which act as challenging OOD points (close negatives). These half zeros are generated by randomly masking 50% of the pixels (Figure 4). BCE performs poorly, with a recall of only 54% at a fixed FPR of 3%. DROCC–OE gives a recall of 98.16%, outperforming DeepSAD, which gives a recall of 90.91%, by a margin of 7%. DROCC–LF provides further improvement with a recall of 99.4% at 3% FPR. <br />
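<br />
"Recall at a fixed FPR" means that the decision threshold is chosen so that at most the given fraction of negatives is accepted, and recall is then measured at that threshold. A minimal sketch, assuming higher scores mean "more likely positive" (the function name and toy scores are hypothetical):<br />

```python
import math


def recall_at_fpr(pos_scores, neg_scores, max_fpr):
    """Recall on positives at the loosest threshold whose FPR is <= max_fpr."""
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of negatives we may wrongly accept (small epsilon guards
    # against floating-point error in the product).
    allowed_fp = math.floor(max_fpr * len(neg_sorted) + 1e-9)
    if allowed_fp >= len(neg_sorted):
        threshold = float("-inf")
    else:
        # Accept a sample only if its score is strictly above this value,
        # so at most `allowed_fp` negatives are accepted.
        threshold = neg_sorted[allowed_fp]
    return sum(1 for s in pos_scores if s > threshold) / len(pos_scores)


pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
print(recall_at_fpr(pos, neg, max_fpr=0.25))
print(recall_at_fpr(pos, neg, max_fpr=0.0))
```

Fixing the FPR first makes methods comparable in settings such as wake-word detection, where false alarms are the main cost.<br />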
<br />
[[File:ocln_2.jpg | center]]<br />
<div align="center">Figure 5: OCLN on Audio Commands.</div><br />
Wake word Detection: <br />
Finally, we evaluate DROCC–LF on the practical problem of wake word detection with low FPR against arbitrary OOD negatives. To this end, we identify a keyword, say “Marvin”, from the audio commands dataset (Warden, 2018) [8] as the positive class, and the remaining 34 keywords are labeled as the negative class. For training, we sample points uniformly at random from the above-mentioned dataset. For evaluation, however, we sample positives from the training distribution, while the negatives also contain a few challenging OOD points. Sampling challenging negatives is itself a hard task and is the key motivation for studying the problem. So, we manually list keywords close to Marvin, such as Mar, Vin, and Marvelous. We then generate audio snippets for these keywords via a speech synthesis tool with a variety of accents.<br />
Figure 5 shows that for the 3% and 5% FPR settings, DROCC–LF is significantly more accurate than the baselines. For example, with FPR = 3%, DROCC–LF is 10% more accurate than the baselines. We repeated the same experiment with the keyword “Seven” and observed a similar trend. In summary, DROCC–LF is able to generalize well against negatives that are “close” to the true positives even when such negatives were not supplied with the training data.<br />
<br />
== Conclusion and Future Work ==<br />
We introduced the DROCC method for deep anomaly detection. It models normal data points using a low-dimensional manifold, and hence can compare close points via Euclidean distance. Based on this intuition, DROCC’s optimization is formulated as a saddle point problem, which is solved via a standard gradient descent-ascent algorithm. We then extended DROCC to the OCLN problem, where the goal is to generalize well against arbitrary negatives, assuming the positive class is well sampled and a small number of negative points are also available. Both methods perform significantly better than strong baselines in their respective problem settings. <br />
<br />
For computational efficiency, we simplified the projection set for both the methods which can perhaps slow down the convergence of the two methods. Designing optimization algorithms that can work with the stricter set is an exciting research direction. Further, we would also like to rigorously analyze DROCC, assuming enough samples from a low-curvature manifold. Finally, as OCLN is an exciting problem that routinely comes up in a variety of real-world applications, we would like to apply DROCC–LF to a few high impact scenarios.<br />
<br />
The results of this study showed that DROCC is comparatively better for anomaly detection across many different areas, such as tabular data, images, audio, and time series, when compared to existing state-of-the-art techniques.<br />
<br />
It would be interesting to see how the DROCC method performs in situations where the anomaly is very rare, for example detecting signals of a volcanic eruption from seismic activity data. Such challenging anomalous situations would be a demanding test for this method and could even help advance work in this area.<br />
<br />
== References ==<br />
[1]: Golan, I. and El-Yaniv, R. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.<br />
<br />
[2]: Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning (ICML), 2018.<br />
<br />
[3]: Aggarwal, C. C. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.<br />
<br />
[4]: Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, 1999.<br />
<br />
[5]: Tax, D. M. and Duin, R. P. Support vector data description. Machine Learning, 54(1), 2004.<br />
<br />
[6]: Pless, R. and Souvenir, R. A survey of manifold learning for images. IPSJ Transactions on Computer Vision and Applications, 1, 2009.<br />
<br />
[7]: Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019a.<br />
<br />
[8]: Warden, P. Speech commands: A dataset for limited vocabulary speech recognition, 2018. URL https://arxiv.org/abs/1804.03209.<br />
<br />
== Critiques/Insights ==<br />
<br />
1. It would be interesting to see this implemented in self driving cars, for instance to detect unusual road conditions.<br />
<br />
2. Figure 1 gives a good representation of how this model works. However, how can we know that this model is not prone to overfitting? There are many situations where valid points lie off the manifold, especially new data that the model has never seen before. An explanation of how this is avoided would be good.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan&diff=46130User:Yktan2020-11-23T21:18:44Z<p>Hhalim: /* Results */ Duplicate sentence.</p>
<hr />
<div>== Presented by == <br />
Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew<br />
<br />
== Introduction ==<br />
<br />
Much of the success in training deep neural networks (DNNs) is thanks to the collection of large datasets with human annotated labels. However, human annotation is both a time-consuming and expensive task, especially for data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators.<br />
<br />
There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples.<br />
<br />
This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are:<br />
1) Co-divide, which trains two networks simultaneously, with the aim of improving generalization and avoiding confirmation bias.<br />
2) During SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp).<br />
3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.<br />
<br />
== Motivation ==<br />
<br />
While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously. <br />
<br />
Existing LNL methods aim to correct the loss function by:<br />
<ol><br />
<li> Treating all samples equally and correcting loss explicitly or implicitly through relabeling of the noisy samples<br />
<li> Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function<br />
</ol><br />
<br />
A few examples of LNL methods include:<br />
<ol><br />
<li> Estimating the noise transition matrix to correct the loss function<br />
<li> Leveraging DNNs to correct labels and using them to modify the loss<br />
<li> Reweighting samples so that noisy labels contribute less to the loss<br />
</ol><br />
<br />
However, these methods each have some downsides. For example, it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; for the third method, we need to be able to identify clean samples, which has also proven to be challenging.<br />
<br />
On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch incorporates the two classes of regularization – consistency regularization and entropy minimization as well as MixUp regularization. <br />
<br />
DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (SSL) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples and performing well under high levels of noise.<br />
<br />
== Model Architecture ==<br />
<br />
DivideMix leverages semi-supervised learning to achieve effective modelling. The training set is first split into a labelled set and an unlabeled set by fitting a Gaussian Mixture Model to the per-sample loss distribution. The unlabeled set is made up of data points whose labels are deemed noisy and are discarded. Then, to avoid confirmation bias, which is typical when a model is self-training, two networks are trained simultaneously so that each filters errors for the other: the data is divided using one network and then used to train the other network. This algorithm, known as Co-divide, keeps the two networks diverged from each other during training, which prevents the bias from accumulating. Figure 1 describes the algorithm in graphical form.<br />
<br />
[[File:ModelArchitecture.PNG | center]]<br />
<br />
<div align="center">Figure 1: Model Architecture of DivideMix</div><br />
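<br />
The dataset split can be sketched as follows: fit a two-component Gaussian mixture to the per-sample losses, treat the component with the smaller mean as "clean", and threshold its posterior probability. This is a minimal pure-Python EM sketch under those assumptions, not the authors' code; the function names and toy losses are hypothetical. In Co-divide, the split computed from one network's losses is used to train the other network.<br />

```python
import math


def clean_probabilities(losses, iters=100):
    """Posterior probability that each sample belongs to the low-loss
    component of a two-component 1-D Gaussian mixture fitted with EM."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in losses]
    for _ in range(iters):
        # E-step in log space to avoid numerical underflow.
        for i, x in enumerate(losses):
            logs = [
                math.log(max(weights[k], 1e-300))
                - 0.5 * math.log(2 * math.pi * variances[k])
                - (x - means[k]) ** 2 / (2 * variances[k])
                for k in range(2)
            ]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            total = sum(exps)
            resp[i] = [e / total for e in exps]
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)
    clean = 0 if means[0] <= means[1] else 1
    return [r[clean] for r in resp]


def co_divide(losses, tau=0.5):
    """Split sample indices into (labelled, unlabelled) by clean probability."""
    w = clean_probabilities(losses)
    labelled = [i for i, wi in enumerate(w) if wi > tau]
    unlabelled = [i for i, wi in enumerate(w) if wi <= tau]
    return labelled, unlabelled, w


losses = [0.05, 0.10, 0.08, 0.12, 2.0, 2.3, 1.9]  # toy per-sample losses
labelled, unlabelled, w = co_divide(losses, tau=0.5)
print(labelled, unlabelled)
```

The clean probabilities returned here play the role of the weights <math> w_b </math> used in co-refinement.<br />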
<br />
For each epoch, each network divides the dataset into a labelled set consisting of clean data and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of labelled samples, co-refinement is performed by combining the ground-truth label <math> y_b </math> with the predicted label <math> p_b </math>, using the posterior clean probability as the weight <math> w_b </math>. <br />
<br />
<center><math> \bar{y}_b = w_b y_b + (1-w_b) p_b </math></center> <br />
<br />
Then, a sharpening function is implemented on this weighted sum to produce the estimate, <math> \hat{y}_b </math>. Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, <math> \hat{X} </math> and unlabeled data, <math> \hat{U} </math>, where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples <math> (x_1,x_2) </math> and their labels <math> (p_1,p_2) </math>, the mixed sample <math> (x',p') </math> is:<br />
<br />
<center><br />
<math><br />
\begin{alignat}{2}<br />
\lambda &\sim \text{Beta}(\alpha, \alpha) \\<br />
\lambda ' &= \max(\lambda, 1 - \lambda) \\<br />
x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\<br />
p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2<br />
\end{alignat}<br />
</math><br />
</center> <br />
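<br />
The label co-refinement, sharpening, and mixing steps can be sketched as below. The temperature form of sharpening (raise each probability to the power 1/T and renormalise) is an assumption, since the text above only names "a sharpening function"; all function names and toy values are hypothetical.<br />

```python
import random


def co_refine(y, p, w):
    """Weighted sum w*y + (1-w)*p of the given label y and the prediction p."""
    return [w * yi + (1 - w) * pi for yi, pi in zip(y, p)]


def sharpen(probs, T):
    """Temperature sharpening (assumed form): p_c^(1/T), renormalised."""
    powered = [q ** (1.0 / T) for q in probs]
    total = sum(powered)
    return [q / total for q in powered]


def mix(x1, p1, x2, p2, alpha, rng=random):
    """Mix two samples and their labels; lam >= 0.5 keeps the result
    closer to the first sample."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    p = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
    return x, p, lam


refined = co_refine([1.0, 0.0], [0.2, 0.8], w=0.5)
print(sharpen(refined, T=0.5))
print(mix([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], alpha=4.0,
          rng=random.Random(0)))
```

Taking the maximum of <math> \lambda </math> and <math> 1 - \lambda </math> ensures that the mixed sample stays closer to the first input, so the labelled and unlabelled batches keep their identity after mixing.<br />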
<br />
MixMatch transforms <math> \hat{X} </math> and <math> \hat{U} </math> into <math> X' </math> and <math> U' </math>. Then, the loss on <math> X' </math>, <math> L_X </math> (Cross-entropy loss) and the loss on <math> U' </math>, <math> L_U </math> (Mean Squared Error) are calculated. A regularization term, <math> L_{reg} </math>, is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:<br />
<br />
<center><math> L = L_X + \lambda_u L_U + \lambda_r L_{reg} </math></center> ,<br />
<br />
where <math> \lambda_r </math> is set to 1, and <math> \lambda_u </math> is used to control the unsupervised loss.<br />
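<br />
The total loss can be assembled as in the sketch below. The cross-entropy and mean squared error terms follow the description above; the exact form of <math> L_{reg} </math> is not given in this summary, so the penalty used here (the divergence between a uniform class prior and the batch-averaged prediction) is an assumption, as are all function names.<br />

```python
import math


def cross_entropy(targets, preds, eps=1e-12):
    """Mean over samples of -sum_c t_c * log(p_c)."""
    return -sum(
        t * math.log(p + eps)
        for tv, pv in zip(targets, preds)
        for t, p in zip(tv, pv)
    ) / len(targets)


def mean_squared_error(targets, preds):
    """Mean over samples of the squared L2 distance between t and p."""
    return sum(
        (t - p) ** 2
        for tv, pv in zip(targets, preds)
        for t, p in zip(tv, pv)
    ) / len(targets)


def uniform_prior_penalty(preds, eps=1e-12):
    """Assumed L_reg: sum_c pi_c * log(pi_c / mean prediction for class c),
    with a uniform prior pi_c = 1 / C."""
    n_classes = len(preds[0])
    prior = 1.0 / n_classes
    mean_pred = [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]
    return sum(prior * math.log(prior / (m + eps)) for m in mean_pred)


def total_loss(x_targets, x_preds, u_targets, u_preds, lambda_u, lambda_r=1.0):
    l_x = cross_entropy(x_targets, x_preds)
    l_u = mean_squared_error(u_targets, u_preds)
    l_reg = uniform_prior_penalty(x_preds + u_preds)
    return l_x + lambda_u * l_u + lambda_r * l_reg


print(total_loss([[1.0, 0.0]], [[0.5, 0.5]], [[1.0, 0.0]], [[0.5, 0.5]], lambda_u=2.0))
```

The penalty is zero when the average prediction over the mini-batch is uniform and grows when the model assigns nearly all samples to a single class.<br />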
<br />
Lastly, the model parameters <math> \boldsymbol{ \theta } </math> are updated by stochastic gradient descent on the calculated loss <math> L </math>.<br />
<br />
== Results ==<br />
'''Applications'''<br />
<br />
Four datasets are used: CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009) (both contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a).<br />
Two types of label noise are used in the experiments: symmetric and asymmetric.<br />
An 18-layer PreAct ResNet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate is set to 0.02 and is reduced by a factor of 10 after 150 epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, the same hyperparameters M = 2, T = 0.5, and α = 4 are used. τ is set to 0.5, except for the 90% noise ratio, where it is set to 0.6.<br />
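<br />
The CIFAR training settings listed above can be collected into a small configuration with a step learning-rate schedule. The dictionary keys and function name are hypothetical, and the sketch assumes epochs are counted from 0 so that the drop takes effect at epoch index 150.<br />

```python
# Hyperparameters as reported in the text; key names are illustrative only.
CIFAR_CONFIG = {
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "batch_size": 128,
    "epochs": 300,
    "base_lr": 0.02,
    "lr_drop_epoch": 150,
    "lr_drop_factor": 10.0,
    "warmup_epochs": {"cifar10": 10, "cifar100": 30},
    "M": 2,
    "T": 0.5,
    "alpha": 4,
    "tau": 0.5,  # 0.6 for the 90% noise ratio
}


def learning_rate(epoch, cfg=CIFAR_CONFIG):
    """Step schedule: base rate, divided by the drop factor from the drop epoch on."""
    if epoch >= cfg["lr_drop_epoch"]:
        return cfg["base_lr"] / cfg["lr_drop_factor"]
    return cfg["base_lr"]


print(learning_rate(0), learning_rate(149), learning_rate(150))
```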
<br />
<br />
'''Comparison of State-of-the-Art Methods'''<br />
<br />
The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: <br />
Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; <br />
Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters;<br />
M-correction (Arazo et al., 2019) models the sample loss with a BMM and applies MixUp.<br />
The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:<br />
<br />
<br />
[[File:divideMixtable1.PNG | center]]<br />
<br />
From Table 1, the authors note that none of the baseline methods consistently outperforms the others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% in accuracy) for the more challenging CIFAR-100 with high noise ratios.<br />
<br />
DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy. <br />
<br />
<br />
'''Ablation Study'''<br />
<br />
The effect of removing different components is studied to provide insights into what makes DivideMix successful. The results are shown in Table 5.<br />
<br />
<br />
[[File:DivideMixtable5.PNG | center]]<br />
<br />
The authors find that both label refinement and input augmentation are beneficial for DivideMix.<br />
<br />
== Conclusion ==<br />
<br />
This paper provides a new and effective algorithm for learning with noisy labels by leveraging SSL. The DivideMix method trains two networks simultaneously and uses label co-refinement and co-guessing effectively, which makes it a robust approach to dealing with noise in datasets. DivideMix has also been tested on various datasets, with results consistently among the best when compared to other advanced methods.<br />
<br />
Future work on DivideMix includes adapting it to other applications such as Natural Language Processing and further incorporating ideas from SSL and LNL into the DivideMix architecture.<br />
<br />
== References ==<br />
Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised<br />
label noise modeling and loss correction. In ICML, pp. 312–321, 2019.<br />
<br />
David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin<br />
Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.<br />
<br />
Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach<br />
to learning from noisy labels. In WACV, pp. 1215–1224, 2018.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN&diff=46128Neural Speed Reading via Skim-RNN2020-11-23T20:58:28Z<p>Hhalim: /* Critiques */</p>
<hr />
<div>== Group ==<br />
<br />
Mingyan Dai, Jerry Huang, Daniel Jiang<br />
<br />
== Introduction ==<br />
<br />
In Natural Language Processing, recurrent neural networks (RNNs) are a common architecture used to sequentially ‘read’ input tokens and output a distributed representation for each token. Because it updates its hidden state in the same way for every token, an RNN inherently incurs the same computational cost at every time step. However, when processing input tokens, it is usually the case that some tokens are less important to the overall representation of a piece of text or a query than others. In particular, in question answering, the neural network will often encounter parts of a passage that are irrelevant to answering the query being asked. <br />
<br />
== Model ==<br />
<br />
In this paper, the authors introduce a model called 'skim-RNN', which takes advantage of ‘skimming’ less important tokens or pieces of text rather than ‘skipping’ them entirely. This models the human ability to skim through passages, that is, to spend less time reading parts that do not affect the reader’s main objective. While this leads to a loss in the comprehension rate of the text [1], it greatly reduces the amount of time spent reading by not focusing on areas that do not significantly affect the reader's objective.<br />
<br />
'Skim-RNN' works by rapidly determining the significance of each input and spending less time processing unimportant input tokens by using a smaller RNN to update only a fraction of the hidden state. When the decision is to ‘fully read’, that is, to not skim the text, Skim-RNN updates the entire hidden state with the default RNN cell. Since the hard decision function (‘skim’ or ‘read’) is non-differentiable, the authors use a gumbel-softmax [2] to estimate the gradient of the function, rather than traditional methods such as REINFORCE (policy gradient) [3]. The switching mechanism between the two RNN cells enables Skim-RNN to reduce the total number of float operations (Flop reduction, or Flop-R). When the skimming rate is high, this often leads to faster inference on CPUs, which makes it very useful for large-scale products and small devices.<br />
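<br />
The gumbel-softmax relaxation mentioned above replaces the hard ‘read’/‘skim’ choice with a smooth sample during training. A minimal sketch of one such sample follows; the function name and temperature values are hypothetical, and no gradient computation is shown.<br />

```python
import math
import random


def gumbel_softmax_sample(log_probs, temperature, rng=random):
    """Draw a relaxed one-hot sample over the decisions.

    Gumbel noise is added to each log-probability and a softmax with the
    given temperature is applied; lower temperatures give samples closer
    to a hard one-hot decision.
    """
    noisy = []
    for lp in log_probs:
        u = max(rng.random(), 1e-12)           # avoid log(0)
        g = -math.log(-math.log(u))            # Gumbel(0, 1) noise
        noisy.append((lp + g) / temperature)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


# Two choices: index 0 = 'read', index 1 = 'skim'.
log_p = [math.log(0.7), math.log(0.3)]
print(gumbel_softmax_sample(log_p, temperature=1.0, rng=random.Random(0)))
print(gumbel_softmax_sample(log_p, temperature=0.1, rng=random.Random(0)))
```

At inference time no relaxation is needed; the model can simply take the more probable decision.<br />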
<br />
The Skim-RNN has the same input and output interfaces as standard RNNs, so it can be conveniently used to speed up RNNs in existing models. In addition, the speed of Skim-RNN can be dynamically controlled at inference time by adjusting a parameter for the threshold for the ‘skim’ decision.<br />
<br />
=== Implementation ===<br />
<br />
A Skim-RNN consists of two RNN cells, a default (big) RNN cell of hidden state size <math>d</math> and a small RNN cell of hidden state size <math>d'</math>, where <math>d</math> and <math>d'</math> are parameters defined by the user and <math>d' \ll d</math>. This follows from the fact that there should be a small RNN cell for when text is meant to be skimmed and a larger one for when the text should be processed as normal.<br />
<br />
Each RNN cell has its own weights and biases and can be any variant of an RNN. There is no requirement on how the RNN itself is structured; rather, the core concept is to allow the model to decide dynamically which cell to use when processing each input token. Note that skipping text can be incorporated by setting <math>d'</math> to 0, which means that when the input token is deemed irrelevant to a query or classification task, nothing about the information in the token is retained within the model.<br />
<br />
Experimental results suggest that this model is faster than using a single large RNN to process all input tokens, as the smaller RNN requires fewer floating point operations to process a token. In the reported experiments, accuracy is similar to, and in some cases higher than, that of the standard model. <br />
<br />
==== Inference ====<br />
<br />
At each time step <math>t</math>, the Skim-RNN unit takes in an input <math>{\bf x}_t \in \mathbb{R}^d</math> as well as the previous hidden state <math>{\bf h}_{t-1} \in \mathbb{R}^d</math> and outputs the new state <math>{\bf h}_t </math> (although the dimensions of the hidden state and input are the same, this process holds for different sizes as well). In the Skim-RNN, there is a hard decision that needs to be made whether to read or skim the input, although there could be potential to include options for multiple levels of skimming.<br />
<br />
The decision to read or skim is done using a multinomial random variable <math>Q_t</math> over the probability distribution of choices <math>{\bf p}_t</math>, where<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>{\bf p}_t = \text{softmax}(\alpha({\bf x}_t, {\bf h}_{t-1})) = \text{softmax}({\bf W}[{\bf x}_t; {\bf h}_{t-1}]+{\bf b}) \in \mathbb{R}^k</math><br />
</div><br />
<br />
where <math>{\bf W} \in \mathbb{R}^{k \times 2d}</math>, <math>{\bf b} \in \mathbb{R}^{k}</math> are weights to be learned and <math>[{\bf x}_t; {\bf h}_{t-1}] \in \mathbb{R}^{2d}</math> indicates the row concatenation of the two vectors. In this case <math> \alpha </math> can have any form as long as the complexity of calculating it is less than <math> O(d^2)</math>. Letting <math>{\bf p}^1_t</math> indicate the probability for fully reading and <math>{\bf p}^2_t</math> indicate the probability for skimming the input at time <math> t</math>, it follows that the decision to read or skim can be modelled using a random variable <math> Q_t</math> by sampling from the distribution <math>{\bf p}_t</math> and<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math>Q_t \sim \text{Multinomial}({\bf p}_t)</math><br />
</div><br />
<br />
Without loss of generality, we can define <math> Q_t = 1</math> to indicate that the input will be read while <math> Q_t = 2</math> indicates that it will be skimmed. Reading requires applying the full RNN on the input as well as the previous hidden state to modify the entire hidden state, while skimming only modifies part of the prior hidden state.<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf h}_t = \begin{cases}<br />
f({\bf x}_t, {\bf h}_{t-1}) & Q_t = 1\\<br />
[f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)] & Q_t = 2<br />
\end{cases}<br />
</math><br />
</div><br />
<br />
where <math> f </math> is a full RNN with output of dimension <math>d</math> and <math>f'</math> is a smaller RNN with <math>d'</math>-dimensional output. This has the advantage that when the model decides to skim, the computational complexity of that step is only <math>O(d'd)</math>, which is much smaller than <math>O(d^2)</math> because <math> d' \ll d</math>.<br />
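<br />
The read/skim update above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: a vanilla tanh cell stands in for both <math>f</math> and <math>f'</math>, the decision function <math>\alpha</math> is the linear form given earlier, and the names <code>skim_step</code>, <code>rnn_cell</code>, and <code>random_params</code> are hypothetical helpers, not the authors' code.<br />
<br />
```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]

def rnn_cell(W, b, x, h):
    """Vanilla tanh cell (a stand-in for f or f'); output size = number of rows of W."""
    pre = matvec(W, x + h)  # x + h concatenates the two lists: [x; h]
    return [math.tanh(p + b_i) for p, b_i in zip(pre, b)]

def skim_step(x, h_prev, params, rng):
    """One Skim-RNN step. Returns (h_t, Q_t), with Q_t = 1 for read and 2 for skim."""
    logits = [z + b for z, b in zip(matvec(params["W_dec"], x + h_prev), params["b_dec"])]
    p = softmax(logits)                   # p[0] = P(read), p[1] = P(skim)
    q = 1 if rng.random() < p[0] else 2   # sample Q_t from p_t (k = 2)
    if q == 1:
        return rnn_cell(params["W_big"], params["b_big"], x, h_prev), q
    d_small = len(params["W_small"])
    h_small = rnn_cell(params["W_small"], params["b_small"], x, h_prev)
    return h_small + h_prev[d_small:], q  # dimensions d'+1..d are copied unchanged

def random_params(d, d_small, rng):
    """Hypothetical random initialisation, for illustration only."""
    def mat(rows):
        return [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(rows)]
    return {"W_dec": mat(2), "b_dec": [0.0, 0.0],
            "W_big": mat(d), "b_big": [0.0] * d,
            "W_small": mat(d_small), "b_small": [0.0] * d_small}
```
<br />
When the step skims, only the first <math>d'</math> entries of the hidden state change, which is where the <math>O(d'd)</math> cost comes from.<br />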
<br />
==== Training ====<br />
<br />
Since the loss of the model is a random variable that depends on the sequence of random variables <math> \{Q_t\} </math>, the expected loss is minimized with respect to the distribution of these variables. Define the loss conditioned on a particular sequence of decisions as<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L(\theta\vert Q)<br />
</math><br />
</div><br />
where <math>Q=Q_1\dots Q_T</math> is a sequence of decisions of length <math>T</math>. The expected loss over the distribution of the sequence of decisions is then<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
\mathbb{E}[L(\theta)] = \sum_{Q} L(\theta\vert Q)P(Q) = \sum_Q L(\theta\vert Q) \prod_j {\bf p}_j^{Q_j}<br />
</math><br />
</div><br />
<br />
Since calculating <math>\nabla \mathbb{E}_{Q_t}[L(\theta)]</math> directly is infeasible, the gradients can be approximated with a Gumbel-softmax distribution [2]. By reparameterizing <math> {\bf p}_t</math> as <math> {\bf r}_t</math>, back-propagation can flow to <math> {\bf p}_t</math> without being blocked by the stochastic variable <math> Q_t</math>, and the approximation can approach <math> Q_t</math> arbitrarily closely by controlling the temperature parameter. The reparameterized distribution is therefore<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
{\bf r}_t^i = \frac{\exp((\log({\bf p}_t^i) + {g_t}^i)/\tau)}{\sum_j\exp((\log({\bf p}_t^j) + {g_t}^j)/\tau)}<br />
</math><br />
</div><br />
<br />
where <math>{g_t}^i</math> is an independent sample from a <math>\text{Gumbel}(0, 1) = -\log(-\log(\text{Uniform}(0, 1)))</math> random variable and <math>\tau</math> is a temperature parameter. The hidden state can then be rewritten as<br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
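{\bf h}_t = \sum_i {\bf r}_t^i {\bf \tilde{h}}_t^i<br />
</math><br />
</div><br />
<br />
where <math>{\bf \tilde{h}}_t^1 = f({\bf x}_t, {\bf h}_{t-1})</math> and <math>{\bf \tilde{h}}_t^2 = [f'({\bf x}_t, {\bf h}_{t-1});{\bf h}_{t-1}(d'+1:d)]</math> are the two candidate hidden states from the previous equation for <math>{\bf h}_t</math>. The temperature parameter <math>\tau</math> is gradually decreased during training, and <math>{\bf r}_t^i</math> becomes more discrete as <math>\tau</math> approaches 0.<br />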
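<br />
As a rough illustration of the Gumbel-softmax relaxation, the sketch below draws a relaxed sample <math>{\bf r}_t</math> from a probability vector and mixes the candidate hidden states with it. It assumes the standard form of the relaxation from [2], in which the Gumbel noise is added to the log-probabilities before dividing by the temperature; the function names are hypothetical.<br />
<br />
```python
import math
import random

def gumbel_softmax(p, tau, rng):
    """Relaxed sample r with r_i proportional to exp((log p_i + g_i) / tau)."""
    # g_i = -log(-log(u)) with u ~ Uniform(0, 1); u is kept away from 0 and 1
    # so that both logarithms are finite.
    g = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in p]
    logits = [(math.log(p_i) + g_i) / tau for p_i, g_i in zip(p, g)]
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    total = sum(e)
    return [x / total for x in e]

def soft_hidden(r, candidates):
    """h_t = sum_i r_i * candidate_i, mixing the 'read' and 'skim' hidden states."""
    d = len(candidates[0])
    return [sum(r_i * cand[j] for r_i, cand in zip(r, candidates)) for j in range(d)]
```
<br />
With a large temperature the sample is close to uniform, while a small temperature pushes it towards a one-hot vector, so the mixed state approaches the hard read/skim choice.<br />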
<br />
A final addition to the model is to encourage skimming when possible. To do so, an extra term, the negative log probability of skimming averaged over the sequence length, is added to the loss. The final loss function used for the model is therefore <br />
<br />
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><br />
<math><br />
L'(\theta) =L(\theta) + \gamma \cdot\frac{1}{T} \sum_{t=1}^{T} -\log({\bf p}^2_t)<br />
</math><br />
</div><br />
where <math> \gamma </math> is a parameter used to control the ratio between the main loss function and the negative log probability of skimming, and <math>{\bf p}^2_t</math> is the probability of skimming at time step <math>t</math>.<br />
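<br />
The effect of the extra term can be checked numerically with a small sketch. It assumes the penalty is the average over time steps of the negative log skim probability, as described above; <code>skim_regularized_loss</code> is a hypothetical name.<br />
<br />
```python
import math

def skim_regularized_loss(task_loss, skim_probs, gamma):
    """L' = L + gamma * (1/T) * sum_t -log(p_skim_t).

    skim_probs holds the probability of skimming at each of the T time steps.
    """
    T = len(skim_probs)
    penalty = sum(-math.log(p) for p in skim_probs) / T
    return task_loss + gamma * penalty
```
<br />
The penalty is zero when the model is certain to skim at every step and grows as the skim probabilities shrink, so a larger <math>\gamma</math> pushes the model towards skimming more often.<br />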
<br />
== Experiment ==<br />
<br />
The effectiveness of Skim-RNN was measured in terms of accuracy and float operation reduction on four classification tasks and a question answering task. These tasks were chosen because they do not require one’s full attention to every detail of the text, but rather ask for capturing the high-level information (classification) or focusing on a specific portion (QA) of the text, which is a common context for speed reading. The tasks themselves are listed in the table below.<br />
<br />
[[File:Table1SkimRNN.png|center|1000px]]<br />
<br />
=== Classification Tasks ===<br />
<br />
In a language classification task, the input is a sequence of words and the output is a vector of categorical probabilities. Each word is embedded into a <math>d</math>-dimensional vector. The vectors are initialized with GloVe [4] to form representations of the words, which are used as the inputs to a long short-term memory (LSTM) architecture. A linear transformation followed by a softmax function is applied to the last hidden state of the LSTM to obtain the classification probabilities. Adam [5] was used for optimization, with an initial learning rate of 0.0001. For Skim-LSTM, <math>\tau = \max(0.5, \exp(-rn))</math> where <math>r = 10^{-4}</math> and <math>n</math> is the global training step, following [2]. The experiments cover different sizes of the big LSTM (<math>d \in \{100, 200\}</math>) and the small LSTM (<math>d' \in \{5, 10, 20\}</math>), and different ratios between the model loss and the skim loss (<math>\gamma\in \{0.01, 0.02\}</math>) for Skim-LSTM. The batch sizes used were 32 for SST and Rotten Tomatoes, and 128 for the others. For all models, training was stopped early when the validation accuracy did not increase for 3000 global steps.<br />
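<br />
The temperature schedule quoted above is simple enough to write out directly. The sketch below only restates <math>\tau = \max(0.5, \exp(-rn))</math>; the function name and default arguments are illustrative.<br />
<br />
```python
import math

def temperature(step, rate=1e-4, floor=0.5):
    """Annealed Gumbel-softmax temperature: max(floor, exp(-rate * step))."""
    return max(floor, math.exp(-rate * step))
```
<br />
The temperature starts at 1 and decays exponentially until it reaches the floor of 0.5, which happens after roughly 6,900 global steps since <math>\ln 2 / 10^{-4} \approx 6931</math>.<br />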
<br />
==== Results ====<br />
<br />
[[File:Table2SkimRNN.png|center|1000px]]<br />
<br />
[[File:Figure2SkimRNN.png|center|1000px]]<br />
<br />
Table 2 shows the accuracy and the computational cost of the Skim-RNN model compared with other standard models. It is evident that the Skim-RNN model reduces the computational cost of the task while maintaining a high degree of accuracy. Figure 2, meanwhile, demonstrates the effect of varying the size of the small hidden state as well as the parameter <math>\gamma</math> on the accuracy and computational cost.<br />
<br />
[[File:Table3SkimRNN.png|center|1000px]]<br />
<br />
Table 3 shows an example of a classification task on the IMDb dataset, where Skim-RNN with <math>d = 200</math>, <math>d' = 10</math>, and <math>\gamma = 0.01</math> correctly classifies a review with a high skimming rate (92%). The goal was to classify the review as either positive or negative. The black words are skimmed, and the blue words are fully read. The skimmed words are clearly irrelevant, and the model learns to carefully read only the important words, such as ‘liked’, ‘dreadful’, and ‘tiresome’.<br />
<br />
=== Question Answering Task ===<br />
<br />
In the Stanford Question Answering Dataset (SQuAD), the task was to locate the answer span for a given question in a context paragraph. The effectiveness of Skim-RNN for SQuAD was evaluated using two different models: LSTM+Attention and BiDAF [6]. The first model was inspired by most then-present QA systems, which consist of multiple LSTM layers and an attention mechanism. This type of model is complex enough to reach reasonable accuracy on the dataset, and simple enough to run well-controlled analyses for the Skim-RNN. The second model was an open-source model designed for SQuAD, used primarily to show that Skim-RNN could replace RNN in existing complex systems.<br />
<br />
==== Training ==== <br />
<br />
Adam was used with an initial learning rate of 0.0005. For stable training, the model was pretrained with a standard LSTM for the first 5k steps, and then fine-tuned with Skim-LSTM.<br />
<br />
==== Results ====<br />
<br />
[[File:Table4SkimRNN.png|center|1000px]]<br />
<br />
Table 4 shows the accuracy (F1 and EM) of LSTM+Attention and Skim-LSTM+Attention models as well as VCRNN [7]. It can be observed from the table that the skimming models achieve higher or similar accuracy scores compared to the non-skimming models while also reducing the computational cost by more than 1.4 times. In addition, reducing the number of layers (1 layer) or the hidden size (<math>d=5</math>) lowered the computational cost but decreased the accuracy significantly compared to skimming. The table also shows that replacing LSTM with Skim-LSTM in an existing complex model (BiDAF) stably gives reduced computational cost without losing much accuracy (only a 0.2% drop, from 77.3% for BiDAF to 77.1% for Sk-BiDAF with <math>\gamma = 0.001</math>).<br />
<br />
Two trends can be observed. First, the LSTM in the second layer skims more than the LSTM in the first layer; an explanation given for this trend is that the model is more confident about which tokens are important at the second layer. Second, higher <math>\gamma</math> values lead to a higher skimming rate, which agrees with the intended functionality of <math>\gamma</math>.<br />
<br />
Figure 4 shows the F1 score of the LSTM+Attention model using standard LSTM and Skim-LSTM, sorted in ascending order by Flop-R (computational cost). While models tend to perform better with larger computational cost, Skim-LSTM (red) outperforms standard LSTM (blue) at comparable computational cost. It can also be seen that the F1 score of Skim-LSTM is more stable across different configurations and computational costs. Moreover, increasing the value of <math>\gamma</math> for Skim-LSTM gradually increases the skimming rate and Flop-R, while also reducing accuracy.<br />
<br />
=== Runtime Benchmark ===<br />
<br />
[[File:Figure6SkimRNN.png|center|1000px]]<br />
<br />
The runtime benchmarks for LSTM and Skim-LSTM, which are used to estimate the speed-up of Skim-LSTM-based models in the experiments, are also discussed. A CPU-based benchmark was assumed to be the default benchmark, since it correlates directly with the number of float operations that can be performed per second. The speed-up results in Table 2 (as well as Figure 7) are benchmarked using Python (NumPy), instead of popular frameworks such as TensorFlow or PyTorch.<br />
<br />
Figure 7 shows the relative speed gain of Skim-LSTM compared to standard LSTM with varying hidden state size and skim rate. NumPy was used, with the inferences run on a single CPU thread. The ratio between the number of float operations of LSTM and Skim-LSTM (Flop-R) was also plotted, with the ratio acting as a theoretical upper bound on the speed gain on CPUs. From here, it can be noticed that there is a gap between the actual gain and the theoretical gain in speed, with the gap being larger with more framework overhead or more parallelization. The gap also decreases as the hidden state size increases, because the overhead becomes negligible with very large matrix operations. This indicates that Skim-RNN provides greater benefits for RNNs with a larger hidden state size.<br />
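<br />
A back-of-the-envelope version of the theoretical speed gain can be sketched as follows. This is an idealized estimate and not the paper's exact Flop-R computation: it assumes a full step costs on the order of <math>d^2</math> operations and a skim step on the order of <math>d'd</math>, as stated in the Inference section, and it ignores the cost of the read/skim decision and any framework overhead.<br />
<br />
```python
def theoretical_flop_reduction(d, d_small, skim_rate):
    """Idealized ratio of full-RNN cost to expected Skim-RNN cost per step.

    Assumes cost(read) ~ d * d and cost(skim) ~ d_small * d; the cost of the
    read/skim decision itself is ignored.
    """
    full = d * d
    skim = d_small * d
    expected = (1.0 - skim_rate) * full + skim_rate * skim
    return full / expected
```
<br />
Under these assumptions, a model with <math>d = 100</math> and <math>d' = 10</math> that skims half of the tokens would need about 1.8 times fewer operations per step, and the bound approaches <math>d/d'</math> as the skim rate approaches 1.<br />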
<br />
== Results ==<br />
<br />
The results clearly indicate that the Skim-RNN model provides features that are suitable for general reading tasks, which include classification and question answering. While the tables indicate that losses in accuracy occasionally did result when parameters were set at specific values, they were minor and were acceptable given the improvement in runtime.<br />
<br />
An important advantage of Skim-RNN is that the skim rate (and thus computational cost) can be dynamically controlled at inference time by adjusting the threshold for the ‘skim’ decision probability <math>{\bf p}^2_t</math>. Figure 5 shows the trade-off between the accuracy and computational cost for two settings, confirming the importance of skimming (<math>d' > 0</math>) compared to skipping (<math>d' = 0</math>).<br />
<br />
Figure 6 shows that the model does not skim when the input seems to be relevant to answering the question, as expected from the design of the model. In addition, the LSTM in the second layer skims more than that in the first layer, mainly because the second layer is more confident about the importance of each token.<br />
<br />
== Conclusion ==<br />
<br />
A Skim-RNN can offer better latency results on a CPU compared to a standard RNN on a GPU, with lower computational cost, as demonstrated through the results of this study. Future work (as stated by the authors) involves using Skim-RNN for applications that require a much larger hidden state size, such as video understanding, and using multiple small RNN cells for varying degrees of skimming. Further, since it has the same input and output interface as a regular RNN, it can replace RNNs in existing applications.<br />
<br />
== Critiques ==<br />
<br />
1. Skim-RNN processes unimportant words with a smaller RNN instead of the full RNN, so it seems to increase speed only in some very particular circumstances (i.e., only small networks). The extra model complexity slowed the model down even as it tried to "optimize" efficiency, and part of the accuracy was sacrificed in doing so. The model targets only a very specific situation (classification/question answering), and comparisons were made only with the baseline LSTM model. It would be more persuasive if the model were compared with some state-of-the-art neural network models.<br />
<br />
2. Skim-RNN appears well suited to binary classification of text, so it would be interesting to apply it to the analysis of stock market news. For example, a press release from a company could be analyzed quickly using this model to immediately give the trader a positive or negative summary of the news. This would be beneficial in trading, since time and speed are important factors when executing a trade.<br />
<br />
== References ==<br />
<br />
[1] Marcel Adam Just and Patricia Anderson Carpenter. The Psychology of Reading and Language Comprehension. 1987.<br />
<br />
[2] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.<br />
<br />
[3] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.<br />
<br />
[4] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, 2014.<br />
<br />
[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.<br />
<br />
[6] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017a.<br />
<br />
[7] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. In ICLR, 2017.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence&diff=46119Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence2020-11-23T20:21:04Z<p>Hhalim: /* Neighbor-Aware Decoder */</p>
<hr />
<div>== Presented by == <br />
Guanting Pan, Zaiwei Zhang, Haocheng Chang<br />
<br />
== Introduction == <br />
With the development of mobile devices and location-acquisition technologies, accessing real-time location information is becoming easier and more efficient. Because of this development, Location-based Social Networks (LBSNs) have become an important part of people's lives. People can share their experiences at locations, such as restaurants and parks, on the Internet. Each of these locations can be seen as a Point-of-Interest (POI) in software such as the Maps app on our phones. The large amount of user-POI interaction data can support a service called personalized POI recommendation, which recommends locations that users might be interested in. The data can be used to train a machine learning model (e.g. classification, clustering, etc.) to predict POIs that users might be interested in. This paper introduces a novel autoencoder-based model, called SAE-NAD, to learn non-linear user-POI relations. SAE stands for self-attentive encoder, while NAD stands for neighbor-aware decoder. This method draws on machine learning knowledge that we learned in this course.<br />
<br />
== Previous Work == <br />
<br />
In previous works, the methods simply treat all of a user's checked-in POIs equally. In contrast, the SAE adaptively differentiates the user preference degrees in multiple aspects.<br />
<br />
There are other personalized POI recommendation methods that can be used. Some famous software (e.g. Netflix) uses model-based methods that are built on matrix factorization (MF). For example, the GeoMF method in [1] adopted weighted regularized MF for POI recommendation. Machine learning is therefore popular in this area. POI recommendation is an important topic in the domain of recommender systems [4]. The paper also describes related work on personalized location recommendation and on attention mechanisms in recommendation.<br />
<br />
== Motivation == <br />
The paper first reviews the autoencoder. A single hidden-layer autoencoder is an unsupervised neural network consisting of two parts: an encoder and a decoder. Its formula is:<br />
<br />
[[File: formula.png|center]](Note: a is the activation function)<br />
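<br />
For readers without access to the image, the forward pass of a single hidden-layer autoencoder can be sketched as below. This assumes the usual textbook form, in which the encoder computes <math>h = a(W^1 x + b^1)</math> and the decoder computes <math>\hat{x} = a(W^2 h + b^2)</math>; the exact notation in the paper's formula may differ, and the sigmoid is only one possible choice for the activation <math>a</math>.<br />
<br />
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, v, activation):
    """One fully connected layer: activation(W v + b), with W given as a list of rows."""
    return [activation(sum(w * x for w, x in zip(row, v)) + b_i)
            for row, b_i in zip(W, b)]

def autoencoder_forward(x, W1, b1, W2, b2, activation=sigmoid):
    """Encode x into a hidden vector, then decode it back to the input dimension."""
    hidden = dense(W1, b1, x, activation)               # encoder
    reconstruction = dense(W2, b2, hidden, activation)  # decoder
    return hidden, reconstruction
```
<br />
For POI recommendation, the input would be a user's binary check-in vector and the reconstruction would be read as predicted preference scores over all POIs.<br />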
<br />
The proposed method uses a two-layer neural network to compute the score matrix in the architecture of the SAE. The NAD adopts the RBF kernel to make checked-in POIs exert more influence on nearby unvisited POIs. The model is then fit through network training.<br />
<br />
This paper uses real-world datasets from Gowalla [2], Foursquare [3], and Yelp [3]. These datasets are used to train the model with the method introduced in this paper and to compare the performance of SAE-NAD with other POI recommendation methods. Three groups of methods are compared with the proposed method: traditional MF methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Specifically, the deep learning-based methods include DeepAE, a three-hidden-layer autoencoder with a weighted loss function, which connects to the material in this course.<br />
<br />
== Methodology == <br />
<br />
=== Notations ===<br />
<br />
Here are the notations used in this paper. They are helpful for understanding the structure and equations of the algorithm.<br />
[[File:notations.JPG|500px|x300px|center]]<br />
<br />
=== Structure ===<br />
<br />
The structure of the network in this paper includes a self-attentive encoder as the input layer (yellow), and a neighbor-aware decoder as the output layer (green).<br />
<br />
[[File:1.JPG|1200px|x600px]]<br />
<br />
=== Self-Attentive Encoder ===<br />
<br />
The self-attentive encoder is the input layer. It transforms the preference vector <math>x_u</math> into the hidden representation <math>A_u</math> using the weight matrix <math>W^1</math> and the softmax and tanh activation functions. The 0's and 1's in <math>x_u</math> indicate whether the user has been to a certain POI. The weight matrix <math>W_a</math> assigns different weights to the various features of POIs.<br />
<br />
[[File:encoder.JPG|center]]<br />
<br />
=== Neighbor-Aware Decoder ===<br />
<br />
POI recommendation uses the geographical clustering phenomenon, which increases the weight of the unvisited POIs that surround the visited POIs. The idea is that a person who has visited a location is very likely to return to the same area in the future, so the user is recommended POIs surrounding it. For example, someone who has been to the UW plaza and bought Lazeez is very likely to return to the plaza, so that person is recommended to try Mr. Panino's Beijing House. In addition, an aggregation layer is added to the network to aggregate users’ representations from different aspects into one aspect.<br />
<br />
[[File:decoder.JPG|center]]<br />
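<br />
The neighbor-aware idea can be illustrated with a small sketch of an RBF kernel over POI locations. This is not the decoder from the paper: it assumes a Gaussian kernel <math>\exp(-\gamma \lVert l_i - l_j \rVert^2)</math> on planar coordinates and simply sums the influence of the visited POIs on each candidate; the function names, the coordinates, and the value of <math>\gamma</math> are all hypothetical.<br />
<br />
```python
import math

def rbf_kernel(loc_a, loc_b, gamma):
    """Gaussian (RBF) similarity between two locations given as coordinate tuples."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(loc_a, loc_b))
    return math.exp(-gamma * squared_distance)

def neighbor_influence(visited, candidates, gamma=1.0):
    """Total influence exerted by the visited POIs on each unvisited candidate POI."""
    return {name: sum(rbf_kernel(loc, v, gamma) for v in visited)
            for name, loc in candidates.items()}
```
<br />
Because the kernel decays quickly with distance, candidates close to a visited POI receive a much larger boost than distant ones, which matches the geographical clustering intuition described above.<br />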
<br />
=== Objective Function ===<br />
<br />
The objective function is minimized by gradient descent, with the partial derivatives with respect to all the parameters computed by backpropagation. After that, the training is complete.<br />
<br />
[[File:objective_function.JPG|center]]<br />
<br />
<br />
== Comparative analysis ==<br />
<br />
=== Metrics introduction ===<br />
To obtain a comprehensive evaluation of the effectiveness of the model, the authors performed a thorough comparison between the proposed model and the existing major POI recommendation methods. These methods can be further broken down into three categories: traditional matrix factorization methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Three key evaluation metrics were used: Precision@k, Recall@k, and MAP@k. Comparing all models on three datasets using these metrics, the authors conclude that the proposed model achieved the best performance.<br />
<br />
To better understand the comparison results, it is important to understand the meaning of each evaluation metric. Suppose the proposed model generates k recommended POIs for the user. The first metric, Precision@k, measures the percentage of the recommended POIs that the user has visited. Recall@k is also based on the user’s behaviour, but it measures the percentage of all POIs visited by the user that appear among the recommended POIs. Lastly, MAP@k is the mean average precision at k, where average precision is the average of the precision values at the ranks, up to k, at which relevant POIs are found.<br />
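<br />
The three metrics can be made concrete with a short sketch. The definitions follow the descriptions above; note that average precision has several normalization conventions in the literature, and this sketch divides by the number of relevant POIs found in the top k, which is an assumption rather than the paper's stated choice.<br />
<br />
```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended POIs that the user actually visited."""
    return sum(1 for poi in recommended[:k] if poi in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's visited POIs that appear in the top-k recommendations."""
    return sum(1 for poi in recommended[:k] if poi in relevant) / len(relevant)

def average_precision_at_k(recommended, relevant, k):
    """Mean of the precision values at each rank (up to k) where a relevant POI occurs."""
    hits = 0
    total = 0.0
    for rank, poi in enumerate(recommended[:k], start=1):
        if poi in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def map_at_k(users, k):
    """MAP@k: average of AP@k over (recommended, relevant) pairs, one pair per user."""
    return sum(average_precision_at_k(rec, rel, k) for rec, rel in users) / len(users)
```
<br />
For a user who visited POIs a, c, and f and is recommended a, b, c, d, e in that order, Precision@5 is 2/5, Recall@5 is 2/3, and the average precision is (1/1 + 2/3)/2.<br />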
<br />
=== Model Comparison ===<br />
Among all models in the comparison group, RankGeoFM, IRNMF, and PACE produced the best results. Nonetheless, these models still fall short of the proposed model. The reasons are explained in detail as follows:<br />
<br />
Both RankGeoFM and IRNMF incorporate geographical influence into their ranking models, which is significant for generating POI recommendations. However, they are not capable of capturing non-linear interactions between users and POIs. In comparison, the proposed model, while incorporating geographical influence, adopts a deep neural structure which enables it to measure non-linear and complex interactions. As a result, it outperforms the two methods in the comparison group.<br />
<br />
Moreover, compared to PACE, which is a deep learning-based method, the proposed model offers a more precise measurement on geographical influence. Though PACE is able to capture complex interactions, it models the geographical influence by a context graph, which fails to incorporate user reachability into the modeling process. In contrast, the proposed model is able to capture geographical influence directly through its neighbour-aware decoder, which allows it to achieve better performance than the PACE model.<br />
<br />
[[File:model_comparison.JPG|center]]<br />
<br />
== Conclusion ==<br />
In summary, the proposed model, namely SAE-NAD, clearly showed its advantages compared to many state-of-the-art baseline methods. Its self-attentive encoder effectively discriminates user preferences on check-in POIs, and its neighbour-aware decoder measures geographical influence precisely through differentiating user reachability on unvisited POIs. By leveraging these two components together, it is able to generate recommendations that are highly relevant to its users.<br />
<br />
<br />
== Critiques ==<br />
Besides developing the model and conducting detailed analysis, the authors also did very well in constructing this paper. The paper is well-written and has a highly logical structure. Definitions, notations, and metrics are introduced and explained clearly, which enables readers to follow through the analysis easily. Last but not least, both the abstract and the conclusion of this paper are strong. The abstract concisely reported the objectives and outcomes of the experiment, whereas the conclusion is succinct and precise.<br />
<br />
<br />
== References ==<br />
[1] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. 2014. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In KDD. ACM, 831–840.<br />
<br />
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In KDD. ACM, 1082–1090.<br />
<br />
[3] Yiding Liu, Tuan-Anh Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. PVLDB 10, 10 (2017), 1010–1021.<br />
<br />
[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed F. Mokbel. 2015. Recommendations in location-based social networks: a survey. GeoInformatica 19, 3 (2015), 525–565.<br />
<br />
[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.<br />
<br />
[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263–272.<br />
<br />
[7] Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In KDD. ACM, 659–667. [12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).<br />
<br />
[8] Yong Liu, Wei Wei, Aixin Sun, and Chunyan Miao. 2014. Exploiting Geographical Neighborhood Characteristics for Location Recommendation. In CIKM. ACM, 739–748.<br />
<br />
[9] Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham, and Shonali Krishnaswamy. 2015. Rank-GeoFM: A Ranking based Geographical Factorization Method for Point of Interest Recommendation. In SIGIR. ACM, 433–442.<br />
<br />
[10] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. 2017. Bridging<br />
Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for<br />
POI Recommendation. In KDD. ACM, 1245–1254.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_Network&diff=46118Semantic Relation Classification——via Convolution Neural Network2020-11-23T19:56:45Z<p>Hhalim: /* Results */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma, Xuetong Wang<br />
<br />
== Introduction ==<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gábor et al., 2018). SemEval 2018 Task 7 addresses the problem of extracting the relation between two entities in the same sentence and classifying it into one of 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE, PART_WHOLE, TOPIC, and COMPARE.<br />
<br />
Data comes from 350 scientific paper abstracts, which have 1228 and 1248 annotated sentences for the two tasks. For each data point, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reversed, and then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Network.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8 (Hendrickx et al., 2010) studied 9 relations between word pairs. However, it was not designed specifically for scientific text analysis. Xu et al. (2015a) and Santos et al. (2015) both applied CNNs with negative sampling to that earlier task.<br />
<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px]]<br />
<br />
This is the architecture of the CNN. A sentence is first transformed via feature embeddings. Basically, each sentence is transformed into continuous word embeddings:<br />
<br />
[[File:WordPosition.png]]<br />
<br />
<br />
And word position embeddings:<br />
<br />
[[File:Position.png]]<br />
<br />
In the word embeddings, we have a vocabulary ‘V’, and we build a word embedding matrix indexed by the position of each word in the vocabulary. This matrix is trainable and is initialized with pre-trained embedding vectors.<br />
In the word position embeddings, we first need to input some words named ‘entities’, which are the key for the machine to determine the sentence’s relation. If we have two entities, we use their relative positions in the sentence to make the embeddings. We output two vectors, one of which keeps track of each word's position relative to the first entity (the entity is recorded as 0, the word before it as -1, the word after it as 1, etc.), and the other does the same for the second entity. Finally, the two vectors are concatenated as the position embedding.<br />
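<br />
The relative position indices described above can be computed as in the following sketch. The example sentence, the entity choices, and the function names are made up for illustration; in the model, each index would then be looked up in a trainable position embedding table.<br />
<br />
```python
def relative_positions(n_tokens, entity_index):
    """Position of every token relative to one entity (the entity itself is 0)."""
    return [i - entity_index for i in range(n_tokens)]

def position_features(tokens, first_entity, second_entity):
    """Pair each token with its positions relative to the first and second entity."""
    p1 = relative_positions(len(tokens), first_entity)
    p2 = relative_positions(len(tokens), second_entity)
    return list(zip(tokens, p1, p2))
```
<br />
For the tokens "the parser uses a neural network" with entities "parser" (index 1) and "network" (index 5), the word "uses" gets the pair (1, -3): it is one position after the first entity and three positions before the second.<br />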
<br />
<br />
After the embeddings, the model transforms the embedded sentence into a fixed-size representation of the whole sentence via the convolution layer. After max pooling reduces the dimension of the layer outputs, a score for each relation class is obtained via a linear transformation.<br />
<br />
<br />
After featurizing all words, a sentence of length <math> N </math> can be expressed as a sequence of <math> N </math> token vectors, <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
where each entry represents one token. To apply the <br />
convolutional neural network, windows of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
are given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defined as <br />
$$c_{i}=\tanh(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to every window of length <math> k </math>, starting <br />
from the first one. Then a mapped feature vector <br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
is produced.<br />
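<br />
As a rough illustration of the sliding-window step, the sketch below computes the feature map for a single filter in plain Python. The sizes, the random values, and the row-wise layout of the filter (one weight vector per window position) are assumptions made for the example, not details taken from the paper.<br />
<br />
```python
import math
import random


def conv_features(e, W, bias):
    """Apply c_i = tanh(W . e_{i:i+k-1} + bias) to every window of k token vectors.

    e is a list of N token vectors of length d. W is a list of k weight vectors
    of length d, one per window position (a row-wise layout of a d x k filter).
    """
    k = len(W)
    features = []
    for i in range(len(e) - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * x for w, x in zip(W[offset], e[i + offset]))
        features.append(math.tanh(total))
    return features


random.seed(0)
N, d, k = 6, 4, 3  # hypothetical sentence length, embedding size, window size
e = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
c = conv_features(e, W, bias=0.1)
print(len(c))  # one feature per window, N - k + 1 in total
```
<br />
The length of the result depends on the sentence length, which is why a pooling step is needed before classification.<br />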
<br />
Max pooling is then applied, keeping <math> \hat{c}=\max\{c\} </math>.<br />
With different weight filters, different mapped feature vectors are obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
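<br />
A small sketch of the pooling step, assuming hypothetical feature maps from 5 filters: taking the maximum of each map gives one number per filter, so the representation has the same length whatever the sentence length.<br />
<br />
```python
def max_pool(feature_maps):
    """Keep the largest value of each feature map: one entry of r_x per filter."""
    return [max(c) for c in feature_maps]


# Hypothetical feature maps from 5 filters over a sentence with 4 windows.
short_sentence_maps = [
    [0.1, -0.3, 0.7, 0.2],
    [-0.5, -0.1, -0.4, -0.9],
    [0.0, 0.6, 0.3, 0.1],
    [0.8, 0.2, -0.2, 0.4],
    [-0.7, 0.5, 0.5, -0.1],
]
# The same 5 filters over a longer sentence with 6 windows.
long_sentence_maps = [c + [0.0, -0.2] for c in short_sentence_maps]

r_short = max_pool(short_sentence_maps)
r_long = max_pool(long_sentence_maps)
print(len(r_short), len(r_long))
```
<br />
Both representations have 5 entries, one per filter, even though the feature maps have different lengths.<br />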
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained, which gives the score for each class. The relation between the entities of <math> x </math> is classified as <br />
the class with the highest score. Here <math> W^{classes} </math> is the weight matrix being trained.<br />
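<br />
The scoring step is a matrix-vector product followed by an argmax. The sketch below assumes a tiny hypothetical weight matrix and placeholder class names; none of these values come from the paper.<br />
<br />
```python
def class_scores(W_classes, r_x):
    """Compute s(x) = W_classes r_x, one score per class (one row of W_classes per class)."""
    return [sum(w * r for w, r in zip(row, r_x)) for row in W_classes]


def predict(W_classes, r_x, labels):
    """Return the label whose score is highest."""
    s = class_scores(W_classes, r_x)
    best = max(range(len(s)), key=lambda j: s[j])
    return labels[best]


# Hypothetical values: 3 classes, a 2-dimensional sentence representation.
labels = ["class_a", "class_b", "class_c"]
W_classes = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
r_x = [0.2, 0.9]
print(predict(W_classes, r_x, labels))
```
<br />
Here the second row of the matrix gives the largest score, so the second label is returned.<br />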
<br />
To improve the performance, “negative sampling” was used. Consider a training data point <br />
<math> \tilde{x} </math> with correct class <math> \tilde{y} </math>, and let <math> I=Y\setminus\{\tilde{y}\} </math> be the set of <br />
incorrect labels for <math> \tilde{x} </math>. The score of the correct class should exceed the positive margin, and the negative <br />
term (the negative margin plus the largest score among the incorrect classes) should be minimized. So the loss function is <br />
$$L=\log(1+e^{\gamma(m^{+}-s(\tilde{x})_{\tilde{y}})})+\log(1+e^{\gamma(m^{-}+\max_{y'\in I}s(\tilde{x})_{y'})})$$<br />
with margins <math> m^{+} </math>, <math> m^{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
Training is based on the ACL Anthology corpus, which contains 25,938 papers with 136,772,370 tokens in total, <br />
49,600 of which are unique.<br />
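<br />
The ranking loss can be sketched as follows, using the form described in the text (the negative margin plus the largest score among the incorrect classes). The margin and scale values below are assumptions chosen only to make the example concrete.<br />
<br />
```python
import math


def ranking_loss(scores, correct, m_pos, m_neg, gamma):
    """Pairwise ranking loss for one example.

    scores  : list of class scores s(x)
    correct : index of the correct class
    """
    # Largest score among the incorrect classes.
    best_wrong = max(s for j, s in enumerate(scores) if j != correct)
    positive_term = math.log1p(math.exp(gamma * (m_pos - scores[correct])))
    negative_term = math.log1p(math.exp(gamma * (m_neg + best_wrong)))
    return positive_term + negative_term


# Hypothetical margins and scale factor.
m_pos, m_neg, gamma = 2.5, 0.5, 2.0
good = ranking_loss([3.0, -1.0, -2.0], correct=0, m_pos=m_pos, m_neg=m_neg, gamma=gamma)
bad = ranking_loss([-1.0, 3.0, -2.0], correct=0, m_pos=m_pos, m_neg=m_neg, gamma=gamma)
print(good < bad)
```
<br />
The loss is small when the correct class scores well above the positive margin and every incorrect class scores well below the negated negative margin, and it grows when the ranking is reversed.<br />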
<br />
== Results ==<br />
Tuning the hyper-parameters is an important part of machine learning. Beyond traditional hyper-parameter optimization, several<br />
modifications were made to the model in order to increase performance on the test set. Five modifications can be applied:<br />
<br />
1. Merged Training Sets. The two training sets were combined to increase the data set<br />
size, which also improves the balance between classes and gives better predictions.<br />
<br />
2. Reversal Indicator Features. A binary feature was added to indicate whether the relation is reversed.<br />
<br />
3. Custom ACL Embeddings. Word vectors were trained on an ACL-specific<br />
corpus.<br />
<br />
4. Context Words. The size of the context window<br />
around the entity-enclosed text within the sentence was varied.<br />
<br />
5. Ensembling. Models with different early stopping points and random initializations were combined to improve<br />
the predictions.<br />
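<br />
One simple way to combine an ensemble is to average the score vectors of the individual models before taking the highest score. The summary does not say how the ensemble members were combined, so averaging is an assumption here, and the score values are made up for the example.<br />
<br />
```python
def ensemble_scores(score_vectors):
    """Average the per-class scores of several models."""
    n = len(score_vectors)
    return [sum(column) / n for column in zip(*score_vectors)]


def ensemble_predict(score_vectors):
    """Index of the class with the highest averaged score."""
    averaged = ensemble_scores(score_vectors)
    return max(range(len(averaged)), key=lambda j: averaged[j])


# Hypothetical scores from three models trained with different random initializations.
model_scores = [
    [0.9, 0.2, 0.1],
    [0.4, 0.5, 0.3],
    [0.8, 0.1, 0.6],
]
print(ensemble_predict(model_scores))
```
<br />
The second model alone would pick a different class, but the averaged scores favour the class preferred by the other two.<br />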
<br />
These modifications perform well on the training data, as shown<br />
in Table 3.<br />
<br />
[[File:table3.PNG]]<br />
<br />
<br />
<br />
As we can see, the best choice for this model is ensembling, because the random initializations add variety across the models and help avoid overfitting.<br />
During training, some methods only<br />
increased the score on the cross-validation test sets but hurt<br />
the overall macro-F1 score. These methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG]]<br />
<br />
There are six submissions in total, three for each training set. The results<br />
are shown in figure 2.<br />
<br />
The best submission for training set 1.1 is the third, which did not<br />
use the cross-validation folds as a test set. Instead, it ran a constant number of<br />
training epochs, a number that can be chosen by cross-validation on the training data. The best submission for training set 1.2 is the first, which<br />
held out 10% of the training data for validation.<br />
All in all, early stopping cannot always be based on the accuracy of the validation set,<br />
since that does not guarantee better performance on the real test set. Thus,<br />
we have to try new approaches and combine them to see the prediction<br />
results. Stratification should also improve the performance on<br />
the test data.<br />
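<br />
To make the stratification remark concrete, here is a minimal sketch of a stratified hold-out split in plain Python. The class names, class sizes, and the 10% fraction are illustrative assumptions; the point is only that each class keeps the same proportion in the validation set as in the full data.<br />
<br />
```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction, seed=0):
    """Split example indices so every class contributes the same fraction to validation."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one validation example per class.
        cut = max(1, round(len(indices) * validation_fraction))
        validation.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(validation)


# Hypothetical imbalanced data set: 80 examples of one class, 20 of another.
labels = ["frequent"] * 80 + ["rare"] * 20
train, validation = stratified_split(labels, validation_fraction=0.1)
rare_in_validation = sum(1 for i in validation if labels[i] == "rare")
print(len(train), len(validation), rare_in_validation)
```
<br />
A purely random 10% split could by chance contain very few examples of the rare class, which makes validation accuracy a noisier guide for early stopping.<br />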
<br />
== Conclusions ==<br />
Throughout the process, linear classifiers, sequential random forest, LSTM and CNN models were tested, and variations were applied to the models. Among all variations, the vanilla CNN with negative sampling and ACL embeddings performed significantly better than all others. Attention-based pooling, up-sampling and data augmentation were also tested, but they brought barely any improvement.<br />
<br />
== References ==<br />
Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
Dragomir R. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.<br />
<br />
Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna,<br />
and Thierry Charnois. 2018. SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. <br />
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval2018), New Orleans, LA, USA, June 2018.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=46057stat441F212020-11-23T16:14:04Z<p>Hhalim: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes Summary]|| [https://www.youtube.com/watch?v=HZG9RAHhpXc&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] || [https://learn.uwaterloo.ca/d2l/ext/rp/577051/lti/framedlaunch/6ec1ebea-5547-46a2-9e4f-e3dc9d79fd54]<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification Summary] || [https://www.youtube.com/watch?v=UgVRzi9Ubws]<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs Summary]||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems Summary] || [https://www.youtube.com/watch?v=bQI9S6bCo8o]<br />
|-<br />
|Week of Nov 16 || Casey De Vera, Solaiman Jawad || 7|| IPBoost – Non-Convex Boosting via Integer Programming || [https://arxiv.org/pdf/2002.04679.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost Summary] || [https://www.youtube.com/watch?v=4DhJDGC4pyI&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran || 8|| What Game Are We Playing? End-to-end Learning in Normal and Extensive Form Games || [https://arxiv.org/pdf/1805.02777.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing Summary] || [https://www.youtube.com/watch?v=9qJoVxo3hnI&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| A survey of neural network-based cancer prediction models from microarray data || [https://www.sciencedirect.com/science/article/pii/S0933365717305067 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Y93fang Summary] || [https://youtu.be/B8pPUU8ypO0]<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou Summary] || [https://www.youtube.com/watch?v=tvCEvvy54X8&ab_channel=JJLian]<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat Summary] || [https://www.youtube.com/watch?v=or5RTxDnZDo]<br />
|-<br />
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Large Scale Landmark Recognition via Deep Metric Learning || [https://arxiv.org/pdf/1908.10192.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] || [https://youtu.be/i_5PQdfuH-Y]<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification——via_Convolution_Neural_Network Summary]||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks Summary] || [https://youtu.be/x9eUgEwntcs Video]<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker Summary]|| [https://www.youtube.com/watch?v=kazqcOwbtTI Video]<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence Summary] || [https://www.youtube.com/watch?v=aAwjaos_Hus]<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]|| [https://youtu.be/vOENmt9jgVE Video]<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan Summary]||<br />
|-<br />
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || ||<br />
|-<br />
|Week of Nov 30 || Sai Arvind Budaraju, Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi, Will Thibault, Jan Lau, Jiwon Yang || 25|| Loss Function Search for Face Recognition || [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] ||<br />
|-<br />
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Daehyun Kim || 27|| Bag of Tricks for Efficient Text Classification || [https://arxiv.org/pdf/1607.01759.pdf paper] || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining Summary] ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classification systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] || ||<br />
|-<br />
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 30 || Mushi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction Summary] ||</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=46035stat441F212020-11-23T04:33:30Z<p>Hhalim: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan,Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
[https://www.youtube.com/watch?v=TVLpSFYgF0c&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes Summary]|| [https://www.youtube.com/watch?v=HZG9RAHhpXc&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin Summary] || [https://learn.uwaterloo.ca/d2l/ext/rp/577051/lti/framedlaunch/6ec1ebea-5547-46a2-9e4f-e3dc9d79fd54]<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification Summary] || [https://www.youtube.com/watch?v=UgVRzi9Ubws]<br />
|-<br />
|Week of Nov 16 || Matthew Hall, Johnathan Chalaturnyk || 5|| Neural Ordinary Differential Equations || [https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEs Summary]||<br />
|-<br />
|Week of Nov 16 || Luwen Chang, Qingyang Yu, Tao Kong, Tianrong Sun || 6|| Adversarial Attacks on Copyright Detection Systems || Paper [https://proceedings.icml.cc/static/paper_files/icml/2020/1894-Paper.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Attacks_on_Copyright_Detection_Systems Summary] || [https://www.youtube.com/watch?v=bQI9S6bCo8o]<br />
|-<br />
|Week of Nov 16 || Casey De Vera, Solaiman Jawad || 7|| IPBoost – Non-Convex Boosting via Integer Programming || [https://arxiv.org/pdf/2002.04679.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoost Summary] || [https://www.youtube.com/watch?v=4DhJDGC4pyI&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran || 8|| What Game Are We Playing? End-to-end Learning in Normal and Extensive Form Games || [https://arxiv.org/pdf/1805.02777.pdf] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing Summary] || [https://www.youtube.com/watch?v=9qJoVxo3hnI&feature=youtu.be]<br />
|-<br />
|Week of Nov 16 || Yuchuan Wu || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || Zhou Zeping, Siqi Li, Yuqin Fang, Fu Rao || 10|| A survey of neural network-based cancer prediction models from microarray data || [https://www.sciencedirect.com/science/article/pii/S0933365717305067 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Y93fang Summary] || [https://youtu.be/B8pPUU8ypO0]<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:J46hou Summary] || [https://www.youtube.com/watch?v=tvCEvvy54X8&ab_channel=JJLian]<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat Summary] || [https://www.youtube.com/watch?v=or5RTxDnZDo]<br />
|-<br />
|Week of Nov 23 || Taohao Wang, Zeren Shen, Zihao Guo, Rui Chen || 13|| Large Scale Landmark Recognition via Deep Metric Learning || [https://arxiv.org/pdf/1908.10192.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_Data Summary] || [https://youtu.be/i_5PQdfuH-Y]<br />
|-<br />
|Week of Nov 23 || Rui Gong, Xuetong Wang, Xinqi Ling, Di Ma || 15|| Semantic Relation Classification via Convolution Neural Network|| [https://www.aclweb.org/anthology/S18-1127.pdf paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification——via_Convolution_Neural_Network Summary]||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| Graph Structure of Neural Networks || [https://proceedings.icml.cc/paper/2020/file/757b505cfd34c64c85ca5b5690ee5293-Paper.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_Networks Summary] || [https://youtu.be/x9eUgEwntcs Video]<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker Summary]||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_Influence Summary] || [https://www.youtube.com/watch?v=aAwjaos_Hus]<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_Speed_Reading_via_Skim-RNN Summary]|| [https://youtu.be/vOENmt9jgVE Video]<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Yktan Summary]||<br />
|-<br />
|Week of Nov 30 || Banno Dion, Battista Joseph, Kahn Solomon || 21|| Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks || [https://www.sciencedirect.com/science/article/pii/S1877050919310646] || ||<br />
|-<br />
|Week of Nov 30 || Sai Arvind Budaraju, Isaac Ellmen, Dorsa Mohammadrezaei, Emilee Carson || 22|| A universal SNP and small-indel variant caller using deep neural networks||[https://www.nature.com/articles/nbt.4235.epdf?author_access_token=q4ZmzqvvcGBqTuKyKgYrQ9RgN0jAjWel9jnR3ZoTv0NuM3saQzpZk8yexjfPUhdFj4zyaA4Yvq0LWBoCYQ4B9vqPuv8e2HHy4vShDgEs8YxI_hLs9ov6Y1f_4fyS7kGZ Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Fagan, Cooper Brooke, Maya Perelman || 23|| Efficient kNN Classification With Different Number of Nearest Neighbors || [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7898482 Paper] || ||<br />
|-<br />
|Week of Nov 30 || Karam Abuaisha, Evan Li, Jason Pu, Nicholas Vadivelu || 24|| Being Bayesian about Categorical Probability || [https://proceedings.icml.cc/static/paper_files/icml/2020/3560-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi, Will Thibault, Jan Lau, Jiwon Yang || 25|| Loss Function Search for Face Recognition || [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Zihui (Betty) Qin, Wenqi (Maggie) Zhao, Muyuan Yang, Amartya (Marty) Mukherjee || 26|| Deep Learning for Cardiologist-level Myocardial Infarction Detection in Electrocardiograms || [https://arxiv.org/pdf/1912.07618.pdf?fbclid=IwAR0RwATSn4CiT3qD9LuywYAbJVw8YB3nbex8Kl19OCExIa4jzWaUut3oVB0 Paper] || Summary [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Deep_Learning_for_Cardiologist-level_Myocardial_Infarction_Detection_in_Electrocardiograms&fbclid=IwAR1Tad2DAM7LT6NXXuSYDZtHHBvN0mjZtDdCOiUFFq_XwVcQxG3hU-3XcaE] ||<br />
|-<br />
|Week of Nov 30 || Stan Lee, Seokho Lim, Kyle Jung, Daehyun Kim || 27|| Bag of Tricks for Efficient Text Classification || [https://arxiv.org/pdf/1607.01759.pdf paper] || ||<br />
|-<br />
|Week of Nov 30 || Yawen Wang, Danmeng Cui, ZiJie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid || 28|| A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques || [https://arxiv.org/pdf/1707.02919.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_Mining Summary] ||<br />
|-<br />
|Week of Nov 30 || Qing Guo, XueGuang Ma, James Ni, Yuanxin Wang || 29|| Mask R-CNN || [https://arxiv.org/pdf/1703.06870.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classification systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-<br />
|Week of Nov 30 || Daniel Zhang, Jacky Yao, Scholar Sun, Russell Parco, Ian Cheung || 31 || Speech2Face: Learning the Face Behind a Voice || [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform Paper] || ||<br />
|-<br />
|Week of Nov 30 || Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du || 32 || Evaluating Machine Accuracy on ImageNet || [https://proceedings.icml.cc/static/paper_files/icml/2020/6173-Paper.pdf] || ||<br />
|-<br />
|Week of Nov 30 || Mushi Wang, Siyuan Qiu, Yan Yu || 33 || Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections || [https://ieeexplore.ieee.org/abstract/document/8957421 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_Prediction Summary] ||</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45728Superhuman AI for Multiplayer Poker2020-11-22T19:21:48Z<p>Hhalim: /* References */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most superhuman AI systems could only beat human players in two-player zero-sum games. The most common strategy these systems use to win such games is to find a Nash equilibrium. A Nash equilibrium is a set of strategies, one per player, in which no player can do better by unilaterally switching to a different strategy.<br />
<br />
More specifically, in the game of poker we only have AI models that can beat human players in two-player settings. Poker is a great challenge for AI and game theory because it captures the challenges of hidden information so elegantly. Developing a superhuman AI for multiplayer poker is therefore the remaining great milestone in this field: no polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite the lack of strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game; the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable: it guarantees that the player will not lose in expectation, regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium strategy is to pick each option with equal probability. Against this strategy, the best any opponent can do is tie in expectation, but our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy makes us exploitable in turn: an opponent who notices that we always play Rock can switch to always playing Paper.<br />
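<br />
The Rock-Paper-Scissors argument can be checked numerically. The following is a minimal sketch, assuming the usual payoffs of +1 for a win, -1 for a loss and 0 for a tie; the names <code>PAYOFF</code> and <code>expected_payoff</code> are illustrative and do not come from the paper.<br />

```python
# Payoff to "us" in Rock-Paper-Scissors; rows are our action, columns the
# opponent's action, both ordered Rock, Paper, Scissors.
PAYOFF = [
    [0, -1, 1],   # Rock: ties Rock, loses to Paper, beats Scissors
    [1, 0, -1],   # Paper: beats Rock, ties Paper, loses to Scissors
    [-1, 1, 0],   # Scissors: loses to Rock, beats Paper, ties Scissors
]


def expected_payoff(ours, theirs):
    """Expected payoff to us when both players sample from mixed strategies."""
    return sum(
        ours[i] * PAYOFF[i][j] * theirs[j]
        for i in range(3)
        for j in range(3)
    )


UNIFORM = [1 / 3, 1 / 3, 1 / 3]      # the Nash equilibrium strategy
ALWAYS_ROCK = [1.0, 0.0, 0.0]
ALWAYS_PAPER = [0.0, 1.0, 0.0]
ALWAYS_SCISSORS = [0.0, 0.0, 1.0]

# The equilibrium strategy breaks even against a weak opponent.
print(expected_payoff(UNIFORM, ALWAYS_SCISSORS))
# Exploiting the weak opponent wins every round...
print(expected_payoff(ALWAYS_ROCK, ALWAYS_SCISSORS))
# ...but the exploiting strategy loses every round once the opponent adapts.
print(expected_payoff(ALWAYS_ROCK, ALWAYS_PAPER))
```

The uniform strategy earns zero against every opponent strategy, which is why it can be neither exploited nor used to exploit.<br />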
<br />
Even approximating a Nash equilibrium is hard in theory, and in games with more than two players, even the best complete algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games. Beyond the difficulty of finding a Nash equilibrium with three or more players, even if one could be computed efficiently, it is questionable whether playing it would be a good choice. A game can have many, even infinitely many, Nash equilibria, and if each player independently computes and plays their own, the resulting combination of strategies might not be a Nash equilibrium at all.<br />
<br />
Consider the Lemonade Stand game in Figure 1 below. There are 4 players, and each player's goal is to pick a spot on the ring that is as far as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by colour; in each one the players are spread evenly around the ring. The right circle illustrates what can happen if each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
The right circle in Figure 1 shows that when each player plays their part of a different Nash equilibrium, the joint strategy might not be a Nash equilibrium, so the players do not end up in the best possible locations. This suggests that finding a Nash equilibrium is not necessarily the right goal outside of two-player zero-sum games, and that the goal should not be a specific game-theoretic solution concept. Instead, the goal is an AI that empirically and consistently defeats human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses abstraction to make computation scalable. To reduce the number of decision points, some actions are eliminated from consideration, and similar decision points are grouped together and treated as identical. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes; the number of bet sizes it considers varies between 1 and 14 depending on the situation. Information abstraction groups together decision points that are similar in terms of the information revealed, which in poker means the player’s cards and the revealed board cards. Information abstraction is used only to reason about situations on future betting rounds, never the current betting round.<br />
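<br />
As a rough illustration of the two kinds of abstraction, the sketch below snaps a bet to the nearest of a few allowed sizes and groups hands into buckets by an estimated strength. The function names, the pot fractions and the strength values are all hypothetical; this summary does not describe how Pluribus actually builds its abstractions, so the code only shows the general idea of treating similar situations as identical.<br />

```python
from collections import defaultdict


def abstract_bet(bet, pot, allowed_fractions=(0.5, 1.0, 2.0)):
    """Action abstraction: snap a bet to the closest allowed fraction of the pot.

    The allowed fractions are an assumption made for this example.
    """
    closest = min(allowed_fractions, key=lambda f: abs(f * pot - bet))
    return closest * pot


def bucket_hands(hand_strengths, num_buckets):
    """Information abstraction: group hands with similar estimated strength.

    hand_strengths maps a hand label to a strength in [0, 1]; hands that fall
    into the same interval share a bucket and are treated as identical.
    """
    buckets = defaultdict(list)
    for hand, strength in hand_strengths.items():
        index = min(int(strength * num_buckets), num_buckets - 1)
        buckets[index].append(hand)
    return dict(buckets)


# A bet of 90 into a pot of 100 is treated as a pot-sized bet.
print(abstract_bet(90, 100))
# Made-up strength estimates, used only to show the grouping.
print(bucket_hands({"AA": 0.85, "KK": 0.82, "72o": 0.35}, num_buckets=4))
```

Fewer distinct actions and fewer distinct information states mean a smaller game tree, at the cost of some precision in the resulting strategy.<br />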
<br />
Pluribus starts from a precomputed strategy, called the blueprint strategy, which it improves by searching in real time for a better strategy in the situations it encounters during the game. Pluribus plays directly from the blueprint strategy only in the first betting round, where the number of decision points is small. The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in AI for imperfect-information games; the AI is trained by repeatedly playing against copies of itself, without any data from human or prior AI play as input. To make CFR applicable, poker is represented as a game tree: a tree structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker)</div><br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker with randomly dealt cards and designates one player as the traverser of the game tree. Once the hand is complete, the AI reviews each decision the traverser made and investigates how much better or worse it would have done by choosing the other available actions instead, including the hypothetical future decisions that would have followed those actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what the traverser actually received (in expectation) on the iteration. The counterfactual regret of each action is accumulated over the iterations as more scenarios and decision points are encountered.<br />
<br />
Regret is thus a numeric value: a positive regret indicates that you regret your decision (another action would have done better), a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR guarantees in all finite games that counterfactual regrets grow sublinearly in the number of iterations, and in two-player zero-sum games the average strategy over all iterations converges to a Nash equilibrium. Pluribus uses Linear CFR in the early iterations to reduce the influence of the poor initial iterations, i.e., it assigns a weight of T to the regret contributions at iteration T.<br />
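<br />
The strategy update described above can be sketched for a single decision point. This is a simplified illustration, not Pluribus's implementation: it uses regret matching, the standard update rule in CFR, and a linear weight of T on iteration T. The function names and the example action values are assumptions made for the sketch.<br />

```python
def regret_matching(cumulative_regrets):
    """Turn cumulative regrets into a strategy.

    Actions are chosen in proportion to their positive regret; if no action
    has positive regret, the strategy falls back to uniform.
    """
    positive = [max(regret, 0.0) for regret in cumulative_regrets]
    total = sum(positive)
    if total == 0:
        count = len(cumulative_regrets)
        return [1.0 / count] * count
    return [value / total for value in positive]


def update_regrets(cumulative_regrets, action_values, strategy, iteration, linear=True):
    """Add this iteration's regrets to the running totals.

    The regret of an action is its value minus the expected value of the
    strategy actually played. With linear=True the contribution is weighted
    by the iteration number, so early iterations count for less.
    """
    expected = sum(p * v for p, v in zip(strategy, action_values))
    weight = iteration if linear else 1
    return [
        regret + weight * (value - expected)
        for regret, value in zip(cumulative_regrets, action_values)
    ]


# Hypothetical values for three actions at one decision point.
regrets = [0.0, 0.0, 0.0]
strategy = regret_matching(regrets)   # uniform, since nothing is known yet
regrets = update_regrets(regrets, [1.0, 0.0, -1.0], strategy, iteration=1)
print(regret_matching(regrets))       # all probability moves to the first action
```

A full CFR implementation applies this update at every decision point the traverser visits and tracks the average strategy across iterations; the sketch leaves both out.<br />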
<br />
An additional feature of Pluribus is that during search in subgames, instead of assuming that all players play according to a single fixed strategy beyond the subgame, Pluribus assumes that each player may choose among k different strategies, specialized to that player, when a leaf node of the subgame is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player bets only when holding the best possible hand, the opponents would learn that fact and always fold in that scenario, so a balanced strategy must also bet with some weaker hands.<br />
In summary, the blueprint strategy is computed offline for the entire game and is then improved by real-time search during play.<br />
<br />
== Experimental Results ==<br />
Pluribus was tested against human players in 2 formats. The first format had 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were all poker players who had won more than $1M playing professionally, and they were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. Players were anonymized with aliases that remained consistent throughout all their games, which let the players keep track of the tendencies and playing styles of each opponent over the 10,000 hands.<br />
<br />
The second format had one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and the same aliasing process as in the first format was followed.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field; a big blind is the initial amount of money the second player has to put in the pot, and a milli big blind is one thousandth of that. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, the results were as follows:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
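<br />
The p-values in the table can be roughly reproduced from the reported means and standard errors. The sketch below is an approximation under stated assumptions: it uses the normal distribution in place of the t-distribution (reasonable for thousands of hands), and its inputs are the rounded values from the table, so the results will not match the reported p-values exactly. The function name is illustrative.<br />

```python
import math


def one_tailed_p_value(mean, standard_error):
    """Approximate one-tailed p-value for the hypothesis that the true mean is <= 0.

    Uses the normal approximation: p = 1 - Phi(mean / standard_error).
    """
    z = mean / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))


# Rounded values taken from the table above (mbb/game).
p_5h_1ai = one_tailed_p_value(48, 25)
p_1h_5ai = one_tailed_p_value(32, 15)

print(f"5H+1AI: z = {48 / 25:.2f}, p is about {p_5h_1ai:.3f}")
print(f"1H+5AI: z = {32 / 15:.2f}, p is about {p_1h_5ai:.3f}")
```

Both approximate p-values fall below 0.05, consistent with the conclusion that Pluribus was profitable in both formats.<br />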
<br />
[[File:left.PNG| 425px | x215px |left]] [[File:right.PNG| 425px | x215px |right ]]<br />
<br />
<div align="center">Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot.</div> <br />
<br />
Because Pluribus’s strategy was developed by self-play only, without any human data, it offers an unbiased, outside perspective on what optimal play looks like. The standard convention that “limping” (calling the 'big blind' rather than folding or raising) is not optimal was confirmed by Pluribus: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting” (starting a round by betting when one ended the previous round with a call), which is generally dismissed by players, was adopted by Pluribus far more often than by humans, suggesting that it can be profitable.<br />
<br />
<br />
== Discussion and Critiques ==<br />
<br />
The blueprint strategy for Pluribus uses two abstraction methods, which reduce the computational power required. As a result, the blueprint strategy was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to other recent superhuman AI milestones for games. The researchers condensed the problem effectively to fit within modest computational resources.<br />
<br />
Pluribus shows that a superhuman AI can be built on empirical results without requiring theoretical guarantees. This can serve as a baseline for future AI systems and help further AI research. It would be interesting to apply Pluribus's empirically driven approach to real-life problems such as autonomous driving or stock market trading.<br />
<br />
== Conclusion ==<br />
<br />
Because Pluribus’s strategy was developed by self-play only, without any human data, it offers an unbiased, outside perspective on what optimal play looks like.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (12 Jul 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (2020, November 17). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45727Superhuman AI for Multiplayer Poker2020-11-22T19:18:14Z<p>Hhalim: /* References */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman AI systems for games have been limited to two-player zero-sum settings. The most common approach in those games is to approximate a Nash equilibrium: a set of strategies, one per player, in which no player can improve their expected outcome by unilaterally changing their own strategy.<br />
<br />
In poker specifically, previous AI systems could beat top human players only in two-player settings. Poker is a long-standing challenge in AI and game theory because it captures the difficulty of hidden information so elegantly. Multiplayer poker was therefore the major remaining milestone in this field. It is also much harder in theory: no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite the lack of strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game; the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable: it guarantees that the player will not lose in expectation, regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium strategy is to pick each option with equal probability. Against this strategy, the best any opponent can do is tie in expectation, but our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy makes us exploitable in turn: an opponent who notices that we always play Rock can switch to always playing Paper.<br />
<br />
Even approximating a Nash equilibrium is hard in theory, and in games with more than two players, even the best complete algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games. Beyond the difficulty of finding a Nash equilibrium with three or more players, even if one could be computed efficiently, it is questionable whether playing it would be a good choice. A game can have many, even infinitely many, Nash equilibria, and if each player independently computes and plays their own, the resulting combination of strategies might not be a Nash equilibrium at all.<br />
<br />
Consider the Lemonade Stand game in Figure 1 below. There are 4 players, and each player's goal is to pick a spot on the ring that is as far as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by colour; in each one the players are spread evenly around the ring. The right circle illustrates what can happen if each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
The right circle in Figure 1 shows that when each player plays their part of a different Nash equilibrium, the joint strategy might not be a Nash equilibrium, so the players do not end up in the best possible locations. This suggests that finding a Nash equilibrium is not necessarily the right goal outside of two-player zero-sum games, and that the goal should not be a specific game-theoretic solution concept. Instead, the goal is an AI that empirically and consistently defeats human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses abstraction to make computation scalable. To reduce the number of decision points, some actions are eliminated from consideration, and similar decision points are grouped together and treated as identical. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes; the number of bet sizes it considers varies between 1 and 14 depending on the situation. Information abstraction groups together decision points that are similar in terms of the information revealed, which in poker means the player’s cards and the revealed board cards. Information abstraction is used only to reason about situations on future betting rounds, never the current betting round.<br />
<br />
Pluribus starts from a precomputed strategy, called the blueprint strategy, which it improves by searching in real time for a better strategy in the situations it encounters during the game. Pluribus plays directly from the blueprint strategy only in the first betting round, where the number of decision points is small. The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in AI for imperfect-information games; the AI is trained by repeatedly playing against copies of itself, without any data from human or prior AI play as input. To make CFR applicable, poker is represented as a game tree: a tree structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker)</div><br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker with randomly dealt cards and designates one player as the traverser of the game tree. Once the hand is complete, the AI reviews each decision the traverser made and investigates how much better or worse it would have done by choosing the other available actions instead, including the hypothetical future decisions that would have followed those actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what the traverser actually received (in expectation) on the iteration. The counterfactual regret of each action is accumulated over the iterations as more scenarios and decision points are encountered.<br />
<br />
Regret is thus a numeric value: a positive regret indicates that you regret your decision (another action would have done better), a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR guarantees in all finite games that counterfactual regrets grow sublinearly in the number of iterations, and in two-player zero-sum games the average strategy over all iterations converges to a Nash equilibrium. Pluribus uses Linear CFR in the early iterations to reduce the influence of the poor initial iterations, i.e., it assigns a weight of T to the regret contributions at iteration T.<br />
<br />
An additional feature of Pluribus is that during search in subgames, instead of assuming that all players play according to a single fixed strategy beyond the subgame, Pluribus assumes that each player may choose among k different strategies, specialized to that player, when a leaf node of the subgame is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player bets only when holding the best possible hand, the opponents would learn that fact and always fold in that scenario, so a balanced strategy must also bet with some weaker hands.<br />
In summary, the blueprint strategy is computed offline for the entire game and is then improved by real-time search during play.<br />
<br />
== Experimental Results ==<br />
Pluribus was tested against human players in 2 formats. The first format had 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were all poker players who had won more than $1M playing professionally, and they were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. Players were anonymized with aliases that remained consistent throughout all their games, which let the players keep track of the tendencies and playing styles of each opponent over the 10,000 hands.<br />
<br />
The second format had one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and the same aliasing process as in the first format was followed.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field; a big blind is the initial amount of money the second player has to put in the pot, and a milli big blind is one thousandth of that. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, the results were as follows:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
<br />
[[File:left.PNG| 425px | x215px |left]] [[File:right.PNG| 425px | x215px |right ]]<br />
<br />
<div align="center">Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot.</div> <br />
<br />
Because Pluribus’s strategy was developed by self-play only, without any human data, it offers an unbiased, outside perspective on what optimal play looks like. The standard convention that “limping” (calling the 'big blind' rather than folding or raising) is not optimal was confirmed by Pluribus: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting” (starting a round by betting when one ended the previous round with a call), which is generally dismissed by players, was adopted by Pluribus far more often than by humans, suggesting that it can be profitable.<br />
<br />
<br />
== Discussion and Critiques ==<br />
<br />
The blueprint strategy for Pluribus uses two abstraction methods, which reduce the computational power required. As a result, the blueprint strategy was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to other recent superhuman AI milestones for games. The researchers condensed the problem effectively to fit within modest computational resources.<br />
<br />
Pluribus shows that a superhuman AI can be built on empirical results without requiring theoretical guarantees. This can serve as a baseline for future AI systems and help further AI research. It would be interesting to apply Pluribus's empirically driven approach to real-life problems such as autonomous driving or stock market trading.<br />
<br />
== Conclusion ==<br />
<br />
Because Pluribus’s strategy was developed by self-play only, without any human data, it offers an unbiased, outside perspective on what optimal play looks like.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (12 Jul 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (2020, November 17). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45726Superhuman AI for Multiplayer Poker2020-11-22T19:14:30Z<p>Hhalim: /* References */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman AI systems for games have been limited to two-player zero-sum settings. The most common approach in those games is to approximate a Nash equilibrium: a set of strategies, one per player, in which no player can improve their expected outcome by unilaterally changing their own strategy.<br />
<br />
In poker specifically, previous AI systems could beat top human players only in two-player settings. Poker is a long-standing challenge in AI and game theory because it captures the difficulty of hidden information so elegantly. Multiplayer poker was therefore the major remaining milestone in this field. It is also much harder in theory: no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite the lack of strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game; the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable: it guarantees that the player will not lose in expectation, regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium strategy is to pick each option with equal probability. Against this strategy, the best any opponent can do is tie in expectation, but our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy makes us exploitable in turn: an opponent who notices that we always play Rock can switch to always playing Paper.<br />
<br />
Even approximating a Nash equilibrium is hard in theory, and in games with more than two players even the best complete algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require too many samples and are not competitive outside of small games. Finding a Nash equilibrium in games with three or more players is a hard problem in itself. Moreover, even if we could efficiently compute a Nash equilibrium in games with more than two players, it is highly questionable whether playing that strategy would be a good choice. If each player independently computes their own Nash equilibrium, there could be infinitely many strategies to choose from, and the joint strategy formed by the players' separately computed equilibria might not even be a Nash equilibrium.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and the goal of each player is to find a spot on the ring that is as far away as possible from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria, distinguished by different colours, each of which would benefit everyone. The right circle illustrates what would happen if each player decided to calculate their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibrium, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses abstraction to make computation scalable. To cope with the very large number of decision points, some actions are eliminated from consideration, and similar decision points are grouped together and treated as identical. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bet sizes it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information, such as the player’s cards and the revealed board cards. Information abstraction is only used to reason about situations on future betting rounds, never the current betting round.<br />
<br />
Pluribus starts from a precomputed strategy, the “blueprint strategy”, which it improves by searching in real time in the situations it finds itself in during the course of the game. In the first betting round, where the number of decision points is small, Pluribus plays according to the blueprint strategy. The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used for AI in imperfect-information games; the AI is trained by repeatedly playing against copies of itself, without any data from human or prior AI play used as input. For ease of computation of CFR in this context, poker is represented as a game tree. A game tree is a tree structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker)</div><br />
<br />
At the start of each iteration, MCCFR randomly simulates a hand of poker and designates one player as the traverser of the game tree. Once the hand is completed, the AI reviews each decision made by the traverser and investigates whether the decision was profitable. It compares the decision with the other actions available to the traverser at that point, and with the hypothetical future decisions that would have followed those other actions. To evaluate a decision, counterfactual regret is used: the difference between what the traverser would have expected to receive for choosing an action and what it actually received on the iteration. The counterfactual regret for a decision is updated over the iterations as more scenarios or decision points are encountered.<br />
<br />
Regret is thus a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations; in two-player zero-sum games, the average strategy over all iterations converges to an approximate Nash equilibrium. In all finite games, CFR guarantees that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of the initial bad iterations, i.e., it assigns a weight of T to the regret contributions at iteration T.<br />
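<br />
The strategy update described above can be sketched for a single decision point. This is a simplified illustration rather than Pluribus's implementation: it uses plain regret matching at one node with fixed, made-up action values, and the function names <code>regret_matching</code> and <code>accumulate_regret</code> are assumptions introduced here.<br />

```python
# Simplified single-node sketch; not the MCCFR implementation from the paper.
def regret_matching(cumulative_regrets):
    """Turn cumulative regrets into action probabilities.

    Actions are played in proportion to their positive regret; if no action
    has positive regret, fall back to the uniform strategy.
    """
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(positive)] * len(positive)
    return [p / total for p in positive]


def accumulate_regret(cumulative_regrets, action_values, strategy, iteration, linear=True):
    """Add one iteration's regrets, weighted by the iteration number if linear."""
    # Value the traverser actually expects under its current strategy.
    node_value = sum(p * v for p, v in zip(strategy, action_values))
    weight = iteration if linear else 1
    return [
        r + weight * (v - node_value)
        for r, v in zip(cumulative_regrets, action_values)
    ]


# Hypothetical values of three actions at one decision point.
action_values = [1.0, 0.0, -1.0]
regrets = [0.0, 0.0, 0.0]
for t in (1, 2):
    strategy = regret_matching(regrets)
    regrets = accumulate_regret(regrets, action_values, strategy, iteration=t)
final_strategy = regret_matching(regrets)
```

After two iterations the action with the highest value carries all of the positive regret, so it is chosen with probability 1. With the linear weighting, the second iteration contributes twice as much to the cumulative regrets as the first.<br />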
<br />
An additional feature of Pluribus is that in subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose between k different strategies, specialized to that player, when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player bet only when holding the best possible hand, the opponents would learn that fact and always fold in that scenario; a balanced strategy avoids being predictable in this way.<br />
In summary, the blueprint strategy is produced offline for the entire game and is gradually improved while making real-time decisions during the game.<br />
<br />
== Experimental Results ==<br />
Pluribus was evaluated against human players in two formats. The first format consisted of five human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. The players were anonymized by giving each of them an alias that remained consistent throughout all their games. The aliases let the players keep track of the tendencies and playing styles of each player over the 10,000 hands played.<br />
<br />
The second format consisted of one human player and five copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and the same aliasing process was followed as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), which is the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and a milli big blind is one thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, the results were as follows:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
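<br />
As a rough sanity check on the table, the p-values can be approximated from the reported means and standard errors. The sketch below assumes a standard normal approximation to the test statistic (the paper reports one-tailed t-tests), and the inputs are the rounded values from the table, so the outputs need not match the reported p-values exactly. The function name <code>one_tailed_p_value</code> is an assumption introduced here.<br />

```python
import math


def one_tailed_p_value(mean, standard_error):
    """P(Z >= mean / standard_error) under a standard normal approximation."""
    z = mean / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))


# (average, standard error) in mbb/game, copied from the table above.
results = {"5H+1AI": (48, 25), "1H+5AI": (32, 15)}
p_values = {fmt: one_tailed_p_value(m, se) for fmt, (m, se) in results.items()}
```

Under this approximation both formats fall below the 0.05 threshold, in agreement with the conclusion that Pluribus was profitable in both experiments.<br />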
<br />
[[File:left.PNG| 425px | x215px |left]] [[File:right.PNG| 425px | x215px |right ]]<br />
<br />
<div align="center">Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot.</div> <br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained. The standard convention of “limping” in poker (calling the 'big blind' rather than folding or raising) is confirmed to be suboptimal by Pluribus, since it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting” (starting a round by betting when someone else ended the previous round with a call), a convention that is dismissed by players, was used by Pluribus much more often than by humans and proved to be profitable.<br />
<br />
<br />
== Discussion and Critiques ==<br />
<br />
The blueprint strategy for Pluribus uses two abstraction methods, which reduce the computational power required. As a result, the blueprint strategy was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to other recent superhuman AI milestones for games. The researchers have thus condensed the problem down to fit currently available computational power.<br />
<br />
Pluribus shows that we can use observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees. This can serve as a baseline for future AI systems and help AI research. It would be interesting to apply Pluribus's empirical, non-theoretical approach to more real-life problems, such as autonomous driving or stock market trading.<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Nash equilibrium. (2020, November 22). In Wikipedia. https://en.wikipedia.org/wiki/Nash_equilibrium</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat&diff=45539User:Cvmustat2020-11-21T18:38:27Z<p>Hhalim: /* Conclusion & Summary */</p>
<hr />
<div><br />
== Combine Convolution with Recurrent Networks for Text Classification == <br />
'''Team Members''': Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea<br />
<br />
'''Date''': Week of Nov 23 <br />
<br />
== Introduction ==<br />
<br />
<br />
Text classification is the task of assigning a set of predefined categories to natural language texts. It is a fundamental task in Natural Language Processing (NLP) with various applications, such as sentiment analysis and topic classification. A classic example is the following: given a set of news articles, is it possible to classify the genre or subject of each article? Text classification is useful because text data is a rich source of information, but extracting insights from it directly can be difficult and time-consuming, as most text data is unstructured.[1] NLP text classification can help structure and analyze text automatically, quickly, and cost-effectively, allowing individuals to extract important features from the text more easily than before. <br />
<br />
In practice, pre-trained word embeddings and deep neural networks are used together for NLP text classification. Word embeddings are used to map the raw text data to an implicit space where the semantic relationships of the words are preserved; words with similar meanings have similar representations. One can then feed these embeddings into deep neural networks to learn different features of the text. Convolutional Neural Networks can be used to determine the semantic composition of the text (the meaning), as they are able to capture both local and position-invariant features of the text.[2] Alternatively, Recurrent Neural Networks can be used to determine the contextual meaning of each word in the text (how each word relates to the others) by treating the text as sequential data and then analyzing each word separately. [3] Previous attempts to combine these two neural networks and incorporate the advantages of both involve streamlining the two networks, which might decrease their performance. In addition, most methods incorporating a bi-directional Recurrent Neural Network concatenate the forward and backward hidden states at each time step, which results in a vector that does not contain the interaction information between the forward and backward hidden states.[4] The hidden state in one direction contains only the contextual meaning in that particular direction; however, a word's contextual representation is intuitively more accurate when collected and viewed from both directions. This paper argues that failing to observe the meaning of a word in both directions causes the loss of the true meaning of the word, especially for polysemic words (words with more than one meaning) that are context sensitive.<br />
<br />
== Paper Key Contributions ==<br />
<br />
This paper suggests an enhanced method of text classification by proposing a new way of combining Convolutional and Recurrent Neural Networks that involves the addition of a neural tensor layer. The proposed method maintains each network's respective strengths, which are normally lost in previous combination methods. The new architecture is called CRNN, and it uses a CNN and an RNN that run in parallel on the same input sentence. The CNN uses weight matrix learning and produces a 2D matrix that shows the importance of each word based on local and position-invariant features. The bidirectional RNN produces a matrix that contains each word's contextual representation, i.e., each word's importance in relation to the rest of the sentence. A neural tensor layer is introduced on top of the RNN to fuse the bi-directional contextual information surrounding a particular word. The architecture combines these two matrix representations to classify the text and to provide the importance of each word for the prediction, which can help with the interpretation of the results. The model also uses dropout and L2 regularization to prevent overfitting.<br />
<br />
== CRNN Results vs Benchmarks ==<br />
<br />
In order to benchmark the performance of the CRNN model, as well as compare it to other previous efforts, multiple datasets and classification problems were used. All of these datasets are publicly available and can be easily downloaded by any user for testing.<br />
<br />
'''Movie Reviews:''' a sentiment analysis dataset, with two classes (positive and negative).<br />
<br />
'''Yelp:''' a sentiment analysis dataset, with five classes. For this test, a subset of 120,000 reviews was randomly chosen from each class for a total of 600,000 reviews.<br />
<br />
'''AG's News:''' a news categorization dataset, using only the 4 largest classes from the dataset.<br />
<br />
'''20 Newsgroups:''' a news categorization dataset, again using only 4 large classes from the dataset.<br />
<br />
'''Sogou News:''' a Chinese news categorization dataset, using the 4 largest classes from the dataset.<br />
<br />
'''Yahoo! Answers:''' a topic classification dataset, with 10 classes.<br />
<br />
For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.<br />
<br />
A number of other models are run against the same data after preprocessing, to obtain the following results:<br />
<br />
[[File:table of results.png|550px]]<br />
<br />
The bold results represent the best performing model for a given dataset. These results show that the CRNN model manages to be the best performing in 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.<br />
<br />
Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following:<br />
<br />
[[File:filter_effects.png|550px]]<br />
<br />
== CRNN Model Architecture ==<br />
<br />
'''RNN Pipeline:'''<br />
<br />
The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.<br />
<br />
An RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. At each time step <math> t </math>, the recurrent layer produces two outputs: <math> \hat{y}_{t} </math> and <math> h_t </math>. The hidden state <math> h_t </math> is then fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>. <br />
<br />
The pipeline actually uses a variant of the RNN called the GRU, short for Gated Recurrent Unit. This is done to address the vanishing gradient problem, which causes the network to struggle to remember words that came earlier in the sequence. Traditional RNNs tend to remember only the most recent words in a sequence, which is problematic when words near the beginning of the sequence that are important to the classification problem are forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates. <br />
<br />
Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:<br />
<br />
<math><br />
\begin{align}<br />
z_t &= \sigma(\mathbf{W}_z x_t + \mathbf{U}_z h_{t-1}) \quad \text{(update gate)} \\<br />
r_t &= \sigma(\mathbf{W}_r x_t + \mathbf{U}_r h_{t-1}) \quad \text{(reset gate)} \\<br />
\tilde{h}_t &= \tanh(\mathbf{W}_h x_t + r_t \circ \mathbf{U}_h h_{t-1}) \quad \text{(new memory)} \\<br />
h_t &= (1-z_t) \circ \tilde{h}_t + z_t \circ h_{t-1}<br />
\end{align}<br />
</math><br />
<br />
Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:<br />
<br />
<ol><br />
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li><br />
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li><br />
<li> the reset gate <math>r_t</math> controls how much past information is forgotten or reset </li><br />
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li><br />
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on </li><br />
</ol><br />
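<br />
The four GRU equations above can be sketched in plain Python. This is a minimal illustration with list-based vectors and no bias terms, matching the equations as written here; the helper names <code>matvec</code>, <code>sigmoid</code>, and <code>gru_step</code>, and the all-zero weights in the usage example, are assumptions chosen to keep the arithmetic easy to follow.<br />

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU update following the four equations above (no bias terms)."""
    # Update gate: how much of the previous hidden state to keep.
    z_t = [sigmoid(a + b) for a, b in zip(matvec(W_z, x_t), matvec(U_z, h_prev))]
    # Reset gate: how much past information enters the new memory.
    r_t = [sigmoid(a + b) for a, b in zip(matvec(W_r, x_t), matvec(U_r, h_prev))]
    # New memory: current input plus the gated past memory.
    h_tilde = [
        math.tanh(a + r * b)
        for a, r, b in zip(matvec(W_h, x_t), r_t, matvec(U_h, h_prev))
    ]
    # Blend previous state and new memory element-wise.
    return [(1 - z) * n + z * h for z, n, h in zip(z_t, h_tilde, h_prev)]


# Toy usage with d = 2 inputs and m = 3 hidden units; all weights are zero.
m, d = 3, 2
zeros_md = [[0.0] * d for _ in range(m)]
zeros_mm = [[0.0] * m for _ in range(m)]
h_next = gru_step([1.0, -1.0], [1.0, -2.0, 0.5],
                  zeros_md, zeros_mm, zeros_md, zeros_mm, zeros_md, zeros_mm)
```

With all-zero weights both gates equal 0.5 and the new memory is zero, so the step simply halves the previous hidden state. This makes the role of <math>z_t</math> as a mixing weight easy to see.<br />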
<br />
We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:<br />
<br />
<math> <br />
[\hat{h_t}]_i = \tanh(\overrightarrow{h_t}^T V^i \overleftarrow{h_t} + b_i) <br />
</math><br />
<br />
Here <math>V^i \in \mathbb{R}^{m \times m} </math> is the <math> i </math>-th slice of the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias. We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> biases. Through the neural tensor layer, each element <math> [\hat{h_t}]_i </math> can be viewed as a different type of interaction between the forward and backward hidden states. In the model, <math> \hat{h_t} </math> has the same size as the forward and backward hidden states. We then concatenate the three hidden state vectors to form a new vector that summarizes the context information:<br />
<math><br />
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}^T]^T <br />
</math><br />
<br />
We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.<br />
<br />
<math><br />
H = [\overleftrightarrow{h_1}; \ldots; \overleftrightarrow{h_n}]<br />
</math><br />
<br />
This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.<br />
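<br />
The fusion step and the construction of <math>H</math> can be sketched as follows. This is an illustrative sketch under the assumption that the hidden states are column vectors, so that each element of <math>\hat{h_t}</math> is a bilinear form in the forward and backward states. The names <code>bilinear</code>, <code>fuse</code>, and <code>build_H</code>, and the toy tensor slices, are assumptions introduced here.<br />

```python
import math


def bilinear(u, V, w):
    """Compute u^T V w for vectors u, w and a square list-of-rows matrix V."""
    return sum(
        u_j * sum(V[j][k] * w[k] for k in range(len(w)))
        for j, u_j in enumerate(u)
    )


def fuse(h_fwd, h_bwd, V, b):
    """Neural tensor layer followed by concatenation.

    V is a list of m slices (each m x m) and b is a list of m scalar biases.
    Returns the length-3m vector [forward, backward, fused].
    """
    h_hat = [math.tanh(bilinear(h_fwd, V_i, h_bwd) + b_i) for V_i, b_i in zip(V, b)]
    return h_fwd + h_bwd + h_hat


def build_H(forward_states, backward_states, V, b):
    """Stack the fused representation of every word into an n x 3m matrix."""
    return [fuse(f, bk, V, b) for f, bk in zip(forward_states, backward_states)]


# Toy usage with m = 2 and a single word (n = 1).
V = [
    [[1.0, 0.0], [0.0, 1.0]],  # identity slice: reduces to a dot product
    [[0.0, 0.0], [0.0, 0.0]],  # zero slice: contributes tanh(b_i) only
]
b = [0.0, 0.0]
H = build_H([[1.0, 2.0]], [[3.0, 4.0]], V, b)
```

Each row of <code>H</code> has length <math>3m</math>: the forward state, the backward state, and the <math>m</math> fused values produced by the tensor slices.<br />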
<br />
<br />
'''CNN Pipeline:'''<br />
<br />
The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:<br />
<br />
<ol><br />
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X. <br />
</li><br />
<br />
<li> Word vectors are then convolved along the temporal dimension with filters of various sizes (i.e., different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.<br />
</li><br />
<br />
<ul><br />
<li> The convolution makes this process capture local and position-invariant features. Local means that the K words are contiguous. Position-invariant means that a pattern of K contiguous words is detected by the convolution wherever it occurs in the text.<br />
</li><br />
<br />
<li> Temporal dimension example: convolve words 1 to K, then words 2 to K+1, and so on.<br />
</li><br />
</ul><br />
<br />
<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.<br />
</li><br />
<br />
<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.<br />
</li><br />
<br />
<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.<br />
</li><br />
<br />
</ol><br />
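<br />
The temporal convolution in step 2 can be illustrated with a single filter. The sketch below is a toy example and not the paper's implementation: the two-dimensional "word vectors" are made up for illustration (they are not word2vec embeddings), and the function name <code>kgram_responses</code> is an assumption.<br />

```python
def kgram_responses(X, filt):
    """Slide one filter of width K over the word vectors in X.

    X is a list of n word vectors (each of length d) and filt is a K x d
    weight matrix. Returns n - K + 1 responses, one per window of K
    contiguous words.
    """
    K = len(filt)
    return [
        sum(
            w * x
            for offset in range(K)
            for w, x in zip(filt[offset], X[start + offset])
        )
        for start in range(len(X) - K + 1)
    ]


# Hypothetical 2-dimensional embeddings for three words.
emb = {"not": [1, 0], "good": [0, 1], "film": [0, 0]}
# A width-2 filter that responds most strongly to the bigram "not good".
bigram_filter = [[1, 0], [0, 1]]

responses_a = kgram_responses([emb[w] for w in ["not", "good", "film"]], bigram_filter)
responses_b = kgram_responses([emb[w] for w in ["film", "not", "good"]], bigram_filter)
```

The filter gives the same peak response to "not good" whether the bigram appears at the start or the end of the sentence, which is what is meant by position-invariant. Each response depends only on K contiguous words, which is what is meant by local.<br />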
<br />
== Merging RNN & CNN Pipeline Outputs ==<br />
<br />
The results from the RNN and CNN pipelines can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math>, which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. Concatenating the rows of S results in a vector in <math>\mathbb{R}^{3zm}</math>, which can be passed to a fully connected softmax layer to output a vector of probabilities for our classification task. <br />
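<br />
The merge step is a single matrix product followed by flattening. The sketch below uses small made-up matrices to make the shapes concrete (n = 3 words, z = 2 aspects, 3m = 3); the helper names <code>transpose</code>, <code>matmul</code>, and <code>merge</code> are assumptions introduced here.<br />

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    """Product of two list-of-rows matrices."""
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def merge(A, H):
    """Return S = A^T H (shape z x 3m) and its rows concatenated into one vector."""
    S = matmul(transpose(A), H)
    flat = [value for row in S for value in row]
    return S, flat


# Attention matrix A (n x z): aspect 0 attends to words 0 and 1, aspect 1 to word 2.
A = [[0.5, 0.0], [0.5, 0.0], [0.0, 1.0]]
# RNN output H (n x 3m), one row per word.
H = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
S, flat = merge(A, H)
```

Each row of <code>S</code> is an attention-weighted combination of the rows of <code>H</code> for one aspect, and <code>flat</code> has length <math>3zm</math>, ready to be fed to the softmax layer.<br />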
<br />
To train the model, we make the following decisions:<br />
<ul><br />
<li> Use cross-entropy loss as the loss function </li><br />
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li><br />
<li> Perform L2 regularization on all parameters </li><br />
<li> Use stochastic gradient descent with a learning rate of 0.001 </li><br />
</ul><br />
<br />
== Interpreting Learned CRNN Weights ==<br />
<br />
Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, because n is the number of words in the input sequence and z is the number of aspects being considered in the classification task. <br />
<br />
Furthermore, for a specific aspect, words with higher attention values are more important relative to other words in the same input sequence. For a specific word, aspects with higher attention values make the specific word more important compared to other aspects.<br />
<br />
For example, in this paper, a sentence is sampled from the Movie Reviews dataset and the transpose of attention matrix A is visualized. Each word represents an element of matrix A, the intensity of red represents the magnitude of the attention value in A, and each row shows the relative importance of each word for a specific aspect. In the first row, the words are weighted in terms of a positive aspect; in the last row, in terms of a negative aspect; and in the middle row, in terms of both a positive and a negative aspect. Notice how the relative importance of words is a function of the aspect.<br />
<br />
[[File:Interpretation example.png|800px]]<br />
<br />
<br />
== Conclusion & Summary ==<br />
<br />
This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from different aspects and stores it in a weight matrix. The Recurrent Neural Network learns each word's contextual representation through the combination of the forward and backward context information, which is fused using a neural tensor layer and stored as a matrix. These two matrices are then combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision, as the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]<br />
<br />
== Critiques ==<br />
<br />
In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation.<br />
<br />
In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.<br />
<br />
Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.<br />
<br />
== References ==<br />
----<br />
<br />
[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.<br />
<br />
[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modelling sentences,” arXiv preprint arXiv:1404.2188, 2014.<br />
<br />
[3] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.<br />
<br />
[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings of AAAI, 2015, pp. 2267–2273.<br />
<br />
[5] H. Chen, P. Tiňo, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE<br />
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Cvmustat&diff=45538User:Cvmustat2020-11-21T18:37:35Z<p>Hhalim: /* Conclusion & Summary */</p>
<hr />
<div><br />
<br />
'''Sogou News:''' a Chinese news categorization dataset, using the 4 largest classes from the dataset.<br />
<br />
'''Yahoo! Answers:''' a topic classification dataset, with 10 classes.<br />
<br />
For the English language datasets, the initial word representations were created using the publicly available ''word2vec'' [https://code.google.com/p/word2vec/] from Google news. For the Chinese language dataset, ''jieba'' [https://github.com/fxsjy/jieba] was used to segment sentences, and then 50-dimensional word vectors were trained on Chinese ''wikipedia'' using ''word2vec''.<br />
<br />
A number of other models are run against the same data after preprocessing, to obtain the following results:<br />
<br />
[[File:table of results.png|550px]]<br />
<br />
The bold results represent the best performing model for a given dataset. These results show that the CRNN model manages to be the best performing in 4 of the 6 datasets, with the Self-attentive LSTM beating the CRNN by 0.03 and 0.12 on the news categorization problems. Considering that the CRNN model has better performance than the Self-attentive LSTM on the other 4 datasets, this suggests that the CRNN model is a better performer overall in the conditions of this benchmark.<br />
<br />
Another important result was that the CRNN model filter size impacted performance only in the sentiment analysis datasets, as seen in the following:<br />
<br />
[[File:filter_effects.png|550px]]<br />
<br />
== CRNN Model Architecture ==<br />
<br />
'''RNN Pipeline:'''<br />
<br />
The goal of the RNN pipeline is to input each word in a text, and retrieve the contextual information surrounding the word and compute the contextual representation of the word itself. This is accomplished by use of a bi-directional RNN, such that a Neural Tensor Layer (NTL) can combine the results of the RNN to obtain the final output. RNNs are well-suited to NLP tasks because of their ability to sequentially process data such as ordered text.<br />
<br />
An RNN is similar to a feed-forward neural network, but it relies on the use of hidden states. At each time step <math> t </math>, a recurrent layer produces two outputs: a prediction <math> \hat{y}_{t} </math> and a hidden state <math> h_t </math>. The hidden state <math> h_t </math> is then fed back into the layer to compute <math> \hat{y}_{t+1} </math> and <math> h_{t+1} </math>. <br />
<br />
The pipeline actually uses a variant of the RNN called the GRU, short for Gated Recurrent Unit. This is done to address the vanishing gradient problem, which causes the network to struggle to remember words that came earlier in the sequence. Traditional RNNs are only able to remember the most recent words in a sequence, which may be problematic since words at the beginning of the sequence that are important to the classification problem may be forgotten. A GRU attempts to solve this by controlling the flow of information through the network using update and reset gates. <br />
<br />
Let <math>h_{t-1} \in \mathbb{R}^m, x_t \in \mathbb{R}^d </math> be the inputs, and let <math>\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h \in \mathbb{R}^{m \times d}, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h \in \mathbb{R}^{m \times m}</math> be trainable weight matrices. Then the following equations describe the update and reset gates:<br />
<br />
<math><br />
z_t = \sigma(\mathbf{W}_zx_t + \mathbf{U}_zh_{t-1}) \quad \text{(update gate)} \\<br />
r_t = \sigma(\mathbf{W}_rx_t + \mathbf{U}_rh_{t-1}) \quad \text{(reset gate)} \\<br />
\tilde{h}_t = \tanh(\mathbf{W}_hx_t + r_t \circ \mathbf{U}_hh_{t-1}) \quad \text{(new memory)} \\<br />
h_t = (1-z_t)\circ \tilde{h}_t + z_t\circ h_{t-1} \quad \text{(hidden state)}<br />
</math><br />
<br />
Note that <math> \sigma, \text{tanh}, \circ </math> are all element-wise functions. The above equations do the following:<br />
<br />
<ol><br />
<li> <math>h_{t-1}</math> carries information from the previous iteration and <math>x_t</math> is the current input </li><br />
<li> the update gate <math>z_t</math> controls how much past information should be forwarded to the next hidden state </li><br />
<li> the reset gate <math>r_t</math> controls how much past information is forgotten or reset </li><br />
<li> new memory <math>\tilde{h}_t</math> contains the relevant past memory as instructed by <math>r_t</math> and current information from the input <math>x_t</math> </li><br />
<li> then <math>z_t</math> is used to control what is passed on from <math>h_{t-1}</math> and <math>(1-z_t)</math> controls the new memory that is passed on </li><br />
</ol><br />
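The gate equations above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names <code>gru_step</code> and <code>run_gru</code>, the random initialization, and the toy sizes are assumptions made here for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU update following the four equations above."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev))  # new memory
    return (1.0 - z_t) * h_tilde + z_t * h_prev          # next hidden state

def run_gru(X, params, m, reverse=False):
    """Run the GRU over X (n words by d dims); row t holds the state for word t."""
    n = X.shape[0]
    H = np.zeros((n, m))
    h = np.zeros(m)  # zero initial state, as assumed in the method
    order = range(n - 1, -1, -1) if reverse else range(n)
    for t in order:
        h = gru_step(X[t], h, *params)
        H[t] = h
    return H

# Toy example: n = 5 words, d = 4 embedding dims, m = 3 hidden units (all hypothetical)
rng = np.random.default_rng(0)
n, d, m = 5, 4, 3
X = rng.normal(size=(n, d))
params = [rng.normal(scale=0.1, size=s) for s in [(m, d), (m, m)] * 3]
H_fwd = run_gru(X, params, m)                # forward hidden states
H_bwd = run_gru(X, params, m, reverse=True)  # backward states (weights shared here only for brevity)
```

In a real bi-directional model the forward and backward passes would have separate weights; they are shared here only to keep the sketch short.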
<br />
We treat <math>h_0</math> and <math> h_{n+1} </math> as zero vectors in the method. Thus, each <math>h_t</math> can be computed as above to yield results for the bi-directional RNN. The resulting hidden states <math>\overrightarrow{h_t}</math> and <math>\overleftarrow{h_t}</math> contain contextual information around the <math> t</math>-th word in forward and backward directions respectively. Contrary to convention, instead of concatenating these two vectors, it is argued that the word's contextual representation is more precise when the context information from different directions is collected and fused using a neural tensor layer as it permits greater interactions among each element of hidden states. Using these two vectors as input to the neural tensor layer, <math>V^i </math>, we compute a new representation that aggregates meanings from the forward and backward hidden states more accurately as follows:<br />
<br />
<math> <br />
[\hat{h_t}]_i = \tanh(\overrightarrow{h_t}^T V^i\overleftarrow{h_t} + b_i) <br />
</math><br />
<br />
where <math>V^i \in \mathbb{R}^{m \times m} </math> is a slice of the learned tensor layer, and <math> b_i \in \mathbb{R} </math> is the bias. We repeat this <math> m </math> times with different <math>V^i </math> matrices and <math> b_i </math> scalars. Through the neural tensor layer, each element <math> [\hat{h_t}]_i </math> can be viewed as a different type of interaction between the forward and backward hidden states. In the model, <math> \hat{h_t} </math> has the same size as the forward and backward hidden states. We then concatenate the three hidden state vectors to form a new vector that summarizes the context information:<br />
<math><br />
\overleftrightarrow{h_t} = [\overrightarrow{h_t}^T,\overleftarrow{h_t}^T,\hat{h_t}^T]^T <br />
</math><br />
<br />
We calculate this vector for every word in the text and then stack them all into matrix <math> H </math> with shape <math>n</math>-by-<math>3m</math>.<br />
<br />
<math><br />
H = [\overleftrightarrow{h_1}; \ldots; \overleftrightarrow{h_n}]<br />
</math><br />
<br />
This <math>H</math> matrix is then forwarded as the results from the Recurrent Neural Network.<br />
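As a rough illustration of the fusion step, the sketch below computes the neural tensor layer output for each word and stacks the concatenated vectors into <math>H</math>. The function names <code>tensor_fusion</code> and <code>context_matrix</code> and the random inputs are assumptions for the example, and the hidden states are treated as column vectors so that each element is a bilinear form.

```python
import numpy as np

def tensor_fusion(h_fwd, h_bwd, V, b):
    """Element i is tanh(h_fwd^T V[i] h_bwd + b[i]); V has shape (m, m, m)."""
    return np.tanh(np.einsum("j,ijk,k->i", h_fwd, V, h_bwd) + b)

def context_matrix(H_fwd, H_bwd, V, b):
    """Stack [forward, backward, fused] for every word into an n-by-3m matrix."""
    rows = [
        np.concatenate([hf, hb, tensor_fusion(hf, hb, V, b)])
        for hf, hb in zip(H_fwd, H_bwd)
    ]
    return np.stack(rows)

# Toy sizes (hypothetical): n = 5 words, m = 3 hidden units
rng = np.random.default_rng(1)
n, m = 5, 3
H_fwd = rng.normal(size=(n, m))
H_bwd = rng.normal(size=(n, m))
V = rng.normal(scale=0.1, size=(m, m, m))
b = np.zeros(m)
H = context_matrix(H_fwd, H_bwd, V, b)  # shape (5, 9)
```

Each of the <math>m</math> slices of <code>V</code> lets every forward element interact with every backward element, which plain concatenation cannot do.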
<br />
<br />
'''CNN Pipeline:'''<br />
<br />
The goal of the CNN pipeline is to learn the relative importance of words in an input sequence based on different aspects. The process of this CNN pipeline is summarized as the following steps:<br />
<br />
<ol><br />
<li> Given a sequence of words, each word is converted into a word vector using the word2vec algorithm which gives matrix X. <br />
</li><br />
<br />
<li> Word vectors are then convolved along the temporal dimension with filters of various sizes (i.e. different K) with learnable weights to capture various numerical K-gram representations. These K-gram representations are stored in matrix C.<br />
<br />
<ul><br />
<li> The convolution makes this process capture local and position-invariant features. Local means the K words are contiguous. Position-invariant means that K contiguous words are detected at any position, in this case via convolution.</li><br />
<br />
<li> Temporal dimension example: convolve words 1 to K, then convolve words 2 to K+1, etc.<br />
</li><br />
</ul><br />
</li><br />
<br />
<li> Since not all K-gram representations are equally meaningful, there is a learnable matrix W which takes the linear combination of K-gram representations to more heavily weigh the more important K-gram representations for the classification task.<br />
</li><br />
<br />
<li> Each linear combination of the K-gram representations gives the relative word importance based on the aspect that the linear combination encodes.<br />
</li><br />
<br />
<li> The relative word importance vs aspect gives rise to an interpretable attention matrix A, where each element says the relative importance of a specific word for a specific aspect.<br />
</li><br />
<br />
</ol><br />
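The steps above can be sketched roughly as follows. The summary does not state the padding scheme, the nonlinearity, or how the attention values are normalized, so the zero padding, the <code>tanh</code>, and the softmax over words used here are assumptions for illustration only, as are the names <code>kgram_features</code> and <code>attention_matrix</code>.

```python
import numpy as np

def kgram_features(X, filters):
    """X: (n, d) word vectors; filters: list of (K, d) arrays. Returns C with shape (n, f)."""
    n, d = X.shape
    cols = []
    for F in filters:
        K = F.shape[0]
        # Assumption: zero-pad at the end so each word position starts one K-gram window
        Xp = np.vstack([X, np.zeros((K - 1, d))])
        cols.append([np.sum(Xp[t:t + K] * F) for t in range(n)])
    return np.tanh(np.array(cols).T)

def attention_matrix(C, W):
    """Linear combinations of K-gram features, normalized over words for each aspect."""
    scores = C @ W                                       # (n, z)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)              # each column sums to 1

# Toy sizes (hypothetical): n = 6 words, d = 4 dims, filter sizes K = 2 and 3, z = 2 aspects
rng = np.random.default_rng(2)
n, d, z = 6, 4, 2
X = rng.normal(size=(n, d))
filters = [rng.normal(size=(K, d)) for K in (2, 3)]
C = kgram_features(X, filters)   # (6, 2)
W = rng.normal(size=(C.shape[1], z))
A = attention_matrix(C, W)       # (6, 2), one column per aspect
```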
<br />
== Merging RNN & CNN Pipeline Outputs ==<br />
<br />
The results from the RNN and CNN pipelines can be merged by simply multiplying the output matrices. That is, we compute <math>S=A^TH</math>, which has shape <math>z \times 3m</math> and is essentially a linear combination of the hidden states. Concatenating the rows of <math>S</math> results in a vector in <math>\mathbb{R}^{3zm}</math>, which can be passed to a fully connected Softmax layer to output a vector of probabilities for our classification task. <br />
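A minimal sketch of this merging step is given below, assuming an attention matrix <math>A</math> of shape <math>n \times z</math> and a context matrix <math>H</math> of shape <math>n \times 3m</math>. The output weights <code>W_out</code> and <code>b_out</code>, the number of classes, and the function names are hypothetical.

```python
import numpy as np

def classify(A, H, W_out, b_out):
    """Merge the two pipelines and return class probabilities."""
    S = A.T @ H                  # (z, 3m): attention-weighted sums of hidden states
    s = S.reshape(-1)            # concatenated rows, length 3zm
    logits = W_out @ s + b_out
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def cross_entropy(p, label):
    """Cross-entropy loss for a single example with an integer class label."""
    return -np.log(p[label])

# Toy sizes (hypothetical): n = 5 words, z = 2 aspects, m = 3 hidden units, 4 classes
rng = np.random.default_rng(3)
n, z, m, n_classes = 5, 2, 3, 4
A = rng.random(size=(n, z))
H = rng.normal(size=(n, 3 * m))
W_out = rng.normal(scale=0.1, size=(n_classes, 3 * z * m))
b_out = np.zeros(n_classes)
probs = classify(A, H, W_out, b_out)
loss = cross_entropy(probs, label=1)
```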
<br />
To train the model, we make the following decisions:<br />
<ul><br />
<li> Use cross-entropy loss as the loss function </li><br />
<li> Perform dropout on random columns in matrix C in the CNN pipeline </li><br />
<li> Perform L2 regularization on all parameters </li><br />
<li> Use stochastic gradient descent with a learning rate of 0.001 </li><br />
</ul><br />
<br />
== Interpreting Learned CRNN Weights ==<br />
<br />
Recall that attention matrix A essentially stores the relative importance of every word in the input sequence for every aspect chosen. Naturally, this means that A is an n-by-z matrix, because n is the number of words in the input sequence and z is the number of aspects being considered in the classification task. <br />
<br />
Furthermore, for a specific aspect, words with higher attention values are more important relative to other words in the same input sequence. For a specific word, aspects with higher attention values make the specific word more important compared to other aspects.<br />
<br />
For example, in this paper, a sentence is sampled from the Movie Reviews dataset and the transpose of attention matrix A is visualized. Each word represents an element in matrix A, the intensity of red represents the magnitude of an attention value in A, and each row shows the relative importance of each word for a specific aspect. In the first row, the words are weighted in terms of a positive aspect; in the last row, the words are weighted in terms of a negative aspect; and in the middle row, the words are weighted in terms of both a positive and a negative aspect. Notice how the relative importance of words is a function of the aspect.<br />
<br />
[[File:Interpretation example.png|800px]]<br />
<br />
<br />
== Conclusion & Summary ==<br />
<br />
This paper proposed a new architecture, the Convolutional Recurrent Neural Network, for text classification. The Convolutional Neural Network is used to learn the relative importance of each word from different aspects and stores it in a weight matrix. The Recurrent Neural Network learns each word's contextual representation by combining the forward and backward context information, which is fused using a neural tensor layer and stored as a matrix. These two matrices are combined to get the text representation used for classification. Although the specifics of the performed tests are lacking, the experiment's results indicate that the proposed method performed well in comparison to most previous methods. In addition to performing well, the proposed method also provides insight into which words contribute greatly to the classification decision, as the learned matrix from the Convolutional Neural Network stores the relative importance of each word. This information can then be used in other applications or analyses. In the future, one can explore the features extracted from the model and use them to potentially learn new methods such as model space. [5]<br />
<br />
== Critiques ==<br />
<br />
In the '''Method''' section of the paper, some explanations used the same notation for multiple different elements of the model. This made the paper harder to follow and understand since they were referring to different elements by identical notation.<br />
<br />
In the '''Results''' section of the paper, they tried to show that the classification results from the CRNN model can be better interpreted than other models. In these explanations, the details were lacking and the authors did not adequately demonstrate how their model is better than others.<br />
<br />
Finally, in the '''Results''' section again, the paper compares the CRNN model to several models which they did not implement and reproduce results with. This can be seen in the chart of results above, where several models do not have entries in the table for all six datasets. Since the authors used a subset of the datasets, these other models which were not reproduced could have different accuracy scores if they had been tested on the same data as the CRNN model. This difference in training and testing data is not mentioned in the paper, and the conclusion that the CRNN model is better in all cases may not be valid.<br />
<br />
== References ==<br />
----<br />
<br />
[1] Grimes, Seth. “Unstructured Data and the 80 Percent Rule.” Breakthrough Analysis, 1 Aug. 2008, breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.<br />
<br />
[2] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modelling sentences,”<br />
arXiv preprint arXiv:1404.2188, 2014.<br />
<br />
[3] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning<br />
phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint<br />
arXiv:1406.1078, 2014.<br />
<br />
[4] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural networks for text classification,” in Proceedings<br />
of AAAI, 2015, pp. 2267–2273.<br />
<br />
[5] H. Chen, P. Tio, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE<br />
Transactions on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 124–136, 2014.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Summary_for_survey_of_neural_networked-based_cancer_prediction_models_from_microarray_data&diff=45535Summary for survey of neural networked-based cancer prediction models from microarray data2020-11-21T18:21:34Z<p>Hhalim: /* Neural network-based cancer prediction models */</p>
<hr />
<div>== Presented by == <br />
Rao Fu, Siqi Li, Yuqin Fang, Zeping Zhou<br />
<br />
== Introduction == <br />
Microarray technology is widely used in analyzing genetic diseases as it can help researchers detect genetic information rapidly. In the study of cancer, researchers use this technology to compare normal and cancerous tissues so that they can gain a better understanding of the pathology of cancer. However, the high dimensionality of the gene expressions can hurt both the accuracy and the computation time of a cancer model. To cope with this problem, we need to use feature selection or feature creation methods. <br />
One of the most powerful methods in machine learning is neural networks. In this paper, we will review the latest neural network-based cancer prediction models by presenting the methodology of preprocessing, filtering, prediction, and clustering gene expressions.<br />
<br />
== Background == <br />
<br />
'''Neural Network''' <br><br />
Neural networks are often used to solve complex non-linear problems. A neural network is an operational model consisting of a large number of neurons connected to each other by different weights. In this network structure, each neuron is associated with an activation function, for example the sigmoid or rectified linear activation function. To train the network, the inputs are fed forward and the activation function value is calculated at every neuron. The difference between the output of the neural network and the desired output is called the error.<br />
The backpropagation mechanism is one of the most commonly used algorithms for training neural networks. With this algorithm, we optimize the objective function by propagating the generated error back through the network to adjust the weights.<br />
In the next sections, we will use the above algorithm, but with different network architectures and different numbers of neurons, to review the neural network-based cancer prediction models for learning the gene expression features.<br />
<br />
'''Cancer prediction models'''<br><br />
Cancer prediction models often combine more than one method to achieve high prediction accuracy and a more accurate prognosis, and they also aim to reduce the cost to patients.<br />
<br />
High dimensionality and spatial structure are the two main factors that can affect the accuracy of the cancer prediction models. They add irrelevant noisy features to our selected models. We have 3 ways to determine the accuracy of a model.<br />
<br />
The first is the ROC curve, which reflects the sensitivity of the response to the same signal stimulus under different criteria. To test its validity, we need to consider it together with its confidence interval. Usually, a model is considered good when its area under the ROC curve (AUC) is greater than 0.7. Another way to measure the performance of a model is the concordance index (CI), which estimates the concordance probability between the predicted and observed survival. A CI of 0.5 corresponds to random prediction and a CI of 1 to perfect concordance, so higher values indicate a better model. The third measurement is the Brier score, which measures the average squared difference between the observed and the estimated survival rate in a given period of time. It ranges from 0 to 1, and a lower score indicates higher accuracy.<br />
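To make two of these measures concrete, here is a small sketch of the Brier score and a pairwise concordance index. It is a simplification under stated assumptions: the Brier score is computed for a binary outcome at a single fixed time horizon, and the concordance index ignores censoring, which real survival analyses must handle. The function names and toy data are hypothetical.

```python
import numpy as np

def brier_score(observed, predicted):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((predicted - observed) ** 2)

def concordance_index(times, risks):
    """Fraction of comparable pairs where the higher predicted risk has the shorter survival time."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # tied survival times are not comparable
            comparable += 1
            shorter, longer = (i, j) if times[i] < times[j] else (j, i)
            if risks[shorter] > risks[longer]:
                concordant += 1.0
            elif risks[shorter] == risks[longer]:
                concordant += 0.5  # tied predictions count as half
    return concordant / comparable

perfect = concordance_index([1, 2, 3], [0.9, 0.5, 0.1])   # risk decreases as survival time grows
reversed_ = concordance_index([1, 2, 3], [0.1, 0.5, 0.9])
uninformative = brier_score([1, 0], [0.5, 0.5])
```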
<br />
== Neural network-based cancer prediction models ==<br />
The papers were chosen through an extensive search on neural network-based cancer prediction using Google Scholar and other electronic databases, namely PubMed and Scopus, with keywords such as “Neural Networks AND Cancer Prediction” and “gene expression clustering”. The chosen papers cover cancer classification, discovery, survivability prediction, and statistical analysis models. Figure 1 shows the number of citations of the chosen papers, grouped into filtering, predictive, and clustering methods. [[File:f1.png]]<br />
<br />
'''Datasets and preprocessing''' <br><br />
Most studies investigating automatic cancer prediction and clustering used datasets such as the TCGA, UCI, NCBI Gene Expression Omnibus and Kentridge biomedical databases. A few techniques are used in processing the datasets, including removing the genes that have zero expression across all samples, normalization, filtering with p-value > <math>10^{-5}</math> to remove some unwanted technical variation, and log2 transformations. Statistical methods and neural networks were applied to reduce the dimensionality of the gene expressions by selecting a subset of genes. Principal Component Analysis (PCA) can also be used as an initial preprocessing step to extract the dataset's features. The PCA method linearly transforms the dataset features into a lower dimensional space without capturing the complex relationships between the features. However, simply removing the genes that were not measured by the other datasets could not overcome the class imbalance problem. In that case, one study used the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic minority class samples, which may lead to a sparse matrix problem. Clustering was also applied in some studies for labeling data by grouping the samples into high-risk groups, low-risk groups, and so on. <br />
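A few of these preprocessing steps can be sketched as follows: dropping genes with zero expression in every sample, applying a log2 transformation, and projecting with PCA. The pseudocount of 1 inside the logarithm, the function name <code>preprocess</code>, and the synthetic count matrix are assumptions made for this example and are not taken from the surveyed studies.

```python
import numpy as np

def preprocess(expr, n_components):
    """expr: (samples, genes) non-negative expression matrix."""
    # Remove genes that have zero expression across all samples
    keep = expr.sum(axis=0) > 0
    # log2 transform; the +1 pseudocount (an assumption) avoids log2(0)
    X = np.log2(expr[:, keep] + 1.0)
    # PCA via SVD of the centered matrix: a linear map to a lower dimensional space
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, keep

# Synthetic example (hypothetical): 6 samples, 10 genes, gene 3 never expressed
rng = np.random.default_rng(4)
expr = rng.poisson(5.0, size=(6, 10)).astype(float)
expr[:, 3] = 0.0
Z, keep = preprocess(expr, n_components=2)  # Z has shape (6, 2)
```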
<br />
The following table presents the dataset used by considered reference, the applied normalization technique, the cancer type and the dimensionality of the datasets.<br />
[[File:Datasets and preprocessing.png]]<br />
<br />
'''Neural network architecture''' <br><br />
Most recent studies reveal that filtering, predicting, and clustering methods are used in cancer prediction. For filtering, the resulting features are used with statistical methods or machine learning classification and clustering tools such as decision trees, K Nearest Neighbor and Self Organizing Maps (SOM), as figure 2 indicates.[[File:filtering gane.png]]<br />
<br />
All the neurons in the neural network work together as feature detectors to learn the features from the input. Our categorization into filtering, predicting, and clustering methods is based on the overall role that a neural network performs in the cancer prediction method. Filtering methods are trained to remove the input’s noise and to extract the most representative features that best describe the unlabeled gene expressions. Predicting methods are trained to extract the features that are significant to prediction; therefore, their objective functions measure how accurately the network is able to predict the class of an input. Clustering methods are trained to divide unlabeled samples into groups based on their similarities.<br />
<br />
'''Building neural networks-based approaches for gene expression prediction''' <br><br />
According to our survey, the representative codes are generated by filtering methods with dimensionality M smaller than or equal to N, where N is the dimensionality of the input. Other machine learning algorithms such as naïve Bayes or k-means can be used together with the filtering.<br />
Predictive neural networks are supervised and aim for the best classification accuracy; meanwhile, clustering methods are unsupervised and group similar samples or genes together. <br />
The goal of training a predictive network is to enhance its classification capability, and the goal of training a clustering network is to find the optimal group for a new test set with unknown labels.<br />
<br />
'''Neural network filters for cancer prediction''' <br><br />
Autoencoders are increasingly used as a preprocessing step before classification, clustering, and statistical analysis, to extract generic genomic features. An autoencoder is composed of an encoder part and a decoder part. The encoder learns the mapping between the high-dimensional unlabeled input I(x) and the low-dimensional representations in the middle layer(s), and the decoder learns the mapping from the middle layer’s representation to the high-dimensional output O(x). The reconstruction of the input can take the Root Mean Squared Error (RMSE) or the Logloss function as the objective function. <br />
<br />
$$ RMSE = \sqrt{ \frac{\sum{(I(x)-O(x))^2}}{n} } $$<br />
<br />
$$ Logloss = -\sum{(I(x)\log(O(x)) + (1 - I(x))\log(1 - O(x)))} $$<br />
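The two reconstruction objectives can be computed as in the sketch below. The log loss is written here with a leading minus sign so that it is a quantity to minimize, and the clipping constant <code>eps</code> is an assumption added to avoid taking the logarithm of zero; the function names and toy vectors are hypothetical.

```python
import numpy as np

def rmse(I, O):
    """Root mean squared error between input I and reconstruction O."""
    I, O = np.asarray(I, dtype=float), np.asarray(O, dtype=float)
    return np.sqrt(np.mean((I - O) ** 2))

def log_loss(I, O, eps=1e-12):
    """Cross-entropy between inputs in [0, 1] and reconstructions in (0, 1)."""
    I, O = np.asarray(I, dtype=float), np.asarray(O, dtype=float)
    O = np.clip(O, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(I * np.log(O) + (1.0 - I) * np.log(1.0 - O))

x = np.array([1.0, 0.0, 1.0, 0.0])       # toy input scaled to [0, 1]
x_rec = np.array([0.9, 0.1, 0.8, 0.2])   # toy reconstruction
print(rmse(x, x_rec), log_loss(x, x_rec))
```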
<br />
There are several types of autoencoders, such as stacked denoising autoencoders, contractive autoencoders, sparse autoencoders, regularized autoencoders and variational autoencoders. The architecture of the networks varies in many parameters, such as depth and loss function. Each of the autoencoder examples mentioned above has a different number of hidden layers, different activation functions (e.g. sigmoid function, exponential linear unit function), and different optimization algorithms (e.g. stochastic gradient descent optimization, Adam optimizer).<br />
<br />
The outputs of the neural network filtering methods were then used by different statistical methods and classifiers. The conventional methods include Cox regression model analysis, Support Vector Machine (SVM), K-means clustering, t-SNE and so on. The classifiers could be SVM, AdaBoost, or others.<br />
<br />
By using neural network filtering methods, the model can be trained to learn low-dimensional representations, remove noises from the input, and gain better generalization performance by re-training the classifier with the newest output layer.<br />
<br />
'''Neural network prediction methods for cancer''' <br><br />
The prediction based on neural networks can build a network that maps the input features to an output with a number of neurons, which could be one or two for binary classification, or more for multi-class classification. It can also build several independent binary neural networks for the multi-class classification, where the technique called “one-hot encoding” is applied.<br />
<br />
The codeword is a binary string C’k of length k whose j’th position is set to 1 for the j’th class, while other positions remain 0. The process of the neural networks is to map the input to the codeword iteratively, whose objective function is minimized in each iteration.<br />
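For illustration, a codeword of this form can be built and decoded as follows. The helper names are hypothetical, and classes are indexed from 0 here, whereas the text counts positions from 1.

```python
import numpy as np

def one_hot(j, k):
    """Binary codeword of length k with a 1 at (0-based) position j."""
    code = np.zeros(k, dtype=int)
    code[j] = 1
    return code

def decode(output):
    """Map a network output vector back to the class with the largest activation."""
    return int(np.argmax(output))

codeword = one_hot(2, 5)                                   # class 2 of 5
predicted = decode(np.array([0.1, 0.2, 0.6, 0.05, 0.05]))  # largest activation at index 2
```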
<br />
Such cancer classifiers were applied to identify cancerous/non-cancerous samples, a specific cancer type, or the survivability risk. MLP models were used to predict the survival risk of lung cancer patients with several gene expressions as input. The deep generative model DeepCancer, the RBM-SVM and RBM-logistic regression models, the convolutional feedforward model DeepGene, Extreme Learning Machines (ELM), the one-dimensional convolutional framework model SE1DCNN, and the GA-ANN model are all used for solving the cancer problems mentioned above. This paper indicates that neural networks with an MLP architecture perform better as classifiers than SVM, logistic regression, naïve Bayes, classification trees and KNN.<br />
<br />
'''Neural network clustering methods in cancer prediction''' <br><br />
Neural network clustering belongs to unsupervised learning. The input data are divided into different groups according to their feature similarity.<br />
The single-layered neural network SOM, which is unsupervised and has no backpropagation mechanism, is one of the traditional model-based techniques applied to gene expression data. Its accuracy can be measured by the Rand Index (RI), which can be refined into the Adjusted Rand Index (ARI), or by Normalized Mutual Information (NMI).<br />
<br />
$$ RI=\frac{TP+TN}{TP+TN+FP+FN}$$<br />
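The Rand Index counts pairs of samples rather than individual samples. The sketch below makes that explicit; the function name and the toy labelings are hypothetical. A pair is a true positive when both labelings put the two samples in the same group, and a true negative when both put them in different groups.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand Index computed over all pairs of samples."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif not same_true and not same_pred:
            tn += 1
        elif same_pred:
            fp += 1  # grouped together by the clustering but not in the reference
        else:
            fn += 1  # grouped together in the reference but split by the clustering
    return (tp + tn) / (tp + tn + fp + fn)

same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # cluster names differ, grouping is identical
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only co-membership matters, relabeling the clusters does not change the score.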
<br />
In general, gene expression clustering considers either the relevance of sample-to-cluster assignment or that of gene-to-cluster assignment, or both. To solve the high dimensionality problem, there are two methods: clustering ensembles, which run a single clustering algorithm several times, each with a different initialization or number of parameters; and projective clustering, which considers only a subset of the original features.<br />
<br />
SOM was applied to discriminate future tumor behavior using molecular alterations, a result that was not easy to obtain with classic statistical models. This paper then introduces two ensemble clustering frameworks: Random Double Clustering-based Cluster Ensembles (RDCCE) and Random Double Clustering-based Fuzzy Cluster Ensembles (RDCFCE). Their accuracies are high, but they do not take gene-to-cluster assignment into consideration.<br />
<br />
Also, the paper provides double SOM based Clustering Ensemble Approach (SOM2CE) and double NG-based Clustering Ensemble Approach (NG2CE), which are robust to noisy genes. Moreover, Projective Clustering Ensemble (PCE) combines the advantages of both projective clustering and ensemble clustering, which is better than SOM and RDCFCE when there are irrelevant genes.<br />
<br />
== Summary ==<br />
<br />
Cancer is a disease with a very high fatality rate that spreads worldwide, and it’s essential to analyze gene expression for discovering gene abnormalities and increasing survivability as a consequence. The previous analysis in the paper reveals that neural networks are essentially used for filtering the gene expressions, predicting their class, or clustering them.<br />
<br />
Neural network filtering methods are used to reduce the dimensionality of the gene expressions and remove their noise. In the article, the authors recommended deep architectures more than shallow architectures for best practice as they combine many nonlinearities. <br />
<br />
Neural network prediction methods can be used for both binary and multi-class problems. In binary cases, the network architecture has only one or two output neurons that diagnose a given sample as cancerous or non-cancerous, while the number of output neurons in multi-class problems is equal to the number of classes. The authors suggested that deep architectures with convolution layers, the most recently used models, proved efficient in predicting cancer subtypes as they capture the spatial correlations between gene expressions.<br />
Clustering is another analysis tool that is used to divide the gene expressions into groups. The authors indicated that a hybrid approach combining both the ensembling clustering and projective clustering is more accurate than using single-point clustering algorithms such as SOM since those methods do not have the capability to distinguish the noisy genes.<br />
<br />
==Discussion==<br />
There are some technical problems that can be considered and improved for building new models. <br><br />
<br />
1. Overfitting: Since gene expression datasets are high dimensional and have a relatively small number of samples, a model is likely to fit the training data well but be inaccurate on test samples due to a lack of generalization capability. Ways to avoid overfitting include: (1) adding weight penalties using regularization; (2) averaging the predictions of many models trained on different datasets; (3) dropout; (4) augmenting the dataset to produce more "observations".<br><br />
<br />
2. Model configuration and training: In order to reduce both the computational and memory expenses but also with high prediction accuracy, it’s crucial to properly set the network parameters. The possible ways can be: (1). proper initialization; (2). pruning the unimportant connections by removing the zero-valued neurons; (3). using ensemble learning framework by training different models using different parameter settings or using different parts of the dataset for each base model; (4). Using SMOTE for dealing with class imbalance on the high dimensional level. <br><br />
<br />
3. Model evaluation: Braga-Neto and Dougherty showed in their research that cross-validation displays excessive variance and is therefore unreliable for small datasets. The bootstrap method proved to be more accurate.<br><br />
<br />
4. Study reproducibility: A study needs to be reproducible to enhance research reliability, so that others can replicate the results using the same algorithms, data, and methodology.<br />
<br />
==Conclusion==<br />
This paper reviewed the most recent neural network-based cancer prediction models and gene expression analysis tools. The analysis indicates that the neural network methods are able to serve as filters, predictors, and clustering methods, and also showed that the role of the neural network determines its general architecture. To give suggestions for future neural network-based approaches, the authors highlighted some critical points that have to be considered such as overfitting and class imbalance, and suggest choosing different network parameters or combining two or more of the presented approaches. One of the biggest challenges for cancer prediction modelers is deciding on the network architecture (i.e. the number of hidden layers and neurons), as there are currently no guidelines to follow to obtain high prediction accuracy.<br />
<br />
==Critiques==<br />
<br />
While the results indicate that the functionality of the neural network determines its general architecture, the number of hidden layers, the number of neurons, the hyperparameters, and the learning algorithm are all chosen by trial and error. Improvements in this area might therefore need to be explored in order to obtain better results and to make more convincing statements.<br />
<br />
==Reference==<br />
Daoud, M., & Mayo, M. (2019). A survey of neural network-based cancer prediction models from microarray data. Artificial Intelligence in Medicine, 97, 204–214.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playing&diff=45533what game are we playing2020-11-21T18:14:31Z<p>Hhalim: /* Rock, Paper, Scissors */</p>
<hr />
<div>== Authors == <br />
Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran <br />
<br />
== Introduction ==<br />
Recently, there have been many different studies of methods using AI to solve large-scale, zero-sum, extensive form problems. However, most of these works operate under the assumption that the parameters of the game are known, and the objective is just finding the optimal strategy for the game. This scenario is unrealistic since most of the time parameters of the game are unknown. This paper proposes a framework for finding an optimal solution using a primal-dual Newton Method, and then using back-propagation to analytically compute the gradients of all the relevant game parameters.<br />
<br />
The approach to solving this problem is to consider ''quantal response equilibrium'' (QRE), which is a generalization of Nash equilibrium (NE) where the agents can make suboptimal decisions. It is shown that the solution to the QRE is a differentiable function of the payoff matrix. Consequently, back-propagation can be used to analytically solve for the payoff matrix (or other game parameters). This strategy has many future application areas as it allows for game-solving (both extensive and normal form) to be integrated as a module in a deep neural network.<br />
<br />
An example architecture is presented below:<br />
<br />
[[File:Framework.png ]]<br />
<br />
The payoff matrix <math> P </math> is parameterized by a domain-dependent, low-dimensional vector <math> \phi </math>, which is the output of a differentiable function <math> M_1(x) </math>. The QRE solver is then applied to <math> P </math> to obtain the equilibrium strategies <math> (u^∗, v^∗) </math>. Lastly, the loss is computed after passing the strategies through any differentiable function <math> M_2(u^∗, v^∗) </math>.<br />
<br />
The effectiveness of this model is demonstrated using the games “Rock, Paper, Scissors”, one-card poker, and a security defense game.<br />
<br />
== Learning and Quantal Response in Normal Form Games ==<br />
<br />
The game-solving module provides all the elements required for differentiable learning: it maps contextual features to payoff matrices and computes equilibrium strategies under a given set of contextual features. The paper focuses on learning zero-sum games and starts with normal form games, since their solver and learning approach capture much of the intuition and basic methodology.<br />
<br />
=== Zero-Sum Normal Form Games ===<br />
<br />
In two-player zero-sum games there is a '''payoff matrix''' <math>P</math> that describes the rewards for two players employing specific strategies u and v respectively. The optimal strategy mixture may be found with a classic min-max formulation:<br />
$$\min_u \max_v \ u^T P v \\ \text{subject to } 1^T u =1, \ u \ge 0, \\ 1^T v =1, \ v \ge 0. $$<br />
<br />
Here, we consider the case where <math>P</math> is not known a priori. The solution <math> (u^*, v_0) </math> to this optimization and the solution <math> (u_0,v^*) </math> to the corresponding problem with inverse player order form the Nash equilibrium <math>(u^*,v^*) </math>. At this equilibrium, neither player has anything to gain by changing strategy, so this point is a stable state of the system. When the payoff matrix P is not known, we instead observe samples of actions <math> a^{(i)}, i =1,...,N </math> from one or both players. These actions are sampled from the equilibrium strategies <math>(u^*,v^*) </math>, which depend on some external context <math> x </math>. The goal is to recover the true underlying payoff matrix P, or a functional form P(x) that depends on the current context.<br />
<br />
=== Quantal Response Equilibria ===<br />
<br />
However, the NE is poorly suited to this setting because NEs are overly strict, discontinuous with respect to P, and may not be unique. To address these issues, the players' actions are modeled with the '''quantal response equilibria''' (QRE), where noise is added to the payoff matrix. Specifically, consider the ''logit'' equilibrium for zero-sum games, which obeys the fixed point:<br />
$$<br />
u^* _i = \frac {\exp(-Pv)_i}{\sum_{q \in [n]} \exp (-Pv)_q}, \ v^* _j= \frac {\exp(P^T u)_j}{\sum_{q \in [m]} \exp (P^T u)_q} .\qquad \ (1)<br />
$$<br />
For a fixed opponent strategy, the logit response is the solution of a best-response problem regularized by the Gibbs entropy. That problem is strictly convex, and thus the regularized best response is unique.<br />
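The fixed point in (1) can be checked numerically. The following is a minimal Python sketch, not code from [1]: the function names are hypothetical, and the row player is assumed to be the minimizer, so that its response is proportional to <math> \exp(-Pv) </math>.<br />

```python
import math

def softmax(scores):
    """Numerically stable softmax of a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logit_response_u(P, v):
    """Row (minimizing) player: u_i proportional to exp(-(P v)_i)."""
    Pv = [sum(P[i][j] * v[j] for j in range(len(v))) for i in range(len(P))]
    return softmax([-s for s in Pv])

def logit_response_v(P, u):
    """Column (maximizing) player: v_j proportional to exp((P^T u)_j)."""
    PTu = [sum(P[i][j] * u[i] for i in range(len(u))) for j in range(len(P[0]))]
    return softmax(PTu)

def fixed_point_gap(P, u, v):
    """Largest absolute violation of the two fixed-point equations in (1)."""
    gap_u = max(abs(a - b) for a, b in zip(u, logit_response_u(P, v)))
    gap_v = max(abs(a - b) for a, b in zip(v, logit_response_v(P, u)))
    return max(gap_u, gap_v)

# Standard Rock, Paper, Scissors payoffs (illustrative values only).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
print(fixed_point_gap(RPS, uniform, uniform))
```

For standard Rock, Paper, Scissors the gap at uniform play is zero up to rounding, so uniform play is a logit equilibrium of that game.<br />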
<br />
=== End-to-End Learning ===<br />
<br />
To integrate the zero-sum solver into a learning pipeline, [1] introduced a method to solve for the QRE and to differentiate through its solution.<br />
<br />
'''QRE solver''':<br />
Finding the fixed point in (1) is equivalent to solving the regularized min-max game:<br />
$$<br />
\min_{u \in \mathbb{R}^n} \max_{v \in \mathbb{R}^m} \ u^T P v -H(v) + H(u) \\<br />
\text{subject to } 1^T u =1, \ 1^T v =1, <br />
$$<br />
where H(y) is the Gibbs entropy, written here with the sign convention <math> H(y) = \sum_i y_i \log y_i</math>.<br />
Entropy regularization guarantees that the non-negativity constraints are satisfied and makes the equilibrium continuous with respect to P. Intuitively, players are encouraged to play more randomly, and all actions have non-zero probability. Moreover, this problem has a unique saddle point corresponding to <math> (u^*, v^*) </math>.<br />
<br />
The QRE for two-player zero-sum games is solved with a primal-dual Newton method. The KKT conditions for the problem are:<br />
$$ <br />
Pv + \log(u) + 1 +\mu 1 = 0 \\<br />
P^T u -\log(v) -1 +\nu 1 = 0 \\<br />
1^T u = 1, \ 1^T v = 1, <br />
$$<br />
where <math> (\mu, \nu) </math> are Lagrange multipliers for the equality constraints on u, v respectively. Applying Newton's method then gives the update rule:<br />
$$<br />
Q \begin{bmatrix} \Delta u \\ \Delta v \\ \Delta \mu \\ \Delta \nu \\ \end{bmatrix} = - \begin{bmatrix} P v + \log u + 1 + \mu 1 \\ P^T u - \log v - 1 + \nu 1 \\ 1^T u - 1 \\ 1^T v - 1 \\ \end{bmatrix}, \qquad (2)<br />
$$<br />
where Q is the Hessian of the Lagrangian, given by <br />
$$ <br />
Q = \begin{bmatrix} <br />
diag(\frac{1}{u}) & P & 1 & 0 \\ <br />
P^T & -diag(\frac{1}{v}) & 0 & 1\\<br />
1^T & 0 & 0 & 0 \\<br />
0 & 1^T & 0 & 0 \\<br />
\end{bmatrix}. <br />
$$<br />
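The Newton update in (2) can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the implementation of [1]: all function names are hypothetical, and the backtracking line search on the residual norm (together with the positivity check on u and v) is added here for numerical robustness.<br />

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def kkt_residual(P, u, v, mu, nu):
    """Stack the four KKT blocks (the right-hand side of (2), without the sign)."""
    n, m = len(u), len(v)
    r_u = [sum(P[i][j] * v[j] for j in range(m)) + math.log(u[i]) + 1 + mu
           for i in range(n)]
    r_v = [sum(P[i][j] * u[i] for i in range(n)) - math.log(v[j]) - 1 + nu
           for j in range(m)]
    return r_u + r_v + [sum(u) - 1, sum(v) - 1]

def kkt_matrix(P, u, v):
    """Build the matrix Q defined in the text."""
    n, m = len(u), len(v)
    size = n + m + 2
    Q = [[0.0] * size for _ in range(size)]
    for i in range(n):
        Q[i][i] = 1 / u[i]
        Q[i][n + m] = Q[n + m][i] = 1.0
        for j in range(m):
            Q[i][n + j] = Q[n + j][i] = P[i][j]
    for j in range(m):
        Q[n + j][n + j] = -1 / v[j]
        Q[n + j][n + m + 1] = Q[n + m + 1][n + j] = 1.0
    return Q

def norm(r):
    return math.sqrt(sum(x * x for x in r))

def solve_qre(P, tol=1e-10, max_iter=100):
    """Damped Newton iteration on the KKT system; returns (u, v)."""
    n, m = len(P), len(P[0])
    u, v, mu, nu = [1.0 / n] * n, [1.0 / m] * m, 0.0, 0.0
    for _ in range(max_iter):
        r = kkt_residual(P, u, v, mu, nu)
        if norm(r) < tol:
            break
        step = solve_linear(kkt_matrix(P, u, v), [-x for x in r])
        du, dv, dmu, dnu = step[:n], step[n:n + m], step[n + m], step[n + m + 1]
        alpha = 1.0
        while alpha > 1e-12:  # shrink the step until u, v stay positive and the residual drops
            u_try = [a + alpha * d for a, d in zip(u, du)]
            v_try = [a + alpha * d for a, d in zip(v, dv)]
            if min(u_try) > 0 and min(v_try) > 0:
                r_try = kkt_residual(P, u_try, v_try, mu + alpha * dmu, nu + alpha * dnu)
                if norm(r_try) <= (1 - 0.1 * alpha) * norm(r):
                    break
            alpha *= 0.5
        u = [a + alpha * d for a, d in zip(u, du)]
        v = [a + alpha * d for a, d in zip(v, dv)]
        mu, nu = mu + alpha * dmu, nu + alpha * dnu
    return u, v

u, v = solve_qre([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(u, v)  # both strategies are close to uniform for standard Rock, Paper, Scissors
```

A dense solve is used for clarity; it ignores the block structure of Q, which a practical implementation would likely exploit.<br />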
<br />
'''Differentiating Through QRE Solutions''':<br />
The QRE solver provides a method to compute the necessary Jacobian-vector products. Specifically, we compute the gradient of the loss given the solution <math> (u^*,v^*) </math> to the QRE, and some loss function <math> L(u^*,v^*) </math>: <br />
<br />
1. Take differentials of the KKT conditions: <br />
<math><br />
Q \begin{bmatrix} <br />
du & dv & d\mu & d\nu \\ <br />
\end{bmatrix} ^T = \begin{bmatrix} <br />
-dPv & -dP^Tu & 0 & 0 \\ <br />
\end{bmatrix}^T. \ <br />
</math><br />
<br />
2. For small changes du, dv, <br />
<math><br />
dL = \begin{bmatrix} <br />
v^TdP^T & u^TdP & 0 & 0 \\ <br />
\end{bmatrix} Q^{-1} \begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
3. Apply this to P, and take limits as dP is small:<br />
<math><br />
\nabla_P L = y_u v^T + u y_v^T, \qquad (3)<br />
</math> where <br />
<math><br />
\begin{bmatrix} <br />
y_u & y_v & y_{\mu} & y_{\nu}\\ <br />
\end{bmatrix}^T=Q^{-1}\begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
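<br />
To see why (3) follows from step 2, expand the two scalar terms entry by entry. This is a short worked step that uses only the symbols defined above:<br />
$$<br />
dL = v^T dP^T y_u + u^T dP \, y_v = \sum_{i,j} dP_{ij} \left( (y_u)_i v_j + u_i (y_v)_j \right), <br />
$$<br />
so the coefficient of <math> dP_{ij} </math> is the <math> (i,j) </math> entry of <math> y_u v^T + u y_v^T </math>, which is the gradient stated in (3).<br />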
<br />
Hence, the forward pass is given by using the expression in (2) to solve for the logit equilibrium given P, and the backward pass is given by using <math> \nabla_u L </math> and <math> \nabla_v L </math> to obtain <math> \nabla_P L </math> using (3). There does not always exist a unique P which generates <math> u^*, v^* </math> under the logit QRE, and we cannot expect to recover P when under-constrained.<br />
<br />
== Learning Extensive form games ==<br />
<br />
The normal form representation quickly becomes intractable for games where players have many choices. For example, consider a chess game: on the first turn, player 1 has 20 possible moves and then player 2 has 20 possible responses. If each player is estimated to have ~30 possible moves on each of the following turns and a typical game lasts 40 moves per player, the total number of strategies is roughly <math>10^{120} </math> per player (this is known as the Shannon number for game-tree complexity of chess), so the payoff matrix for a typical game of chess must have <math> O(10^{240}) </math> entries.<br />
<br />
Instead, it is much more useful to represent the game graphically as an "'''Extensive form game'''" (EFG). We'll also need to consider types of games where there is '''imperfect information''': players do not necessarily have access to the full state of the game. An example of this is one-card poker: (1) each player draws a single card from a 13-card deck (ignore suits); (2) Player 1 decides whether to raise/hold; (3) Player 2 decides whether to call/raise; (4) Player 1 must either call/fold if Player 2 raised. From this description, player 1 has <math> 2^{13} </math> possible first moves (all combinations of (card, raise/hold)) and has <math> 2^{13} </math> possible second moves (whenever player 1 gets a second move) for a total of <math> 2^{26} </math> possible strategies. In addition, player 1 never knows which card player 2 holds and vice versa. So instead of representing the game with a huge payoff matrix, we can represent it as a simple decision tree (for a ''single'' drawn card of player 1):<br />
<br />
<br />
<center> [[File:1cardpoker.PNG]] </center><br />
<br />
where player 1 is represented by "1", a node that has two branches corresponding to the allowed moves of player 1. However, there must also be a notion of the information available to either player: while this tree might correspond to, say, player 1 holding a "9", it contains no information on what card player 2 is holding (and is much simpler because of this). This leads to the definition of an '''information set''': the set of all nodes belonging to a single player for which the other player cannot distinguish which node has been reached. The information set may therefore be treated as a node itself, for which actions stemming from the node must be chosen in ignorance of what the other player did immediately before arriving at the node. In the poker example, the full game tree is a much more complex version of the tree shown above (containing repetitions of the given tree for every possible combination of cards dealt), and an example of an information set for player 1 is the set of all nodes owned by player 2 that immediately follow player 1's decision to hold. In other words, if player 1 holds, there are 13 possible nodes describing the responses of player 2 (raise/hold for player 2 having card = ace, 2, ..., King); all 13 of these nodes are indistinguishable to player 1, and so form an information set for player 1.<br />
<br />
The following is a review of important concepts for extensive form games first formalized in [2]. Let <math> \mathcal{I}_i </math> be the set of all information sets for player i, and for each <math> t \in \mathcal{I}_i </math> let <math> \sigma_t </math> be the sequence of actions taken by player i to arrive at <math> t </math> and <math> C_t </math> be the actions that player i can take from <math> t </math>. Then the set of all possible sequences that can be taken by player i is given by<br />
<br />
$$<br />
S_i = \{\emptyset \} \cup \{ \sigma_t c \mid t \in \mathcal{I}_i, c \in C_t \}<br />
$$<br />
<br />
So for the one-card poker we would have <math>S_1 = \{\emptyset, \text{raise}, \text{hold}, \text{hold-call}, \text{hold-fold}\}</math>. From the possible sequences follow two important concepts:<br />
<ol><br />
<li>The EFG '''payoff matrix''' <math> P </math> is size <math>|S_1| \times |S_2| </math> (this is all possible actions available to either player), is populated with rewards from each leaf of the tree (or "zero" for each <math> (s_1, s_2) </math> that is an invalid pair), and the expected payoff for realization plans <math> (u, v) </math> is given by <math> u^T P v </math> </li><br />
<li> A '''realization plan''' <math> u \in \mathbb{R}^{|S_1|} </math> for player 1 (<math> v \in \mathbb{R}^{|S_2|} </math> for player 2 ) will describe probabilities for players to carry out each possible sequence, and each realization plan must be constrained by (i) compatibility of sequences (e.g. "raise" is not compatible with "hold-call") and (ii) information sets available to the player. These constraints are linear:<br />
<br />
$$<br />
Eu = e \\<br />
Fv = f<br />
$$<br />
<br />
where <math> e = f = (1, 0, ..., 0)^T </math> and <math> E, F</math> contain entries in <math> \{-1, 0, 1\} </math> describing compatibility and information sets. </li><br />
<br />
</ol> <br />
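The sequence set and the linear constraint on realization plans described above can be sketched for player 1 of the single-card poker tree. This is a minimal Python illustration: the encoding of information sets as pairs of a preceding sequence and the available actions, and all function names, are assumptions made here rather than notation from [1] or [2].<br />

```python
def build_sequences(info_sets):
    """Return S_i: the empty sequence plus sigma_t followed by each action c in C_t."""
    sequences = [()]
    for sigma, actions in info_sets:
        for c in actions:
            sequences.append(sigma + (c,))
    return sequences

def build_constraints(info_sets, sequences):
    """Build E and e so that E u = e holds for every valid realization plan u."""
    index = {s: k for k, s in enumerate(sequences)}
    first_row = [0] * len(sequences)
    first_row[index[()]] = 1                 # the empty sequence has probability 1
    E = [first_row]
    for sigma, actions in info_sets:
        row = [0] * len(sequences)
        row[index[sigma]] = -1               # probability of arriving at the information set
        for c in actions:
            row[index[sigma + (c,)]] = 1     # is split among the actions available there
        E.append(row)
    e = [1] + [0] * len(info_sets)
    return E, e

# Player 1, single drawn card: (sigma_t, C_t) for each information set.
P1_INFO_SETS = [((), ("raise", "hold")), (("hold",), ("call", "fold"))]
S1 = build_sequences(P1_INFO_SETS)
E, e = build_constraints(P1_INFO_SETS, S1)
print(S1)
print(E, e)
```

Every entry of E built this way lies in <math> \{-1, 0, 1\} </math>, and a realization plan derived from any behaviour strategy (for example, raise with probability 0.25, then call with probability 0.5 after holding) satisfies the constraint.<br />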
<br />
<br />
The paper's main contribution is to develop a minmax problem for extensive form games:<br />
<br />
<br />
$$<br />
\min_u \max_v u^T P v + \sum_{t\in \mathcal{I}_1} \sum_{c \in C_t} u_c \log \frac{u_c}{u_{p_t}} - \sum_{t\in \mathcal{I}_2} \sum_{c \in C_t} v_c \log \frac{v_c}{v_{p_t}}<br />
$$<br />
<br />
where <math> p_t </math> is the action immediately preceding information set <math> t </math>. Intuitively, each sum resembles a cross entropy over the distribution of probabilities in the realization plan, comparing each probability to proceed from an information set to the probability to arrive at that information set. Importantly, these entropies are strictly convex or concave (for player 1 and player 2 respectively) [3], so that the minmax problem has a unique solution and ''the objective function is continuous and continuously differentiable''. This means that derivative-based methods such as Newton's method can be used to optimize it. As noted in Theorem 1 of [1], the solution to this problem is equivalently a solution for the QRE of the game in reduced normal form.<br />
<br />
Having decided on a cost function, the method of Lagrange multipliers may be used to construct the Lagrangian that encodes the known constraints (<math> Eu = e \,, Fv = f </math>, and <math> u, v \geq 0</math>), which is then optimized using Newton's method (as in the previous section). Accounting for the constraints, the Lagrangian becomes <br />
<br />
<br />
$$<br />
\mathcal{L} = g(u, v) + \sum_i \mu_i(Eu - e)_i + \sum_i \nu_i (Fv - f)_i<br />
$$<br />
<br />
where <math>g</math> is the argument from the minmax statement above and <math>u, v \geq 0</math> become KKT conditions. The general update rule for Newton's method may be written in terms of the derivatives of <math> \mathcal{L} </math> with respect to primal variables <math>u, v </math> and dual variables <math> \mu, \nu</math>, yielding:<br />
<br />
$$<br />
\nabla_{u,v,\mu,\nu}^2 \mathcal{L} \cdot (\Delta u, \Delta v, \Delta \mu, \Delta \nu)^T= - \nabla_{u,v,\mu,\nu} \mathcal{L}<br />
$$<br />
where <math>\nabla_{u,v,\mu,\nu}^2 \mathcal{L} </math> is the Hessian of the Lagrangian and <math>\nabla_{u,v,\mu,\nu} \mathcal{L} </math> is simply a column vector of the KKT stationarity conditions. Combined with the previous section, this completes the goal of the paper: To construct a differentiable problem for learning normal form and extensive form games.<br />
<br />
== Experiments ==<br />
<br />
The authors demonstrated learning on extensive form games in the presence of ''side information'', with ''partial observations'', using three experiments. In all cases, the goal was to maximize the likelihood of realizing an observed sequence from the player, assuming they act in accordance with the QRE.<br />
<br />
=== Rock, Paper, Scissors ===<br />
<br />
For standard Rock, Paper, Scissors, the strategy at both the Nash equilibrium and the quantal response equilibrium is to play each hand with equal probability.<br />
The first experiment was to learn a non-symmetric variant of Rock, Paper, Scissors with ''incomplete information'' with the following payoff matrix:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix of modified Rock-Paper-Scissors''<br />
! <br />
! ''Rock''<br />
! ''Paper''<br />
! ''Scissors''<br />
|-<br />
! ''Rock''<br />
| '''''0'''''<br />
| <math>-b_1</math><br />
| <math>b_2</math><br />
|-<br />
! ''Paper''<br />
| <math>b_1</math><br />
| '''''0'''''<br />
| <math>-b_3</math><br />
|-<br />
! ''Scissors''<br />
| <math>-b_2</math><br />
| <math>b_3</math><br />
| '''''0'''''<br />
|}<br />
<br />
where each <math> b_y </math> is a linear function of some features <math> x \in \mathbb{R}^{2} </math> (i.e., <math> b_y = x^Tw_y </math>, <math> y \in \{1,2,3\} </math>, where the <math> w_y </math> are to be learned by the algorithm). Using many trials of random rewards, the technique produced the following results for optimal strategies [1]: <br />
<br />
[[File:RPS Results.png|500px ]]<br />
<br />
From the graphs above, we can tell that 1) both the learned parameters and the predicted strategies improve with a larger dataset; and 2) with a reasonably sized dataset (more than 1000 samples here), convergence is stable and fairly quick.<br />
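<br />
The feature-dependent payoff matrix of the modified game can be assembled as in the sketch below. It is a minimal Python illustration: the feature vector and weights are made-up values, and the function name is hypothetical.<br />

```python
def rps_payoff(x, W):
    """Payoff matrix of modified Rock-Paper-Scissors with b_y = x^T w_y."""
    b1, b2, b3 = (sum(xi * wi for xi, wi in zip(x, w)) for w in W)
    return [[0, -b1, b2],
            [b1, 0, -b3],
            [-b2, b3, 0]]

# Hypothetical features and weights, chosen only for illustration.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P = rps_payoff(x, W)
print(P)
```

In the learning setting only the weights are unknown; each observed context x then yields its own payoff matrix, which is passed to the QRE solver.<br />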
<br />
=== One-Card Poker ===<br />
<br />
Next, they investigated extensive form games using the one-card poker game (with ''imperfect information'') introduced in the previous section. In the experimental setup, they used a deck stacked non-uniformly (meaning repeat cards were allowed). Their goal was to learn this distribution of cards from observations of many rounds of play. Note that the players' perceived or believed distribution of cards may differ from the distribution of cards actually dealt. Three experiments were run with <math> n=4 </math>. Each experiment comprised 5 runs of training, with the same weights but different training sets. Let <math> d \in \mathbb{R}^{n}, d \ge 0, \sum_{i} d_i = 1 </math> be the weights of the cards. The probability that the players are dealt cards <math> (i,j) </math> is <math> \frac{d_i d_j}{1-d_i} </math>. This distribution is asymmetric between players. The matrices <math> P, E, F </math> for the case <math> n=4 </math> are presented in [1]. After training for 2500 epochs, the mean squared errors of the learned parameters (card weights, <math> u, v </math>), averaged over all runs, are presented below [1]: <br />
<br />
<br />
[[File:One-card_Poker_Results.png|500px ]]<br />
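<br />
The dealing probability used in this experiment can be tabulated as in the following Python sketch. It assumes that the two players receive different cards (<math> i \ne j </math>), which is the case in which the stated formula sums to one; the weights are made-up values and the function name is hypothetical.<br />

```python
def deal_distribution(d):
    """Joint probability that player 1 is dealt card i and player 2 card j (i != j)."""
    n = len(d)
    return [[0.0 if i == j else d[i] * d[j] / (1 - d[i]) for j in range(n)]
            for i in range(n)]

d = [0.1, 0.2, 0.3, 0.4]  # hypothetical card weights for n = 4
joint = deal_distribution(d)
total = sum(sum(row) for row in joint)
print(total)                      # close to 1, up to rounding
print(joint[0][1], joint[1][0])   # the two values differ: the distribution is asymmetric
```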
<br />
=== Security Resource Allocation Game ===<br />
<br />
<br />
In the Security Resource Allocation Game, they demonstrated the ability to learn from ''imperfect observations''. The defender possesses <math> k </math> indistinguishable and indivisible defensive resources, which are split among <math> n </math> targets, <math> \{T_1, \ldots, T_n\} </math>. The attacker chooses one target. If the attack succeeds, the attacker gets a reward <math> R_i </math> and the defender gets <math> -R_i </math>; otherwise both receive zero payoff. If <math> \#D_i </math> defenders guard <math> T_i </math>, the probability of a successful attack on <math> T_i </math> is <math> \frac{1}{2^{\#D_i}} </math>. The expected payoff matrix when <math> n = 2, k = 3 </math>, where the attacker is the row player, is:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix when <math> n = 2, k = 3 </math>''<br />
! {#<math>D_1</math>,#<math>D_2</math>}<br />
! {0, 3}<br />
! {1, 2}<br />
! {2, 1}<br />
! {3, 0}<br />
|-<br />
! <math>T_1</math><br />
| <math>-R_1</math><br />
| <math>-\frac{1}{2}R_1</math><br />
| <math>-\frac{1}{4}R_1</math><br />
| <math>-\frac{1}{8}R_1</math><br />
|-<br />
! <math>T_2</math><br />
| <math>-\frac{1}{8}R_2</math><br />
| <math>-\frac{1}{4}R_2</math><br />
| <math>-\frac{1}{2}R_2</math><br />
| <math>-R_2</math><br />
|} <br />
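<br />
The table above can be generated for any number of targets and resources. The sketch below is a minimal Python illustration with hypothetical function names; the rewards are set to 1 only so that the output can be compared directly with the coefficients in the table.<br />

```python
def allocations(k, n):
    """All ways to split k indistinguishable resources among n targets."""
    if n == 1:
        return [(k,)]
    return [(first,) + rest
            for first in range(k + 1)
            for rest in allocations(k - first, n - 1)]

def security_payoff(rewards, k):
    """Rows are attacked targets, columns are defender allocations."""
    cols = allocations(k, len(rewards))
    matrix = [[-R / 2 ** alloc[i] for alloc in cols]
              for i, R in enumerate(rewards)]
    return cols, matrix

cols, P = security_payoff([1.0, 1.0], k=3)
print(cols)
print(P)
```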
<br />
<br />
For a multi-stage game, the attacker can launch <math> t </math> attacks, one in each stage, while the defender must keep the allocation chosen in stage 1. The attacker may change target if the attack in stage 1 fails. Three experiments are run with <math> n = 2, k = 5 </math> for games with a single attack and a double attack, i.e., <math> t = 1 </math> and <math> t = 2 </math>. The results of the simulated experiments are shown below [1]:<br />
<br />
[[File:Security Game Results.png|500px ]]<br />
<br />
<br />
They learned <math> R_i </math> based only on observations of the defender’s actions and could still recover the game setting. As expected, a larger dataset improves the learned parameters. Two outliers are 1) the Security Game, in the green plot for <math> t = 2 </math>; and 2) RPS, when comparing training sizes of 2000 and 5000.<br />
<br />
== Conclusion ==<br />
Unsurprisingly, the results of this study show that in general the quality of learned parameters improved as the number of observations increased. The network presented in this paper demonstrated improvement over existing methodology. <br />
<br />
This paper presents an end-to-end framework for implementing a game solver, for both extensive and normal form, as a module in a deep neural network for zero-sum games. This method, unlike many previous works in this area, does not require the parameters of the game to be known to the agent prior to the start of the game. The two-part method analytically computes both the optimal solution and the parameters of the game. Future work involves taking advantage of the KKT matrix structure to increase computation speed, and extensions to the area of learning general-sum games.<br />
<br />
== Critiques ==<br />
The proposed method appears to suffer from two flaws. First, the assumption that players behave in accordance with the QRE severely limits the space of player strategies, and is known to exhibit pathological behaviour even in one-player settings. Second, the solvers are computationally inefficient and are unable to scale.<br />
<br />
== References ==<br />
<br />
[1] Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.<br />
<br />
[2] von Stengel, B. (1996). Efficient computation of behavior strategies. Games and Economic Behavior, 14(2), 220–246.<br />
<br />
[3] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_Classification&diff=45506Streaming Bayesian Inference for Crowdsourced Classification2020-11-21T17:18:20Z<p>Hhalim: /* Empirical Analysis */</p>
<hr />
<div>Group 4 Paper Presentation Summary<br />
<br />
By Jonathan Chow, Nyle Dharani, Ildar Nasirov<br />
<br />
== Motivation ==<br />
Crowdsourcing can be a useful tool for data generation in classification projects. Often this takes the form of online questions which many respondents will manually answer for payment. One example of this is Amazon's Mechanical Turk. In theory, it is effective in processing high volumes of small tasks that would be expensive to achieve otherwise.<br />
<br />
The primary limitation with this form of acquiring data is that respondents are liable to submit incorrect responses. This results in datasets that are noisy and unreliable.<br />
<br />
The integrity of the data is then limited by how well the ground truth can be determined. The primary method for doing so is probabilistic inference. However, current methods are computationally expensive, lack theoretical guarantees, or are limited to specific settings.<br />
<br />
== Dawid-Skene Model for Crowdsourcing ==<br />
The one-coin Dawid-Skene model is popular for contextualizing crowdsourcing problems. For task <math>i</math> in set <math>M</math>, let the ground truth be the binary label <math>y_i \in \{\pm 1\}</math>. We get labels <math>X = \{x_{ij}\}</math>, where <math>j \in N</math> is the index of the worker.<br />
<br />
At each time step <math>t</math>, a worker <math>j = a(t) </math> provides the label <math>x_{ij} \in \{\pm 1\}</math> for an assigned task <math>i</math>. We denote responses up to time <math>t</math> via superscript.<br />
<br />
We let <math>x_{ij} = 0</math> if worker <math>j</math> has not completed task <math>i</math>. We assume that <math>P(x_{ij} = y_i) = p_j</math>. This implies that each worker is independent and has equal probability of correct labelling regardless of task. In crowdsourcing the data, we must determine how workers are assigned to tasks. We introduce two methods.<br />
<br />
Under uniform sampling, workers are allocated to tasks such that each task is completed by the same number of workers, rounded to the nearest integer, and no worker completes a task more than once. This policy is given by <center><math>\pi_{uni}(t) = argmin_{i \notin M_{a(t)}^t}\{ | N_i^t | \}.</math></center><br />
<br />
Under uncertainty sampling, we assign more workers to tasks that are less certain. Assuming we are able to estimate the posterior probability of the ground truth, we can allocate workers to the task with the lowest probability of falling into the predicted class. This policy is given by <center><math>\pi_{us}(t) = argmin_{i \notin M_{a(t)}^t}\{ max_{k \in \{\pm 1\}} ( P(y_i = k | X^t) ) \}.</math></center><br />
<br />
We then need to aggregate the data. The simple method of majority voting makes predictions for a given task based on the class the most workers have assigned it, <math>\hat{y}_i = \text{sign}\{\sum_{j \in N_i} x_{ij}\}</math>.<br />
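<br />
The two assignment policies and majority voting can be sketched in Python as follows. This is a minimal illustration with hypothetical function names; labels are assumed to arrive as (task, worker, label) triples, and ties in the vote are broken towards +1, which is an arbitrary choice made here.<br />

```python
def uniform_policy(counts, done_by_worker):
    """pi_uni: the task with the fewest labels that this worker has not done yet."""
    candidates = [i for i in range(len(counts)) if i not in done_by_worker]
    return min(candidates, key=lambda i: counts[i])

def uncertainty_policy(posterior_pos, done_by_worker):
    """pi_us: the unseen task whose most likely class has the lowest posterior probability.

    posterior_pos[i] is an estimate of P(y_i = +1 | X^t).
    """
    candidates = [i for i in range(len(posterior_pos)) if i not in done_by_worker]
    return min(candidates, key=lambda i: max(posterior_pos[i], 1 - posterior_pos[i]))

def majority_vote(labels, num_tasks):
    """Predict sign(sum of labels) per task; labels are (task, worker, x) with x in {-1, +1}."""
    totals = [0] * num_tasks
    for task, _worker, x in labels:
        totals[task] += x
    return [1 if s >= 0 else -1 for s in totals]  # ties go to +1 (arbitrary)

labels = [(0, 0, 1), (0, 1, 1), (0, 2, -1), (1, 0, -1), (1, 1, -1)]
print(majority_vote(labels, 2))
```

Both policies assume that at least one task remains that the worker has not labelled; otherwise `min` raises an error.<br />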
<br />
== Streaming Bayesian Inference for Crowdsourced Classification (SBIC) ==<br />
The aim of the SBIC algorithm is to estimate the posterior probability, <math>P(y, p | X^t, \theta)</math> where <math>X^t</math> are the observed responses at time <math>t</math> and <math>\theta</math> is our prior. We can then generate predictions <math>\hat{y}^t</math> as the marginal probability over each <math>y_i</math> given <math>X^t</math>, and <math>\theta</math>.<br />
<br />
We factor <math>P(y, p | X^t, \theta) \approx \prod_{i \in M} \mu_i^t (y_i) \prod_{j \in N} \nu_j^t (p_j) </math> where <math>\mu_i^t</math> corresponds to each task and <math>\nu_j^t</math> to each worker.<br />
<br />
We then sequentially optimize the factors <math>\mu^t</math> and <math>\nu^t</math>. We begin by assuming that the worker accuracy follows a beta distribution with parameters <math>\alpha</math> and <math>\beta</math>. We initialize the task factors <math>\mu_i^0(+1) = q</math> and <math>\mu_i^0(-1) = 1 - q</math> for all <math>i</math>.<br />
<br />
When a new label is observed at time <math>t</math>, we update the <math>\nu_j^t</math> of worker <math>j</math>. We then update <math>\mu_i</math>. These updates are given by<br />
<br />
<center><math>\nu_j^t(p_j) \sim \text{Beta}(\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha, \sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(-x_{ij}) + \beta) </math></center><br />
<br />
<center><math>\mu_i^t(y_i) \propto \begin{cases} \mu_i^{t - 1}(y_i)\overline{p}_j^t & x_{ij} = y_i \\ \mu_i^{t - 1}(y_i)(1 - \overline{p}_j^t) & x_{ij} \ne y_i \end{cases}</math></center><br />
where <math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta }</math>.<br />
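<br />
This estimate of worker accuracy, written <math>\overline{p}_j^t</math> in the task update, has the form of the mean of the Beta factor <math>\nu_j^t</math>. As a short worked step, write <math>a</math> and <math>b</math> for the two Beta parameters (symbols introduced here only for this step) and use <math>\mu_i^{t-1}(x_{ij}) + \mu_i^{t-1}(-x_{ij}) = 1</math> for every task:<br />
<center><math>\overline{p}_j^t = \frac{a}{a + b}, \qquad a + b = \sum_{i \in M_j^{t - 1}} \left( \mu_i^{t - 1}(x_{ij}) + \mu_i^{t - 1}(-x_{ij}) \right) + \alpha + \beta = |M_j^{t - 1}| + \alpha + \beta.</math></center><br />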
<br />
We choose our prediction for task <math>i</math> to be the class <math>k \in \{-1,1\}</math> that maximizes <math>\mu_i^t(k) </math>.<br />
<br />
Depending on the order in which the labels <math>X</math> are processed, we obtain variants of the algorithm suited to different applications.<br />
<br />
== Fast SBIC ==<br />
The pseudocode for Fast SBIC is shown below.<br />
<br />
<center>[[Image:FastSBIC.png|800px|]]</center><br />
<br />
As the name implies, the goal of this algorithm is speed. To facilitate this, we leave the order of <math>X</math> unchanged.<br />
<br />
We express <math>\mu_i^t</math> in terms of its log-odds<br />
<center><math>z_i^t = \log(\frac{\mu_i^t(+1)}{ \mu_i^t(-1)}) = z_i^{t - 1} + x_{ij} \log(\frac{\overline{p}_j^t}{1 - \overline{p}_j^t })</math></center><br />
where <math>z_i^0 = \log(\frac{q}{1 - q})</math>.<br />
<br />
The product chain then becomes a summation and removes the need to normalize each <math>\mu_i^t</math>. We use these log-odds to compute worker accuracy,<br />
<br />
<center><math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \sigma(x_{ij} z_i^{t-1}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta}</math></center><br />
where <math>\sigma(z_i^{t-1}) := \frac{1}{1 + exp(-z_i^{t - 1})} = \mu_i^{t - 1}(+1) </math><br />
<br />
The final predictions are made by choosing class <math>\hat{y}_i^T = \text{sign}(z_i^T) </math>. We see later that Fast SBIC has similar computational speed to majority voting.<br />
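<br />
The log-odds updates above translate into a short streaming loop. The following Python sketch is one reading of those formulas rather than a transcription of the authors' pseudocode: the function names and the default values of <math>\alpha</math>, <math>\beta</math>, and <math>q</math> are assumptions chosen for illustration.<br />

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fast_sbic(labels, num_tasks, num_workers, alpha=2.0, beta=1.0, q=0.5):
    """Process labels (task i, worker j, x in {-1, +1}) in arrival order.

    Returns the final log-odds z and the predictions sign(z).
    """
    z = [math.log(q / (1.0 - q))] * num_tasks
    history = [[] for _ in range(num_workers)]  # past (task, label) pairs per worker
    for i, j, x in labels:
        # Estimated accuracy of worker j from the labels given before this one.
        agreement = sum(sigmoid(x_old * z[i_old]) for i_old, x_old in history[j])
        p_bar = (agreement + alpha) / (len(history[j]) + alpha + beta)
        # Log-odds update for task i.
        z[i] += x * math.log(p_bar / (1.0 - p_bar))
        history[j].append((i, x))
    predictions = [1 if zi >= 0 else -1 for zi in z]  # ties go to +1 (arbitrary)
    return z, predictions

z, predictions = fast_sbic([(0, 0, 1), (1, 0, -1)], num_tasks=2, num_workers=1)
print(z, predictions)
```

Recomputing the agreement sum for every label keeps the sketch close to the formulas; an implementation aiming for speed would likely maintain it incrementally.<br />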
<br />
== Sorted SBIC ==<br />
To increase the accuracy of the SBIC algorithm at the cost of computational efficiency, we run the algorithm in parallel, processing the labels in different orders. The pseudocode for this algorithm is given below.<br />
<br />
<center>[[Image:SortedSBIC.png|800px|]]</center><br />
<br />
From the general discussion of SBIC, we know that predictions on task <math>i</math> are more accurate toward the end of the collection process. This is a result of observing more data points and having run more updates on <math>\mu_i^t</math> and <math>\nu_j^t</math> to move them further from their prior. This means that task <math>i</math> is predicted more accurately when its corresponding labels are seen closer to the end of the process.<br />
<br />
We take advantage of this property by maintaining a distinct “view” of the log-odds for each task. When a label is observed, we update views for all tasks except the one for which the label was observed. At the end of the collection process, we process skipped labels. When run online, this process must be repeated at every timestep.<br />
<br />
We see that Sorted SBIC is slower than Fast SBIC by a factor of M, the number of tasks. However, we can reduce the complexity by viewing <math>s^k</math> across different tasks in an offline setting when the whole dataset is known in advance.<br />
<br />
== Theoretical Analysis ==<br />
The authors prove an exponential relationship between the error probability and the number of labels per task. The two theorems, for the different sampling regimes, are presented below.<br />
<br />
<center>[[Image:Theorem1.png|800px|]]</center><br />
<br />
<center>[[Image:Theorem2.png|800px|]]</center><br />
<br />
== Empirical Analysis ==<br />
The purpose of the empirical analysis is to compare SBIC to the existing state-of-the-art algorithms. The SBIC algorithm is run on five real-world binary classification datasets. The results can be found in the table below. The other algorithms in the comparison are, from left to right, majority voting, expectation-maximization, mean-field, belief-propagation, Monte-Carlo sampling, and triangular estimation. <br />
<br />
First of all, the algorithms are run on synthetic data that meets the assumptions of an underlying one-coin Dawid-Skene model, which allows the authors to compare SBIC's empirical performance with the theoretical results shown previously. <br />
<br />
<center>[[Image:RealWorldResults.png|800px|]]</center><br />
<br />
In bold are the best performing algorithms for each dataset. We see that both versions of the SBIC algorithm are competitive, having similar prediction errors to EM, AMF, and MC. All are considered state-of-the-art Bayesian algorithms.<br />
<br />
The figure below shows the average time required to simulate predictions on synthetic data under an uncertainty sampling policy. We see that Fast SBIC is comparable to majority voting and significantly faster than the other algorithms. This speed improvement, coupled with comparable accuracy, makes the Fast SBIC algorithm powerful.<br />
<br />
<center>[[Image:TimeRequirement.png|800px|]]</center><br />
<br />
== Conclusion and Future Research ==<br />
In conclusion, we have seen that SBIC is computationally efficient, accurate in practice, and has theoretical guarantees. The authors intend to extend the algorithm to the multi-class case in the future.<br />
<br />
== Critique ==<br />
In crowdsourcing data, the cost associated with collecting additional labels is not usually prohibitively expensive. As a result, if there is concern over ground-truth, paying for additional data to ensure <math>X</math> is sufficiently dense may be the desired response as opposed to sacrificing ground-truth accuracy. This may result in the SBIC algorithm being less practically useful than intended.<br />
<br />
The paper tackles the classic problem of aggregating labels in a crowdsourced application, with a focus on speed. The proposed algorithms are fast, simple to implement, and come with theoretical bounds on the error rate. However, the paper sets out to design fast label aggregation algorithms for a streaming setting, yet it does not spend any time motivating the applications in which such algorithms are needed. All the datasets used in the empirical analysis are static, so for the paper to be useful, the streaming problem should be better motivated. It also appears that the output of the algorithm depends on the order in which the data are processed, which should be clarified. Finally, the theoretical results are presented under the assumption that the predictions of the FBI converge to the ground truth; however, the reasoning behind this assumption is not explained.<br />
<br />
The paper assumes that crowdsourced human responses are systematic: that is, respondents behave in similar ways that can be grouped into a few categories. Many other factors need to be considered for human respondents, such as fatigue effects and conflicts of interest. These factors could seriously jeopardize the validity of the results and the model if they are not carefully accounted for. For example, if a formerly accurate worker performs badly one day and generates a lot of faulty data, it would take many correct votes to even out the effect. Even in medical experiments involving human subjects, with rigorous standards and procedures, the results can still be invalid. Trading validity for speed is not wise.<br />
<br />
== References ==<br />
[1] Manino, Tran-Thanh, and Jennings. Streaming Bayesian Inference for Crowdsourced Classification. 33rd Conference on Neural Information Processing Systems, 2019</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Gtompkin&diff=45431User:Gtompkin2020-11-20T22:12:36Z<p>Hhalim: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Grace Tompkins, Tatiana Krikella, Swaleh Hussain<br />
<br />
== Introduction ==<br />
<br />
One of the fundamental challenges in machine learning and data science is dealing with missing and incomplete data. This paper proposes a theoretically justified methodology for using incomplete data in neural networks, eliminating the need for direct completion of the data by imputation or other methods commonly used in the existing literature. The authors propose identifying missing data points with a parametric density and then training it together with the rest of the network's parameters. To process this probabilistic representation, the neuron's response at the first hidden layer is generalized by taking its expected value. This amounts to calculating the average activation of the neuron over imputations drawn from the missing data's density. The proposed approach is advantageous because it can train neural networks on incomplete observations, which are ubiquitous in practice. It also requires only minimal modifications to existing architectures. The theoretical results of the paper show that this process does not lead to a loss of information, while the experimental results show the practical use of the methodology on several different types of networks.<br />
<br />
== Related Work ==<br />
<br />
Currently, dealing with incomplete inputs in machine learning requires filling absent attributes based on complete, observed data. Two commonly used methods are mean imputation and k-NN imputation. Other methods for dealing with missing data involve training separate neural networks, extreme learning machines, and <math>k</math>-nearest neighbours. Probabilistic models of incomplete data can also be built depending on the mechanism of missingness (i.e. whether the data is Missing At Random (MAR), Missing Completely At Random (MCAR), or Missing Not At Random (MNAR)), which can be fed into a particular learning model. Further, the decision function can also be trained using available/visible inputs alone. Previous work using neural networks for missing data includes a paper by Bengio and Gingras [1], where the authors used recurrent neural networks with feedback into the input units to fill absent attributes solely to minimize the learning criterion. Goodfellow et al. [2] also used neural networks by introducing a multi-prediction deep Boltzmann machine which could perform classification on data with missingness in the inputs.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
In this approach, the adaptation of a given neural network to incomplete data relies on two steps: the estimation of the missing data and the generalization of the neuron's activation. <br />
<br />
Let <math>(x,J)</math> represent a missing data point, where <math>x \in \mathbb{R}^D </math>, and <math>J \subset \{1,\ldots,D\} </math> is a set of attributes with missing data.<br />
<br />
For each missing point <math>(x,J)</math>, define an affine subspace consisting of all points which coincide with <math>x</math> on known coordinates <math>J'=\{1,\ldots,D\} \setminus J</math>: <br />
<br />
<center><math>S=Aff[x,J]=x+span(e_J) </math></center> <br />
where <math>e_J=[e_j]_{j\in J}</math> and <math>e_j</math> is the <math> j^{th}</math> canonical vector in <math>\mathbb{R}^D </math>.<br />
<br />
Assume that the missing data points come from the D-dimensional probability distribution, <math>F</math>. In their approach, the authors assume that the data points follow a mixture of Gaussians (GMM) with diagonal covariance matrices. By choosing diagonal covariance matrices, the number of model parameters is reduced. To model the missing points <math>(x,J)</math>, the density <math>F</math> is restricted to the affine subspace <math>S</math>. Thus, possible values of <math>(x,J)</math> are modelled using the conditional density <math>F_S: S \to \mathbb{R} </math>, <br />
<br />
<center><math>F_S(x) = \begin{cases}<br />
\frac{1}{\int_{S} F(s) \,ds}F(x) & \text{if $x \in S$,} \\<br />
0 & \text{otherwise.}<br />
\end{cases} </math></center><br />
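<br />
For a mixture with diagonal covariances, restricting <math>F</math> to <math>S</math> has a simple form: each component keeps its means and variances on the missing coordinates, and the mixture weights are rescaled by how well each component explains the observed coordinates. The following Python sketch illustrates this; the function name and the array layout are assumptions made for illustration and are not the authors' implementation.<br />
<br />
```python
import numpy as np


def conditional_diag_gmm(x, missing, weights, means, variances):
    """Conditional density F_S of a diagonal GMM given the observed coordinates of x.

    x         : (D,) data point; entries at missing positions are ignored
    missing   : (D,) boolean mask, True where the attribute is missing (the set J)
    weights   : (K,) mixture weights p_i
    means     : (K, D) component means m_i
    variances : (K, D) diagonal entries of the covariances Sigma_i
    Returns the weights, means and variances of F_S on the missing coordinates.
    """
    observed = ~missing
    diff = x[observed] - means[:, observed]
    var_obs = variances[:, observed]
    # Log-likelihood of the observed coordinates under each component.
    log_lik = -0.5 * np.sum(diff ** 2 / var_obs + np.log(2.0 * np.pi * var_obs), axis=1)
    log_w = np.log(weights) + log_lik
    log_w -= np.max(log_w)  # subtract the maximum for numerical stability
    new_weights = np.exp(log_w)
    new_weights /= new_weights.sum()
    return new_weights, means[:, missing], variances[:, missing]
```
<br />
The division by <math>\int_S F(s)\,ds</math> in the definition of <math>F_S</math> corresponds to the final renormalization of the weights.<br />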
<br />
To process the missing data by a neural network, the authors propose that only the first hidden layer needs modification. Specifically, they generalize the activation functions of all the neurons in the first hidden layer of the network to process the probability density functions representing the missing data points. For the conditional density function <math>F_S</math>, the authors define the generalized activation of a neuron <math>n: \mathbb{R}^D \to \mathbb{R}</math> on <math>F_S </math> as: <br />
<br />
<center><math>n(F_S)=E[n(x)|x \sim F_S]=\int n(x)F_S(x) \,dx</math>,</center> <br />
provided that the expectation exists. <br />
<br />
The following two theorems describe how to apply the above generalizations to both the ReLU and the RBF neurons, respectively. <br />
<br />
'''Theorem 3.1''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians. Given weights <math>w=(w_1, ..., w_D) \in \mathbb{R}^D,</math><math> b \in \mathbb{R} </math>, we have<br />
<br />
<center><math>\text{ReLU}_{w,b}(F)=\sum_i{p_i\sqrt{w^{\top}\Sigma_iw}\,NR\big(\frac{w^{\top}m_i+b}{\sqrt{w^{\top}\Sigma_iw}}\big)}</math></center> <br />
<br />
where <math>NR(x)=\text{ReLU}[N(x,1)]</math> is the expected ReLU response to a Gaussian with mean <math>x</math> and unit variance, and <math>\text{ReLU}_{w,b}(x)=\text{max}(w^{\top}x+b, 0)</math> with weights <math>w \in \mathbb{R}^D </math> and bias <math> b \in \mathbb{R}</math>.<br />
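<br />
The closed form can be sketched in Python for diagonal covariances. The sketch uses the standard identity <math>NR(a) = \phi(a) + a\Phi(a)</math> for the mean of a rectified unit-variance Gaussian, and it carries the scale factor <math>\sqrt{w^{\top}\Sigma_i w}</math> explicitly, which comes from standardizing <math>w^{\top}x+b</math>. The function names and array layout are illustrative assumptions.<br />
<br />
```python
import math

import numpy as np


def nr(a):
    """NR(a) = E[max(Z + a, 0)] for Z ~ N(0, 1), equal to phi(a) + a * Phi(a)."""
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return pdf + a * cdf


def expected_relu(w, b, weights, means, variances):
    """Expected value of max(w^T x + b, 0) when x follows a diagonal GMM.

    weights : (K,) mixture weights, means : (K, D), variances : (K, D).
    """
    total = 0.0
    for p, m, v in zip(weights, means, variances):
        mu = float(np.dot(w, m)) + b             # mean of w^T x + b
        s = math.sqrt(float(np.dot(w * w, v)))   # sqrt(w^T Sigma w) for diagonal Sigma
        if s == 0.0:
            total += p * max(mu, 0.0)            # degenerate component: a point mass
        else:
            total += p * s * nr(mu / s)
    return total
```
<br />
Averaging <math>\max(w^{\top}x+b, 0)</math> over a large number of samples drawn from the same mixture should approach the value returned by <code>expected_relu</code>, which gives a quick sanity check of the formula.<br />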
<br />
'''Theorem 3.2''' Let <math>F = \sum_i{p_iN(m_i, \Sigma_i)}</math> be the mixture of (possibly degenerate) Gaussians and let the RBF unit be parametrized by <math>N(c, \Gamma) </math>. We have: <br />
<br />
<center><math>\text{RBF}_{c, \Gamma}(F) = \sum_{i=1}^k{p_iN(m_i-c, \Gamma+\Sigma_i)}(0)</math>.</center> <br />
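<br />
The RBF case reduces to evaluating Gaussian densities at zero. A minimal Python sketch for diagonal covariances follows; the helper names are hypothetical and the diagonal restriction is a simplifying assumption.<br />
<br />
```python
import numpy as np


def diag_gaussian_pdf(x, mean, var):
    """Density at x of a Gaussian with the given mean and diagonal covariance."""
    return float(np.exp(-0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))))


def expected_rbf(c, gamma, weights, means, variances):
    """Expected response of an RBF unit N(c, Gamma) to a diagonal GMM input.

    c, gamma  : (D,) centre and diagonal of Gamma for the RBF unit
    weights   : (K,) mixture weights, means : (K, D), variances : (K, D)
    """
    zero = np.zeros_like(c, dtype=float)
    # Each term is the density of N(m_i - c, Gamma + Sigma_i) evaluated at 0.
    return sum(p * diag_gaussian_pdf(zero, m - c, gamma + v)
               for p, m, v in zip(weights, means, variances))
```
<br />
When all <math>\Sigma_i</math> are zero, each term reduces to the ordinary RBF response <math>N(c, \Gamma)(m_i)</math>.<br />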
<br />
In the case where the data set contains no missing values, the generalized neurons reduce to classical ones, since the distribution <math>F</math> is only used to estimate possible values at missing attributes. However, if one wishes to use an incomplete data set in the testing stage, then an incomplete data set must be used to train the model.<br />
<br />
<br />
== Theoretical Analysis ==<br />
<br />
The main theoretical results, summarized below, show that using the generalized neuron activation at the first layer does not lead to a loss of information. <br />
<br />
Let the generalized response of a neuron <math>n: \mathbb{R}^D \rightarrow \mathbb{R}</math> evaluated on a probability measure <math>\mu</math> be given by <br />
<center><math>n(\mu) := \int n(x)d\mu(x).</math></center><br />
<br />
Theorem 4.1 shows that a neural network with generalized ReLU units is able to distinguish any two probability measures. The proof presented by the authors uses the Universal Approximation Property (UAP) and is summarized below. <br />
<br />
<br />
'''Theorem 4.1.''' Let <math>\mu</math>, <math>\nu</math> be probability measures satisfying <math>\int ||x|| d \mu(x) < \infty</math>. If <br />
<center><math>ReLU_{w,b}(\mu) = ReLU_{w,b}(\nu) \text{ for all } w \in \mathbb{R}^D, b \in \mathbb{R}</math></center> then <math>\nu = \mu.</math><br />
<br />
''Sketch of Proof'' Let <math>w \in \mathbb{R}^D</math> be fixed and define the set <center><math>F_w = \{p: \mathbb{R} \rightarrow \mathbb{R}: \int p(w^Tx)d\mu(x) = \int p(w^Tx)d\nu(x)\}.</math></center> The first step of the proof involves showing that <math>F_w</math> contains all continuous and bounded functions. The authors show this by showing that a piecewise continuous function that is affine linear on specific intervals, <math>Q</math>, is in the set <math>F_w</math>. This involves re-writing <math>Q</math> as a sum of tent-like piecewise linear functions, <math>T</math> and showing that <math>T \in F_w</math> (since it is sufficient to only show <math>T \in F_w</math>). <br />
<br />
Next, the authors show that an arbitrary bounded continuous function <math>G</math> is in <math>F_w</math> by the Lebesgue dominated convergence theorem. <br />
<br />
Then, as <math>\cos(\cdot), \sin(\cdot) \in F_w</math>, the function <center><math>\exp(ir) = \cos(r) + i\sin(r) \in F_w</math></center> and we have the equality <center><math>\int \exp(iw^Tx)d\mu(x) = \int \exp(iw^Tx)d\nu(x).</math></center> Since <math>w</math> was arbitrarily chosen, we can conclude that <math>\mu = \nu</math> <br />
as the characteristic functions of the two measures coincide. <br />
<br />
<br />
More general results can be obtained by making stronger assumptions on the probability measures. For example, if a given family of neurons satisfies UAP, then their generalization can identify any probability measure with compact support.<br />
<br />
== Experimental Results ==<br />
The model was applied to three types of algorithms: an Autoencoder (AE), a multilayer perceptron, and a radial basis function network.<br />
<br />
'''Autoencoder'''<br />
<br />
Corrupted images were restored as a part of this experiment. Grayscale handwritten digits were obtained from the MNIST database. A 13 by 13 (169 pixels) square was removed from each 28 by 28 (784 pixels) image. The location of the square was uniformly sampled for each image. The autoencoder used included 5 hidden layers. The first layer used ReLU activation functions while the subsequent layers utilized sigmoids. The loss function was computed using pixels from outside the mask. <br />
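<br />
The corruption procedure and the masked loss can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the square is assumed to lie fully inside the image, missing pixels are marked with NaN, and mean squared error is used as a stand-in for the loss, which the summary does not specify.<br />
<br />
```python
import numpy as np


def random_square_mask(height=28, width=28, size=13, rng=None):
    """Boolean mask that is True inside a uniformly placed size-by-size square."""
    rng = np.random.default_rng() if rng is None else rng
    top = rng.integers(0, height - size + 1)    # assumes the square fits in the image
    left = rng.integers(0, width - size + 1)
    mask = np.zeros((height, width), dtype=bool)
    mask[top:top + size, left:left + size] = True
    return mask


def apply_mask(image, mask):
    """Mark the masked pixels as missing (NaN), leaving the input image untouched."""
    return np.where(mask, np.nan, image)


def loss_outside_mask(reconstruction, image, mask):
    """Mean squared error over the pixels outside the mask only."""
    return float(np.mean((reconstruction[~mask] - image[~mask]) ** 2))
```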
<br />
Popular imputation techniques were compared against the conducted experiment:<br />
<br />
''k-nn:'' Replaced missing features with the mean of respective features calculated using K nearest training samples. Here, K=5. <br />
<br />
''mean:'' Replaced missing features with the mean of respective features calculated using all incomplete training samples.<br />
<br />
''dropout:'' Dropped input neurons with missing values. <br />
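<br />
The mean and k-NN baselines can be sketched in Python as below. The summary does not say how neighbours are found when features are missing, so measuring distance over the coordinates observed in both samples is an assumption of this sketch, as are the function names and the use of NaN to mark missing values.<br />
<br />
```python
import numpy as np


def mean_impute(X):
    """Replace each NaN with the mean of its column over the observed entries."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def knn_impute(X, k=5):
    """Replace each NaN with the mean of that feature over the k nearest rows."""
    filled = X.copy()
    for r in range(X.shape[0]):
        miss = np.isnan(X[r])
        if not miss.any():
            continue
        # Coordinates observed both in row r and in the candidate row.
        both = ~miss & ~np.isnan(X)
        diff = np.where(both, X - X[r], 0.0)
        counts = both.sum(axis=1)
        dist = np.where(counts > 0, (diff ** 2).sum(axis=1) / np.maximum(counts, 1), np.inf)
        dist[r] = np.inf  # a row is never its own neighbour
        order = np.argsort(dist)
        for j in np.where(miss)[0]:
            donors = [i for i in order
                      if np.isfinite(dist[i]) and not np.isnan(X[i, j])][:k]
            if donors:
                filled[r, j] = X[donors, j].mean()
    return filled
```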
<br />
Moreover, a context encoder (CE) was trained by replacing missing features with their means. Unlike mean imputation, the complete data was used in the training phase. The method under study performed better than the imputation methods inside and outside the mask. Additionally, the method under study outperformed CE based on the whole area and area outside the mask. <br />
<br />
'''Multilayer Perceptron'''<br />
<br />
A multilayer perceptron with 3 ReLU hidden layers was applied to a multi-class classification problem on the Epileptic Seizure Recognition (ESR) data set taken from [3]. Each 178-dimensional vector (out of 11500 samples) is the EEG recording of a given person for 1 second, categorized into one of 5 classes. To generate missing attributes, 25%, 50%, 75%, and 90% of observations were randomly removed. The aforementioned imputation methods were used in addition to Multiple Imputation by Chained Equations (mice) and a mixture of Gaussians (gmm). The former uses Markov chain Monte Carlo techniques to draw imputations from the conditional distribution of the data. The latter replaces missing features with values sampled from a GMM estimated from incomplete data using the EM algorithm. <br />
<br />
Double 5-fold cross-validation was used to report classification results. The model under study outperformed classical imputation methods, which give reasonable results only for a low number of missing values. The method under study performs nearly as well as CE, even though CE had access to complete training data. <br />
<br />
'''Radial Basis Function Network'''<br />
<br />
RBFN can be considered as a minimal architecture implementing our model, which contains only one hidden layer. A cross-entropy function was applied to a softmax in the output layer. Two-class data sets retrieved from the UCI repository [4] with internally missing attributes were used. Since the classification is binary, two additional SVM kernel models which work directly with incomplete data without performing any imputations were included in the experiment:<br />
<br />
''geom:'' The objective function is based on the geometric interpretation of the margin and aims to maximize the margin of each sample in its own subspace [5].<br />
<br />
''karma:'' This algorithm iteratively tunes kernel classifier under low-rank assumptions [6].<br />
<br />
The above SVM methods were combined with an RBF kernel function. The number of RBF units was selected in the inner cross-validation from the range {25, 50, 75, 100}. Initial centers of RBFNs were randomly selected from training data while variances were sampled from the N(0,1) distribution. For SVM methods, the margin parameter <math>C</math> and kernel radius <math>\gamma</math> were both selected from <math>\{2^k :k=-5,-3,...,9\}</math>. For karma, the additional parameter <math>\gamma_{karma}</math> was selected from the set <math>\{1, 2\}</math>.<br />
<br />
The model under study outperformed imputation techniques in almost all cases. This partially confirms that using raw incomplete data in neural networks is usually a better approach than filling in missing attributes before the learning process. Moreover, it obtained more accurate results than the modified kernel methods, which work directly on incomplete data. The performance of the model was once again comparable to, and in some cases better than, that of CE, which had access to the complete data.<br />
<br />
== Conclusion ==<br />
<br />
Together, the experimental and theoretical results indicate that this novel approach to handling missing data by modifying a neural network is beneficial and outperforms many existing methods. Representing missing data with a probability density function allows the network to compute a more general and accurate neuron response.<br />
<br />
== Critiques ==<br />
<br />
- A simulation study where the mechanism of missingness is known will be interesting to examine. Doing this will allow us to see when the proposed method is better than existing methods, and under what conditions.<br />
<br />
- This method of handling incomplete data has many limitations: in most cases where data points are missing, we are dealing with a relatively small amount of data that does not require training a neural network. For a large dataset, missing records do not seem to be very crucial, because obtaining more data is relatively easy, and an empirical way of imputing data, such as a majority vote, will be sufficient.<br />
<br />
- An interesting application of this problem is in NLP. In NLP, especially Question Answering, there is a problem where a query is given and an answer must be retrieved, but the knowledge base is incomplete. There is therefore a requirement for the model to be able to infer information from the existing knowledge base in order to answer the question. Although this problem is a little more contrived than the one mentioned here, it is nevertheless similar in nature because it requires the ability to probabilistically determine some value which can then be used as a response.<br />
<br />
- The experiments in this paper evaluate this method against low amounts of missing data. It would be interesting to see the properties of this imputation when a majority of the data is missing, and see if this method can outperform dropout training in this setting (dropout is known to be surprisingly robust even at high drop levels).<br />
<br />
- This problem can possibly be applied to face recognition where given a blurry image of a person's face, the neural network can make the image clearer such that the face of the person would be visible for humans to see and also possible for the software to identify who the person is.<br />
<br />
== References ==<br />
[1] Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous data. In Advances in neural information processing systems, pages 395–401, 1996.<br />
<br />
[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.<br />
<br />
[3] Ralph G Andrzejak, Klaus Lehnertz, Florian Mormann, Christoph Rieke, Peter David, and Christian E Elger. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Physical Review E, 64(6):061907, 2001.<br />
<br />
[4] Arthur Asuncion and David J. Newman. UCI Machine Learning Repository, 2007.<br />
<br />
[5] Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1–21, 2008.<br />
<br />
[6] Elad Hazan, Roi Livni, and Yishay Mansour. Classification with low rank and missing data. In Proceedings of The 32nd International Conference on Machine Learning, pages 257–266, 2015.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_Processes&diff=45430Influenza Forecasting Framework based on Gaussian Processes2020-11-20T21:25:44Z<p>Hhalim: /* Critique */</p>
<hr />
<div><br />
== Abstract ==<br />
<br />
This paper presents a novel framework for seasonal epidemic forecasting using Gaussian process regression. Resulting retrospective forecasts, trained on a subset of the publicly available CDC influenza-like-illness (ILI) data-set, outperformed four state-of-the-art models when compared using the official CDC scoring rule (log-score).<br />
<br />
== Background ==<br />
<br />
Each year, the seasonal influenza epidemic affects public health at a massive scale, resulting in 38 million cases, 400 000 hospitalizations, and 22 000 deaths in the United States in 2019-2020 alone [1]. Given this, reliable forecasts of future influenza development are invaluable, because they allow for improved public health policies and informed resource development and allocation. Many statistical methods have been developed that use data from the CDC and other real-time data sources, such as Google Trends, to forecast influenza activity.<br />
<br />
Given the process of data collection and surveillance lag, accurate statistics for influenza warning systems are often delayed by some margin of time, making early prediction imperative. However, there are challenges in long-term epidemic forecasting. First, the temporal dependency is hard to capture with short-term input data. Without manually added seasonal trends, most statistical models fail to provide high accuracy. Second, the influence from other locations has not been exhaustively explored with limited data input. Spatio-temporal effects would therefore require adequate data sources to achieve good performance.<br />
<br />
== Related Work ==<br />
<br />
Given the value of epidemic forecasts, the CDC regularly publishes ILI data and has funded a seasonal ILI forecasting challenge. This challenge has led to four state-of-the-art models in the field, including MSS, a physical susceptible-infected-recovered model with assumed linear noise [4]; SARIMA, a framework based on seasonal auto-regressive moving average models [2]; and LinEns, an ensemble of three linear regression models.<br />
<br />
== Motivation ==<br />
<br />
It has been shown that LinEns forecasts outperform the other frameworks on the ILI data-set. However, this framework assumes a deterministic relationship between the epidemic week and its case count, which does not reflect the stochastic nature of the trend. Therefore, it is natural to ask whether a similar framework that assumes a stochastic relationship between these variables would provide better performance. This motivated the development of the proposed Gaussian process regression framework and the subsequent performance comparison to the benchmark models.<br />
<br />
== Gaussian Process Regression ==<br />
<br />
Consider the following set up: let <math>X = [\mathbf{x}_1,\ldots,\mathbf{x}_n]</math> <math>(d\times n)</math> be your training data, <math>\mathbf{y} = [y_1,y_2,\ldots,y_n]^T</math> be your noisy observations where <math>y_i = f(\mathbf{x}_i) + \epsilon_i</math>, <math>(\epsilon_i:i = 1,\ldots,n)</math> i.i.d. <math>\sim \mathcal{N}(0,{\sigma}^2)</math>, and <math>f</math> is the trend we are trying to model (by <math>\hat{f}</math>). Let <math>\mathbf{x}^*</math> <math>(d\times 1)</math> be your test data point, and <math>\hat{y} = \hat{f}(\mathbf{x}^*)</math> be your predicted outcome.<br />
<br />
<br />
Instead of assuming a deterministic form of <math>f</math>, and thus of <math>\mathbf{y}</math> and <math>\hat{y}</math> (as classical linear regression would, for example), Gaussian process regression assumes <math>f</math> is stochastic. More precisely, <math>\mathbf{y}</math> and <math>\hat{y}</math> are assumed to have a joint prior distribution. Indeed, we have <br />
<br />
$$<br />
(\mathbf{y},\hat{y}) \sim \mathcal{N}(0,\Sigma(X,\mathbf{x}^*))<br />
$$<br />
<br />
where <math>\Sigma(X,\mathbf{x}^*)</math> is a matrix of covariances dependent on some kernel function <math>k</math>. In this paper, the kernel function is assumed to be Gaussian and takes the form <br />
<br />
$$<br />
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\exp(-\frac{1}{2}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma(\mathbf{x}_i-\mathbf{x}_j)).<br />
$$<br />
<br />
It is important to note that this gives a joint prior distribution of '''functions''' ('''Fig. 1''' left, grey curves). <br />
<br />
By restricting this distribution to contain only those functions ('''Fig. 1''' right, grey curves) that agree with the observed data points <math>\mathbf{x}</math> ('''Fig. 1''' right, solid black) we obtain the posterior distribution for <math>\hat{y}</math> which has the form<br />
<br />
$$<br />
p(\hat{y} | \mathbf{x}^*, X, \mathbf{y}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}),\sigma(\mathbf{x}^*,X))<br />
$$<br />
<br />
<br />
<div style="text-align:center;"> [[File:GPRegression.png|500px]] </div><br />
<br />
<div align="center">'''Figure 1. Gaussian process regression''': Select the functions from your joint prior distribution (left, grey curves) with mean <math>0</math> (left, bold line) that agree with the observed data points (right, black bullets). These form your posterior distribution (right, grey curves) with mean <math>\mu(\mathbf{x})</math> (right, bold line). Red triangle helps compare the two images (location marker) [3]. </div><br />
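<br />
The posterior mean and variance have a closed form that can be sketched with NumPy. This is a minimal sketch under stated assumptions: data points are stored as rows rather than columns, the kernel matrix <math>\Sigma</math> is simplified to a single length scale, the hyperparameter values are placeholders, and the returned variance is that of the latent trend <math>f</math> without observation noise. The function names are illustrative.<br />
<br />
```python
import numpy as np


def rbf_kernel(A, B, signal_var=1.0, length_scale=1.0):
    """Gaussian kernel matrix between the rows of A (n, d) and the rows of B (m, d)."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale ** 2)


def gp_posterior(X_train, y_train, X_test, noise_var=1e-2, **kernel_args):
    """Posterior mean and variance of the latent trend at the test points."""
    K = rbf_kernel(X_train, X_train, **kernel_args) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, **kernel_args)
    K_ss = rbf_kernel(X_test, X_test, **kernel_args)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                            # mu(x*, X, y)
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                            # posterior covariance
    return mean, np.diag(cov)
```
<br />
Near the training points the posterior variance shrinks toward the noise level, and far from them the posterior reverts to the prior mean of zero and the prior variance, which matches the behaviour shown in Figure 1.<br />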
<br />
== Data-set ==<br />
<br />
Let <math>d_j^i</math> denote the number of epidemic cases recorded in week <math>j</math> of season <math>i</math>, and let <math>j^*</math> and <math>i^*</math> denote the current week and season, respectively. The ILI data-set contains <math>d_j^i</math> for all previous weeks and seasons, up to the current season with a 1-3 week publishing delay. Note that a season refers to the time of year when the epidemic is prevalent (e.g. an influenza season lasts 30 weeks and contains the last 10 weeks of year k, and the first 20 weeks of year k+1). The goal is to predict <math>\hat{y}_T = \hat{f}_T(x^*) = d^{i^*}_{j^* + T}</math> where <math>T, \;(T = 1,\ldots,K)</math> is the target week (how many weeks into the future that you want to predict).<br />
<br />
To do this, a design matrix <math>X</math> is constructed where each element <math>X_{ji} = d_j^i</math> corresponds to the number of cases in week (row) j of season (column) i. The training outcomes <math>y_{i,T}, i = 1,\ldots,n</math> correspond to the number of cases that were observed in target week <math>T,\; (T = 1,\ldots,K)</math> of season <math>i, (i = 1,\ldots,n)</math>.<br />
<br />
== Proposed Framework ==<br />
<br />
To compute <math>\hat{y}</math>, the following algorithm is executed. <br />
<br />
<ol><br />
<br />
<li> Let <math>J \subseteq \{j^*-4 \leq j \leq j^*\}</math> (subset of possible weeks).<br />
<br />
<li> Assemble the Training Set <math>\{X_J, \mathbf{y}_{T,J}\}</math> <br />
<br />
<li> Train the Gaussian process<br />
<br />
<li> Calculate the '''distribution''' of <math>\hat{y}_{T,J}</math> using <math>p(\hat{y}_{T,J} | \mathbf{x}^*, X_J, \mathbf{y}_{T,J}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}_{T,J}),\sigma(\mathbf{x}^*,X_J))</math><br />
<br />
<li> Set <math>\hat{y}_{T,J} =\mu(x^*,X_J,\mathbf{y}_{T,J})</math><br />
<br />
<li> Repeat steps 2-5 for all sets of weeks <math>J</math><br />
<br />
<li> Determine the best 3 performing sets J (on the 2010/11 and 2011/12 validation sets)<br />
<br />
<li> Calculate the ensemble forecast by averaging the 3 best performing predictive distribution densities i.e. <math>\hat{y}_T = \frac{1}{3}\sum_{k=1}^3 \hat{y}_{T,J_{best,k}}</math>, where <math>J_{best,k}</math> is the <math>k</math>-th best performing set of weeks<br />
<br />
</ol><br />
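<br />
The bookkeeping around the Gaussian process (steps 1 and 6 to 8) can be sketched in Python as follows. The sketch assumes that every non-empty subset of the last five weeks is a candidate set <math>J</math> and that a higher validation score is better; it averages point forecasts, following the formula in step 8. The function names and the dictionary layout are hypothetical, and the Gaussian process fit itself is left out.<br />
<br />
```python
from itertools import combinations


def candidate_week_sets(j_star, window=4):
    """All non-empty subsets J of the weeks j* - window, ..., j*."""
    weeks = range(j_star - window, j_star + 1)
    return [subset
            for size in range(1, window + 2)
            for subset in combinations(weeks, size)]


def ensemble_forecast(forecasts, validation_scores, top=3):
    """Average the point forecasts of the best-scoring week sets.

    forecasts         : dict mapping a week set J to its point forecast
    validation_scores : dict mapping a week set J to its validation score
                        (higher is assumed to be better, e.g. a mean log-score)
    Returns the ensemble forecast and the selected week sets.
    """
    best = sorted(forecasts, key=lambda J: validation_scores[J], reverse=True)[:top]
    return sum(forecasts[J] for J in best) / len(best), best
```
<br />
With a window of five weeks there are 31 candidate sets, so at most 31 Gaussian processes are trained per target week before the three best are combined.<br />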
<br />
== Results ==<br />
<br />
To demonstrate the accuracy of their results, retrospective forecasting was done on the ILI data-set. In other words, the Gaussian process model was trained to assume a previous season (2012/13) was the current season. In this fashion, the forecast could be compared to the already observed true outcome. <br />
<br />
To produce a forecast for the entire 2012/13 season, 30 Gaussian processes were trained (each influenza season has 30 test points <math>\mathbf{x}^*</math>) and a curve connecting the predicted outputs <math>y_T = \hat{f}(\mathbf{x}^*)</math> was plotted ('''Fig.2''', blue line). As shown in '''Fig.2''', this forecast (blue line) was reliable for both 1 (left) and 3 (right) week targets, given that the 95% prediction interval ('''Fig.2''', purple shaded) contained the true values ('''Fig.2''', red x's) 95% of the time.<br />
<br />
<div style="text-align:center;"> [[File:ResultsOne.png|600px]] </div><br />
<br />
<div align="center">'''Figure 2. Retrospective forecasts and their uncertainty''': One week retrospective influenza forecasting for two targets (T = 1, 3). Red x’s are the true observed values, and blue lines and purple shaded areas represent point forecasts and 95% prediction intervals, respectively. </div><br />
<br />
<br />
Moreover, as shown in '''Fig.3''', the novel Gaussian process regression framework outperformed all state-of-the-art models, including LinEns, for four different targets <math>(T = 1,\ldots, 4)</math>, when compared using the official CDC scoring criterion ''log-score''. Log-score describes the logarithmic probability of the forecast being within an interval around the true value. <br />
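<br />
For a Gaussian predictive distribution, the log-score described above can be sketched in Python as follows. The interval half-width and the lower bound applied when the probability underflows are assumptions of this sketch, chosen for illustration; the summary does not state the values used by the CDC.<br />
<br />
```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def log_score(true_value, mean, std, half_width=0.5, floor=-10.0):
    """Log of the predictive probability of an interval around the true value.

    half_width : half the width of the interval (assumed value)
    floor      : lower bound on the score (assumed value)
    """
    prob = (normal_cdf(true_value + half_width, mean, std)
            - normal_cdf(true_value - half_width, mean, std))
    if prob <= 0.0:
        return floor
    return max(math.log(prob), floor)
```
<br />
A forecast that is both accurate and confident puts most of its mass inside the interval and scores close to zero, while an overconfident forecast that misses the true value is penalized heavily.<br />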
<br />
<div style="text-align:center;"> [[File:ComparisonNew.png|600px]] </div><br />
<br />
<div align="center">'''Figure 3. Average log-score gain of proposed framework''': Each bar shows the mean seasonal log-score gain of the proposed framework vs. the given state-of-the-art model, and each panel corresponds to a different target week <math> T = 1,\ldots,4 </math>. </div><br />
<br />
== Conclusion ==<br />
<br />
This paper presented a novel framework for forecasting seasonal epidemics using Gaussian process regression that outperformed multiple state-of-the-art forecasting methods on the CDC's ILI data-set. Hence, this work may play a key role in future influenza forecasting and, as a result, in the improvement of public health policies and resource allocation.<br />
<br />
== Critique ==<br />
<br />
The proposed framework provides a computationally efficient method to forecast any seasonal epidemic count data that is easily extendable to multiple target types. In particular, one can compute key parameters such as the peak infection incidence (<math>\hat{y} = \max_{0 \leq j \leq 52} d^i_j </math>), the timing of the peak infection incidence (<math>\hat{y} = \text{argmax}_{0 \leq j \leq 52} d^i_j</math>) and the final epidemic size of a season (<math>\hat{y} = \sum_{j=1}^{52} d^i_j</math>). However, given that it is not a physical model, it cannot provide insights on parameters describing the disease spread. Moreover, the framework requires training data and hence is not applicable to non-seasonal epidemics.<br />
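<br />
As a small illustration, the three target types can be read off a season's weekly case counts as below. The function name is hypothetical, and weeks are assumed to be numbered from 1 in the order they appear in the list.<br />
<br />
```python
def season_targets(weekly_cases):
    """Peak incidence, week of the peak (1-indexed) and total size of a season."""
    peak = max(weekly_cases)
    peak_week = weekly_cases.index(peak) + 1  # first week attaining the peak
    total = sum(weekly_cases)
    return peak, peak_week, total
```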
<br />
This paper provides a state-of-the-art approach for forecasting epidemics. It would have been interesting to see other types of kernels being used, such as a periodic kernel (<math>k(x, x') = \sigma^2 \exp\left(-\frac{2 \sin^2 (\pi|x-x'|/p)}{l^2}\right) </math>), as intuitively epidemics are known to have waves within seasons. This may have resulted in better-calibrated uncertainty estimates as well.<br />
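<br />
As a rough illustration of the periodic kernel suggested above, the following sketch evaluates it for scalar inputs. The parameter names <code>sigma</code>, <code>length_scale</code> and <code>period</code> correspond to <math>\sigma</math>, <math>l</math> and <math>p</math>; the default values are arbitrary assumptions, not values from the paper.<br />

```python
import math


def periodic_kernel(x, x_prime, sigma=1.0, length_scale=1.0, period=1.0):
    """Periodic kernel k(x, x') = sigma^2 * exp(-2 sin^2(pi |x - x'| / p) / l^2)."""
    s = math.sin(math.pi * abs(x - x_prime) / period)
    return sigma ** 2 * math.exp(-2.0 * s ** 2 / length_scale ** 2)


# Points exactly one period apart are (numerically) perfectly correlated,
# while points half a period apart are the least correlated.
print(periodic_kernel(0.0, 52.0, period=52.0))
print(periodic_kernel(0.0, 26.0, period=52.0))
```

With a period of 52 weeks, this kernel would treat the same week of different seasons as highly similar, which is the intuition behind the suggestion.<br />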
<br />
It is mentioned that the framework might not be suitable for non-seasonal epidemics because it requires training data. However, given that the COVID-19 pandemic comes in multiple waves and that there is enough data from the first and second waves, it might be possible to use this framework to predict the third wave and possibly the fourth one as well. It would be interesting to see this forecasting framework trained on data from the first and second waves of COVID-19.<br />
<br />
== References ==<br />
<br />
[1] Estimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States - 2019–2020 Influenza Season. (2020). Retrieved November 16, 2020, from https://www.cdc.gov/flu/about/burden/2019-2020.html<br />
<br />
[2] Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A., and Reich, N. G. (2017). Infectious disease prediction with kernel conditional density estimation. Statistics in Medicine, 36(30):4908–4929.<br />
<br />
[3] Schulz, E., Speekenbrink, M., and Krause, A. (2017). A tutorial on Gaussian process regression with a focus on exploration-exploitation scenarios. bioRxiv.<br />
<br />
[4] Zimmer, C., Leuba, S. I., Cohen, T., and Yaesoubi, R. (2019). Accurate quantification of uncertainty in epidemic parameter estimates and predictions using stochastic compartmental models. Statistical Methods in Medical Research, 28(12):3591–3608. PMID: 30428780.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45349Superhuman AI for Multiplayer Poker2020-11-19T02:45:30Z<p>Hhalim: /* Challenges of Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman AI systems have only been able to beat human players in two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally changing their own strategy.<br />
<br />
In poker specifically, AI models had so far beaten human players only in two-player settings. Poker is a great challenge for AI and game theory because it captures the difficulties of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the major remaining milestone in this field. Part of the difficulty is that no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite the lack of strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable in the sense that it is guaranteed not to lose in expectation, regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in their opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium is to pick each option with equal probability. Against this strategy, the best the opponent can do in expectation is tie, but by the same token our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the equilibrium strategy opens up the possibility that the opponent exploits us in turn: once they notice that we always play Rock, they can switch to always playing Paper.<br />
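<br />
The Rock-Paper-Scissors argument can be checked numerically. The sketch below assumes the usual payoffs of +1 for a win, -1 for a loss and 0 for a tie; the names <code>PAYOFF</code> and <code>expected_payoff</code> are illustrative only.<br />

```python
ACTIONS = ("rock", "paper", "scissors")

# PAYOFF[a][b] is our payoff when we play a and the opponent plays b.
PAYOFF = {
    "rock": {"rock": 0, "paper": -1, "scissors": 1},
    "paper": {"rock": 1, "paper": 0, "scissors": -1},
    "scissors": {"rock": -1, "paper": 1, "scissors": 0},
}


def expected_payoff(ours, theirs):
    """Expected payoff of mixed strategy `ours` against mixed strategy `theirs`."""
    return sum(
        ours[a] * theirs[b] * PAYOFF[a][b] for a in ACTIONS for b in ACTIONS
    )


uniform = {a: 1.0 / 3.0 for a in ACTIONS}
always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
always_paper = {"rock": 0.0, "paper": 1.0, "scissors": 0.0}
always_scissors = {"rock": 0.0, "paper": 0.0, "scissors": 1.0}

# The equilibrium strategy only ties, even against a very weak opponent.
print(expected_payoff(uniform, always_scissors))
# Exploiting the opponent wins, but only until the opponent adapts.
print(expected_payoff(always_rock, always_scissors))
print(expected_payoff(always_rock, always_paper))
```

The uniform strategy is safe but never profits, while the exploiting strategy profits only as long as the opponent does not counter-adapt.<br />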
<br />
Even approximating a Nash equilibrium is hard in theory, and in games with more than two players, existing algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games. Moreover, even if a Nash equilibrium could be computed efficiently in a game with more than two players, it is questionable whether playing it would be a good choice. If each player independently computes and plays their own Nash equilibrium, there may be infinitely many equilibria to choose from, and the resulting combination of strategies might not be a Nash equilibrium at all.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and the goal of each player is to find a spot on the ring that is as far away as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. In the left circle, there are three different Nash equilibria, distinguished by colour, each of which would benefit everyone. The right circle illustrates what would happen if each player calculated their own Nash equilibrium independently.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
The right circle in Figure 1 shows that when each player independently calculates their own Nash equilibrium, the resulting combination of locations might not be a Nash equilibrium, so the players are not choosing the best possible locations. This suggests that attempting to find a Nash equilibrium is not necessarily the best approach outside of two-player zero-sum games, and that the goal should not be a specific game-theoretic solution concept. Instead, the focus should be on strategies that, empirically, consistently defeat human opponents.<br />
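<br />
The Lemonade Stand argument can be illustrated with a small sketch. It assumes a ring of circumference 1, treats "equally spaced positions" as the equilibria, and uses the distance to the nearest neighbour as a stand-in for each player's payoff; the rotation offsets are made up for illustration.<br />

```python
from fractions import Fraction


def nearest_neighbour_gaps(positions):
    """Distance from each stand to its nearest neighbour on a ring of circumference 1."""
    gaps = []
    for i, p in enumerate(positions):
        dists = []
        for j, q in enumerate(positions):
            if i != j:
                d = abs(p - q) % 1
                dists.append(min(d, 1 - d))  # go around the ring the short way
        gaps.append(min(dists))
    return gaps


def equilibrium(offset):
    """Four equally spaced positions, rotated by `offset`."""
    return [(offset + Fraction(k, 4)) % 1 for k in range(4)]


# All players agree on the same equilibrium.
coordinated = equilibrium(Fraction(0))

# Player k takes slot k, but each from a differently rotated equilibrium
# (hypothetical offsets).
offsets = [Fraction(0), Fraction(1, 10), Fraction(1, 5), Fraction(-1, 10)]
uncoordinated = [equilibrium(o)[k] for k, o in enumerate(offsets)]

print(nearest_neighbour_gaps(coordinated))
print(nearest_neighbour_gaps(uncoordinated))
```

In the coordinated case every stand is a quarter of the ring away from its nearest neighbour, whereas in the uncoordinated case some stands end up almost next to each other even though every player followed some equilibrium.<br />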
<br />
== Theoretical Analysis ==<br />
Pluribus starts from a precomputed strategy, called the “blueprint strategy”, which it improves by searching in real time for a better strategy in the situations it encounters during the game. In the first betting round, where the number of decision points is small, Pluribus plays according to the blueprint strategy. To make the computation scalable, Pluribus relies on abstraction: some actions are eliminated from consideration, and similar decision points are grouped together and treated as identical. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the number of bet sizes it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information, for instance similar combinations of the player’s cards and the revealed board cards. Information abstraction is only used to reason about situations in future betting rounds, never the current betting round.<br />
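<br />
The idea of action abstraction can be sketched as follows. This is not Pluribus's actual abstraction: the function <code>candidate_bets</code> and the pot fractions are hypothetical, and the sketch only shows how a small menu of bet sizes can replace the full range of legal bets.<br />

```python
def candidate_bets(pot, stack, pot_fractions=(0.5, 1.0, 2.0)):
    """Return a small menu of bet sizes instead of every legal chip amount.

    pot and stack are chip counts; pot_fractions is an assumed, illustrative
    menu of bet sizes expressed as fractions of the pot.
    """
    sizes = {min(round(f * pot), stack) for f in pot_fractions}
    sizes.add(stack)  # going all-in is always kept as an option
    return sorted(s for s in sizes if s > 0)


# With a pot of 100 and a stack of 1000 there are 1000 legal bet sizes,
# but only four are considered here.
print(candidate_bets(100, 1000))
# Bets larger than the stack collapse into the all-in bet.
print(candidate_bets(100, 150))
```

Shrinking the action set in this way reduces the branching factor of the game tree, at the cost of ignoring bet sizes outside the menu.<br />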
<br />
The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in AI for imperfect-information games; the AI is trained by repeatedly playing against copies of itself, without any data from human or prior AI play as input. To make CFR computable in this context, poker is represented as a game tree: a tree structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken. <br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker) </div><br />
<br />
At the start of each iteration, MCCFR randomly simulates a hand of poker (the cards held by a player at a given time) and designates one player as the traverser of the game tree. Once the hand is completed, the AI reviews each decision made by the traverser and investigates whether the decision was profitable. It compares the decision with the other actions available to the traverser at that point, and with the hypothetical future decisions that would have followed those other actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what the traverser actually expected to receive on that iteration. The counterfactual regret of each action is updated over the iterations as more scenarios and decision points are encountered.<br />
<br />
[REGRET FACTOR]<br />
<br />
Regret is thus a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR reduces the average regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. CFR guarantees that, in all finite games, all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in the early iterations to reduce the influence of the initial bad iterations, i.e. it assigns a weight of T to the regret contributions at iteration T.<br />
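<br />
The update described above can be sketched for a single decision point. This is a minimal illustration under simplifying assumptions, not the MCCFR implementation used for Pluribus: the function names are made up, the action values are taken as given, and the strategy is chosen in proportion to positive cumulative regret (regret matching).<br />

```python
def regret_matching(cumulative_regrets):
    """Choose actions with probability proportional to their positive regret."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total == 0:
        # No action is regretted: fall back to the uniform strategy.
        n = len(cumulative_regrets)
        return [1.0 / n] * n
    return [p / total for p in positive]


def update_regrets(cumulative_regrets, action_values, strategy, iteration, linear=True):
    """Add this iteration's regrets; Linear CFR weights iteration T by T."""
    # What the traverser expected to receive under the current strategy.
    expected = sum(p * v for p, v in zip(strategy, action_values))
    weight = iteration if linear else 1
    return [
        r + weight * (v - expected)
        for r, v in zip(cumulative_regrets, action_values)
    ]


regrets = [0.0, 0.0]
strategy = regret_matching(regrets)            # uniform at the start
regrets = update_regrets(regrets, [1.0, -1.0], strategy, iteration=3)
print(regrets)
print(regret_matching(regrets))                # shifts towards the first action
```

Because later iterations receive larger weights, regrets accumulated during the early, poorly informed iterations are quickly outweighed.<br />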
<br />
An additional feature of Pluribus is that in subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose among k different strategies, specialized to that player, when a decision point is reached. This leads the searcher to choose a more balanced strategy. For instance, if a player only ever bets when holding the best possible hand, the opponents would learn that fact and always fold in response. A balanced strategy avoids being predictable in this way.<br />
<br />
In summary, the blueprint strategy is computed offline for the entire game and is then improved in real time as decisions are made during play.<br />
<br />
== Experimental Results ==<br />
Pluribus was evaluated against human players in two formats. The first format consisted of 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. In this format, 10,000 hands of poker were played over 12 days. The players were anonymized by giving each of them an alias that remained consistent throughout all their games, which allowed players to keep track of each other's tendencies and styles of play over the 10,000 hands. <br />
<br />
The second format consisted of one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), i.e. thousandths of a big blind (the initial amount of money the second player has to put in the pot) won per hand, which is the standard measure in the AI field. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
With AIVAT applied, the results were as follows:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
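<br />
The p-values in the table can be roughly sanity-checked from the mean and the standard error. The sketch below uses a normal approximation instead of the t-test mentioned above, and it starts from the rounded table values, so it should only be expected to land close to the reported numbers (it gives about 0.027 and 0.016), not to reproduce them exactly.<br />

```python
import math


def one_tailed_p_value(mean, standard_error):
    """One-tailed p-value for 'mean > 0' under a normal approximation."""
    z = mean / standard_error
    # Upper tail of the standard normal distribution: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Rounded values taken from the table above (mbb/game).
print(one_tailed_p_value(48, 25))  # 5H+1AI
print(one_tailed_p_value(32, 15))  # 1H+5AI
```

Both approximate p-values fall below 0.05, in line with the conclusion that Pluribus was profitable in both formats.<br />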
<br />
[[File:left.PNG| 425px | x215px |left]] [[File:right.PNG| 425px | x215px |right ]]<br />
<br />
<div align="center">Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot.</div> <br />
<br />
Because Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained. Pluribus confirms the conventional wisdom that “limping” (calling the 'big blind' rather than folding or raising) is suboptimal: it initially experimented with limping but eliminated it from its strategy over its games of self-play. On the other hand, “donk betting” (starting a round by betting when one ended the previous round with a call), which is generally dismissed by players, was adopted by Pluribus much more often than by humans, which suggests that it can be profitable.<br />
<br />
<br />
== Discussion and Critiques ==<br />
<br />
The blueprint strategy for Pluribus relies on two abstraction methods, which reduce the computational power required. As a result, the blueprint strategy was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to the other recent superhuman AI milestones for games. The researchers thus found an effective way to condense the problem to fit current computational resources. <br />
<br />
Pluribus shows that observational data and empirical results can be used to construct a superhuman AI without requiring theoretical guarantees. This can serve as a baseline for future AI systems and help AI research. It would be interesting to apply Pluribus's empirically driven approach to more real-life problems such as autonomous driving or stock market trading.<br />
<br />
== Conclusion ==<br />
<br />
Because Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect-information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45347Superhuman AI for Multiplayer Poker2020-11-19T02:35:06Z<p>Hhalim: /* Discussion and Critiques */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman AI systems have only been able to beat human players in two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally changing their own strategy.<br />
<br />
In poker specifically, AI models had so far beaten human players only in two-player settings. Poker is a great challenge for AI and game theory because it captures the difficulties of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the major remaining milestone in this field. Part of the difficulty is that no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite the lack of strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable in the sense that it is guaranteed not to lose in expectation, regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in their opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium is to pick each option with equal probability. Against this strategy, the best the opponent can do in expectation is tie, but by the same token our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the equilibrium strategy and then shift our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the equilibrium strategy opens up the possibility that the opponent exploits us in turn: once they notice that we always play Rock, they can switch to always playing Paper.<br />
<br />
Even approximating a Nash equilibrium is hard in theory, and in games with more than two players, existing algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games. Moreover, even if a Nash equilibrium could be computed efficiently in a game with more than two players, it is questionable whether playing it would be a good choice. If each player independently computes and plays their own Nash equilibrium, there may be infinitely many equilibria to choose from, and the resulting combination of strategies might not be a Nash equilibrium at all.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and the goal of each player is to find a spot on the ring that is as far away as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. In the left circle, there are three different Nash equilibria, distinguished by colour, each of which would benefit everyone. The right circle illustrates what would happen if each player calculated their own Nash equilibrium independently.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
The right circle in Figure 1 shows that when each player independently calculates their own Nash equilibrium, the resulting combination of locations might not be a Nash equilibrium, so the players are not choosing the best possible locations. This suggests that attempting to find a Nash equilibrium is not necessarily the best approach outside of two-player zero-sum games, and that the goal should not be a specific game-theoretic solution concept. Instead, the focus should be on strategies that, empirically, consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus starts from a precomputed strategy, called the “blueprint strategy”, which it improves by searching in real time for a better strategy in the situations it encounters during the game. In the first betting round, where the number of decision points is small, Pluribus plays according to the blueprint strategy. To make the computation scalable, Pluribus relies on abstraction: some actions are eliminated from consideration, and similar decision points are grouped together and treated as identical. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the number of bet sizes it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information, for instance similar combinations of the player’s cards and the revealed board cards. Information abstraction is only used to reason about situations in future betting rounds, never the current betting round.<br />
<br />
The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in AI for imperfect-information games; the AI is trained by repeatedly playing against copies of itself, without any data from human or prior AI play as input. To make CFR computable in this context, poker is represented as a game tree: a tree structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken. <br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker) </div><br />
<br />
At the start of each iteration, MCCFR randomly simulates a hand of poker (the cards held by a player at a given time) and designates one player as the traverser of the game tree. Once the hand is completed, the AI reviews each decision made by the traverser and investigates whether the decision was profitable. It compares the decision with the other actions available to the traverser at that point, and with the hypothetical future decisions that would have followed those other actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what the traverser actually expected to receive on that iteration. The counterfactual regret of each action is updated over the iterations as more scenarios and decision points are encountered.<br />
<br />
[REGRET FACTOR]<br />
<br />
Regret is thus a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR reduces the average regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. CFR guarantees that, in all finite games, all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in the early iterations to reduce the influence of the initial bad iterations, i.e. it assigns a weight of T to the regret contributions at iteration T.<br />
<br />
An additional feature of Pluribus is that in subgames, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose among k different strategies, specialized to that player, when a decision point is reached. This leads the searcher to choose a more balanced strategy. For instance, if a player only ever bets when holding the best possible hand, the opponents would learn that fact and always fold in response. A balanced strategy avoids being predictable in this way.<br />
<br />
In summary, the blueprint strategy is computed offline for the entire game and is then improved in real time as decisions are made during play.<br />
<br />
== Experimental Results ==<br />
Pluribus was evaluated against human players in two formats. The first format consisted of 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. In this format, 10,000 hands of poker were played over 12 days. The players were anonymized by giving each of them an alias that remained consistent throughout all their games, which allowed players to keep track of each other's tendencies and styles of play over the 10,000 hands. <br />
<br />
The second format consisted of one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), i.e. thousandths of a big blind (the initial amount of money the second player has to put in the pot) won per hand, which is the standard measure in the AI field. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
With AIVAT applied, the results were as follows:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
<br />
[[File:left.PNG| 425px | x215px |left]] [[File:right.PNG| 425px | x215px |right ]]<br />
<br />
<div align="center">Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot.</div> <br />
<br />
Because Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained. Pluribus confirms the conventional wisdom that “limping” (calling the 'big blind' rather than folding or raising) is suboptimal: it initially experimented with limping but eliminated it from its strategy over its games of self-play. On the other hand, “donk betting” (starting a round by betting when one ended the previous round with a call), which is generally dismissed by players, was adopted by Pluribus much more often than by humans, which suggests that it can be profitable.<br />
<br />
<br />
== Discussion and Critiques ==<br />
<br />
The blueprint strategy for Pluribus relies on two abstraction methods, which reduce the computational power required. As a result, the blueprint strategy was computed in 8 days, required less than 512 GB of memory, and cost about $144 to produce. This is in sharp contrast to the other recent superhuman AI milestones for games. The researchers thus found an effective way to condense the problem to fit current computational resources. <br />
<br />
Pluribus shows that observational data and empirical results can be used to construct a superhuman AI without requiring theoretical guarantees. This can serve as a baseline for future AI systems and help AI research. It would be interesting to apply Pluribus's empirically driven approach to more real-life problems such as autonomous driving or stock market trading.<br />
<br />
== Conclusion ==<br />
<br />
Because Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect-information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45345Superhuman AI for Multiplayer Poker2020-11-19T02:23:37Z<p>Hhalim: /* Discussion */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman AI systems have only been able to beat human players in two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally changing their own strategy.<br />
<br />
In poker specifically, AI models had so far beaten human players only in two-player settings. Poker is a great challenge for AI and game theory because it captures the difficulties of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the major remaining milestone in this field. Part of the difficulty is that no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite the lack of strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable: it is guaranteed not to lose in expectation regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium rather than actively detecting and exploiting weaknesses in their opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium is to pick each option with equal probability. Against this strategy, the best the opponent can do is tie in expectation, but our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the Nash equilibrium strategy and shift our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy allows the opponent to exploit us in turn. For example, they may notice that we always play Rock and switch to always playing Paper.<br />
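The Rock-Paper-Scissors argument can be checked numerically. The following is a minimal sketch, not code from the paper; the function names and the +1/0/-1 payoff convention are assumptions made for illustration. It computes the expected payoff between two mixed strategies and the most an opponent can gain against a given strategy.

```python
from itertools import product

ACTIONS = ("rock", "paper", "scissors")
WINS = {("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")}


def payoff(a, b):
    """Payoff to the player choosing `a` against `b`: +1 win, 0 tie, -1 loss."""
    if a == b:
        return 0
    return 1 if (a, b) in WINS else -1


def expected_payoff(ours, theirs):
    """Expected payoff to us when both sides play mixed strategies (dicts of probabilities)."""
    return sum(ours[a] * theirs[b] * payoff(a, b) for a, b in product(ACTIONS, ACTIONS))


def best_response_value(strategy):
    """Largest expected payoff an opponent can obtain against `strategy`."""
    return max(
        expected_payoff({a: 1.0 if a == pure else 0.0 for a in ACTIONS}, strategy)
        for pure in ACTIONS
    )


uniform = {a: 1 / 3 for a in ACTIONS}
always_rock = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
always_scissors = {"rock": 0.0, "paper": 0.0, "scissors": 1.0}

print(best_response_value(uniform))                   # what an opponent can gain against the equilibrium
print(expected_payoff(always_rock, always_scissors))  # exploiting an always-Scissors opponent
print(best_response_value(always_rock))               # what an opponent can gain once we always play Rock
```

The uniform strategy gives the opponent nothing to gain, while the exploitative always-Rock strategy wins against always-Scissors but can itself be fully exploited.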
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players even the best complete algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require too many samples and are not competitive outside of small games. Finding a Nash equilibrium in games with three or more players is therefore a challenge in itself. Moreover, even if we could efficiently compute a Nash equilibrium in a game with more than two players, it is questionable whether playing it would be a good choice. If each player independently computes and plays their own Nash equilibrium, the resulting combination of strategies might not be a Nash equilibrium at all, since a game can have many (even infinitely many) equilibria.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and each player's goal is to choose a spot on the ring that is as far as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by colour, each of which would benefit everyone. The right circle illustrates what happens if each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibrium, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus starts from a precomputed strategy, called the blueprint strategy, and improves on it by searching in real time for a better strategy in the situations it encounters during the game. In the first betting round, where the number of decision points is small, Pluribus plays according to the blueprint strategy. Pluribus uses abstraction to make the computation scalable: to cope with the very large number of decision points, some actions are eliminated from consideration, and similar decision points are grouped together and treated as identical. Pluribus uses two kinds of abstraction: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the number of bet sizes it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that are similar in terms of the information that has been revealed (in poker, the player’s cards and the revealed board cards). During live play, information abstraction is used only to reason about situations on future betting rounds, never the current betting round.<br />
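As a rough illustration of the two kinds of abstraction, here is a minimal sketch. It is not the abstraction used by Pluribus: the pot fractions, the hand-strength score in [0, 1], the number of buckets, and both function names are hypothetical choices made only to show the idea.

```python
def abstract_bets(pot, stack, pot_fractions=(0.5, 1.0, 2.0)):
    """Action abstraction: keep only a few candidate bet sizes (hypothetical pot fractions)."""
    bets = sorted({min(stack, round(pot * f)) for f in pot_fractions})
    return [b for b in bets if b > 0]


def bucket_of(strength, n_buckets=5):
    """Information abstraction: map a hand-strength score in [0, 1] to one of n_buckets groups."""
    return min(int(strength * n_buckets), n_buckets - 1)


# Every possible chip amount is replaced by three candidate bets.
print(abstract_bets(pot=100, stack=10_000))

# Two hands of similar strength land in the same bucket and are treated as identical.
print(bucket_of(0.81), bucket_of(0.85))
```

Both functions shrink the game: the first limits how many edges leave a decision point, the second limits how many distinct decision points there are.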
<br />
The blueprint strategy is computed using the Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in AI for imperfect-information games; it trains by repeatedly playing against copies of itself, without any data from human or prior AI play as input. To make CFR computable in this context, poker is represented as a game tree: a tree structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
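The game tree described above can be sketched as a small data structure. This is an illustrative toy, not the representation used in the paper; the `Node` class, its field names, and the tiny example game (one chance event followed by one decision) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    kind: str                     # "decision", "chance", or "terminal"
    player: Optional[int] = None  # acting player at a decision node
    payoff: float = 0.0           # payoff to player 0 at a terminal node
    children: Dict[str, "Node"] = field(default_factory=dict)  # edge label -> child node


def count_nodes(node: Node, kind: str) -> int:
    """Count the nodes of a given kind in the subtree rooted at `node`."""
    own = 1 if node.kind == kind else 0
    return own + sum(count_nodes(child, kind) for child in node.children.values())


def decision(bet_payoff: float, check_payoff: float) -> Node:
    """A decision node for player 0 with two actions, each leading to a terminal outcome."""
    return Node("decision", player=0, children={
        "bet": Node("terminal", payoff=bet_payoff),
        "check": Node("terminal", payoff=check_payoff),
    })


# Chance deals a high or low card, then player 0 bets or checks.
root = Node("chance", children={
    "deal_high": decision(2.0, 1.0),
    "deal_low": decision(-2.0, -1.0),
})

print(count_nodes(root, "decision"), count_nodes(root, "terminal"))
```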
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 2: Kuhn Poker (a simpler form of poker)</div><br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker with randomly dealt cards and designates one player as the traverser of the game tree. Once the hand is complete, the AI reviews each decision made by the traverser and investigates how much better or worse it would have done by choosing the other available actions instead, taking into account the hypothetical future decisions that would have followed those actions. Decisions are evaluated using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what the traverser actually expected to receive on that iteration. The counterfactual regret for each action is updated over the iterations as more scenarios are encountered.<br />
<br />
[REGRET FACTOR]<br />
<br />
Regret is thus a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with it, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. In all finite games, CFR guarantees that counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in the early iterations to reduce the influence of the poor initial iterations, i.e. it assigns a weight of T to the regret contributions at iteration T.<br />
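The update described above can be illustrated on Rock-Paper-Scissors, where there is a single decision point and counterfactual regret reduces to ordinary regret. The sketch below is not the MCCFR implementation from the paper: it uses exact expected values instead of sampled hands, applies the linear weighting on every iteration rather than only the early ones, and all names are assumptions. Each action's regret is its value minus the value of the current strategy, and actions are then played in proportion to their positive accumulated regret (regret matching).

```python
# PAYOFF[a][b]: payoff to a player choosing a against b (0 = rock, 1 = paper, 2 = scissors)
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]


def regret_matching(regrets):
    """Play each action in proportion to its positive regret (uniform if none is positive)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def self_play(iterations, linear=True):
    """Two regret-matching players train against each other; returns their average strategies."""
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # player 0 starts biased towards rock
    strategy_sum = [[0.0] * 3, [0.0] * 3]
    for t in range(1, iterations + 1):
        weight = t if linear else 1  # Linear CFR style: iteration t gets weight t
        strategies = [regret_matching(r) for r in regrets]
        for p in (0, 1):
            opponent = strategies[1 - p]
            action_values = [sum(PAYOFF[a][b] * opponent[b] for b in range(3)) for a in range(3)]
            expected = sum(strategies[p][a] * action_values[a] for a in range(3))
            for a in range(3):
                regrets[p][a] += weight * (action_values[a] - expected)
                strategy_sum[p][a] += weight * strategies[p][a]
    return [[s / sum(row) for s in row] for row in strategy_sum]


average = self_play(20_000)
print(average)  # both average strategies should be close to the uniform equilibrium
```

Note that it is the average strategy, not the final one, that approaches the equilibrium; the per-iteration strategies keep cycling.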
<br />
An additional feature of Pluribus is that, during search in subgames, it does not assume that all players follow a single fixed strategy. Instead, Pluribus assumes that each player may choose among k different strategies, specialized to that player, when a decision point is reached. This leads the searcher to choose a more balanced strategy. For instance, if a player bet only when holding the best possible hand, the opponents would learn this and always fold in response, so a balanced strategy cannot be that predictable.<br />
<br />
In summary, the blueprint strategy is produced offline for the entire game and is then improved through real-time search while decisions are being made during the game.<br />
<br />
== Experimental Results ==<br />
Pluribus was evaluated against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally, and they were given cash incentives to play their best. In the 5H+1AI format, 10,000 hands of poker were played over 12 days. Players were anonymized: each was given an alias that remained consistent throughout all their games, which let the players keep track of each other's tendencies over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), i.e. thousandths of a big blind won per game, where the big blind is the initial amount of money the second player has to put in the pot. This is the standard measure in the AI field. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, the results were as follows:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
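The relationship between the win rate, its standard error, and the p-value can be sketched as follows. This is only an approximation under stated assumptions: it uses a normal approximation to the one-tailed t-test (reasonable with thousands of hands) and the rounded values from the table, so it will not reproduce the reported p-values exactly. The function name is hypothetical.

```python
import math


def one_tailed_p(mean, standard_error):
    """P(Z >= mean / standard_error) for a standard normal Z (normal approximation)."""
    z = mean / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))


# (average mbb/game, standard error in mbb/game), rounded values from the table above
results = {"5H+1AI": (48, 25), "1H+5AI": (32, 15)}

for name, (mean, se) in results.items():
    p = one_tailed_p(mean, se)
    print(f"{name}: z = {mean / se:.2f}, approximate p = {p:.3f}, profitable at 5% level: {p < 0.05}")
```

In both formats the approximate p-value falls below 0.05, in line with the conclusion that Pluribus is profitable.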
<br />
[[File:left.PNG| 425px | x215px |left]] [[File:right.PNG| 425px | x215px |right ]]<br />
<br />
<div align="center">Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot.</div> <br />
<br />
Because Pluribus’s strategy was developed purely through self-play, without any human data, it offers an unbiased and different perspective on what optimal play looks like. Pluribus confirms the conventional wisdom that “limping” (calling the 'big blind' rather than folding or raising) is suboptimal: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting” (starting a round by betting when someone else ended the previous round with a call), which is generally dismissed by human players, was used by Pluribus much more often than by humans and proved to be profitable.<br />
<br />
<br />
== Discussion ==<br />
<br />
The blueprint strategy for Pluribus was computed in 8 days on a 64-core server, for a total of 12,400 CPU core hours, and required less than 512 GB of memory. At current cloud computing spot instance rates, this would cost about $144 to produce. This is in sharp contrast to the other recent superhuman AI milestones for games, which used large numbers of servers and/or farms of graphics processing units (GPUs). More memory and computation would enable a finer-grained blueprint that would lead to better performance, but would also result in Pluribus using more memory or being slower during real-time search. The authors set the size of the blueprint strategy abstraction to allow Pluribus to run during live play on a machine with no more than 128 GB of memory while storing a compressed form of the blueprint strategy in memory.<br />
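As a quick sanity check on these figures, the arithmetic can be sketched as below. The variable names are illustrative, and the implied price per core hour is derived from the reported totals rather than stated in the paper.

```python
days, cores = 8, 64
approx_core_hours = days * 24 * cores  # close to the reported total; "8 days" is presumably rounded

reported_core_hours = 12_400
reported_cost_usd = 144
implied_usd_per_core_hour = reported_cost_usd / reported_core_hours

print(approx_core_hours)
print(f"{implied_usd_per_core_hour:.4f} USD per core hour")
```

The implied rate of roughly one cent per core hour shows how modest the training budget was.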
<br />
Pluribus shows that observational data and empirical results can be used to construct a superhuman AI without requiring theoretical guarantees. This can serve as a baseline for future AI systems and help further AI research.<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained by self-play only, it is an unbiased and different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45235Superhuman AI for Multiplayer Poker2020-11-18T00:18:14Z<p>Hhalim: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman game-playing AI systems have been limited to two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally switching to a different strategy.<br />
<br />
In poker specifically, AI models had previously beaten top human players only in two-player settings. Poker is a long-standing challenge problem in AI and game theory because it captures the difficulties of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the major remaining milestone in this field. It is also a hard one: no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, they produce a strategy strong enough to consistently defeat elite human professionals, which shows that, despite the lack of strong theoretical guarantees on performance, these techniques can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable: it is guaranteed not to lose in expectation regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium rather than actively detecting and exploiting weaknesses in their opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium is to pick each option with equal probability. Against this strategy, the best the opponent can do is tie in expectation, but our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the Nash equilibrium strategy and shift our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy allows the opponent to exploit us in turn. For example, they may notice that we always play Rock and switch to always playing Paper.<br />
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players even the best complete algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require too many samples and are not competitive outside of small games. Finding a Nash equilibrium in games with three or more players is therefore a challenge in itself. Moreover, even if we could efficiently compute a Nash equilibrium in a game with more than two players, it is questionable whether playing it would be a good choice. If each player independently computes and plays their own Nash equilibrium, the resulting combination of strategies might not be a Nash equilibrium at all, since a game can have many (even infinitely many) equilibria.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and each player's goal is to choose a spot on the ring that is as far as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by colour, each of which would benefit everyone. The right circle illustrates what happens if each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibrium, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play as input. The AI starts by playing randomly and gradually improves its strategy. This self-play produces, offline, a strategy for the entire game, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy for the situations in which it finds itself.<br />
<br />
Pluribus uses forms of abstraction to make the computation scalable. There are far too many decision points in no-limit Texas hold'em poker (a decision point is a point in the game where a player must choose an action, such as calling, folding or checking). To reduce the complexity of the game, we eliminate some actions from consideration and bucket similar decision points together, a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider: Pluribus only considers a few different bet sizes at any step, and the exact number varies between 1 and 14 depending on the situation. Information abstraction, which drastically reduces the complexity of the game, buckets together decision points that are similar in terms of the information that has been revealed (in poker, the player’s cards and the revealed board cards). During actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in. Information abstraction is also applied during offline self-play.<br />
<br />
Pluribus plays according to the blueprint strategy in the first betting round, where the number of decision points is small enough. Later in the game, when searching in subgames, Pluribus does not assume that all players follow a single strategy; instead, it assumes that each player may choose among k different strategies, specialized to that player, when a leaf node is reached. This leads the searcher to find a more balanced strategy, because choosing an unbalanced strategy (e.g. always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g. always playing Paper).<br />
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker is conveniently represented as a game tree, a computationally favorable structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker with randomly dealt cards and designates one player as the traverser of the game tree. Once the hand is complete, the AI reviews each decision made by the traverser and investigates how much better or worse it would have done by choosing the other available actions instead, taking into account the hypothetical future decisions that would have followed those actions. Decisions are assessed using counterfactual regret: the difference between what the traverser would have expected to receive for choosing an action and what the traverser actually expected to receive on that iteration. The counterfactual regret for each action is updated over the iterations as more scenarios are encountered.<br />
<br />
[Regret factor equation]<br />
<br />
Regret is thus a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with it, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. In all finite games, CFR guarantees that counterfactual regrets grow sublinearly in the number of iterations. Linear CFR, the variant used by Pluribus, assigns a weight of T to the regret contributions at iteration T.<br />
<br />
== Experimental Results ==<br />
Pluribus was evaluated against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally, and they were given cash incentives to play their best. In the 5H+1AI format, 10,000 hands of poker were played over 12 days. Players were anonymized: each was given an alias that remained consistent throughout all their games, which let the players keep track of each other's tendencies over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), i.e. thousandths of a big blind won per game, where the big blind is the initial amount of money the second player has to put in the pot. This is the standard measure in the AI field. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, Pluribus won an average of 48 mbb/game in the first format, with a standard error of 25 mbb/game. This is a very high win rate for 6-player Texas hold’em poker. The p-value of Pluribus being profitable in this format was 0.028. In the second format, Pluribus won an average of 32 mbb/game, with a standard error of 15 mbb/game, and was found to be profitable with a p-value of 0.014.<br />
<br />
Because Pluribus’s strategy was developed purely through self-play, without any human data, it offers an unbiased and different perspective on what optimal play looks like. Pluribus confirms the conventional wisdom that “limping” is suboptimal: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting”, which is generally dismissed by human players, was used by Pluribus much more often than by humans and proved to be profitable.<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45214Superhuman AI for Multiplayer Poker2020-11-17T23:10:52Z<p>Hhalim: /* Challenges of Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman game-playing AI systems have been limited to two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally switching to a different strategy.<br />
<br />
In poker specifically, AI models had previously beaten top human players only in two-player settings. Poker is a long-standing challenge problem in AI and game theory because it captures the difficulties of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the major remaining milestone in this field. It is also a hard one: no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
The AI presented in this paper, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, they produce a strategy strong enough to consistently defeat elite human professionals, which shows that, despite the lack of strong theoretical guarantees on performance, these techniques can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable: it is guaranteed not to lose in expectation regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium rather than actively detecting and exploiting weaknesses in their opponents. Consider Rock-Paper-Scissors, where the Nash equilibrium is to pick each option with equal probability. Against this strategy, the best the opponent can do is tie in expectation, but our player cannot win in expectation either. Now suppose we combine the Nash equilibrium strategy with opponent exploitation: we start with the Nash equilibrium strategy and shift our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy allows the opponent to exploit us in turn. For example, they may notice that we always play Rock and switch to always playing Paper.<br />
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players even the best complete algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require too many samples and are not competitive outside of small games. Finding a Nash equilibrium in games with three or more players is therefore a challenge in itself. Moreover, even if we could efficiently compute a Nash equilibrium in a game with more than two players, it is questionable whether playing it would be a good choice. If each player independently computes and plays their own Nash equilibrium, the resulting combination of strategies might not be a Nash equilibrium at all, since a game can have many (even infinitely many) equilibria.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and each player's goal is to choose a spot on the ring that is as far as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by colour, each of which would benefit everyone. The right circle illustrates what happens if each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibrium, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play used as input. The AI starts by playing randomly and gradually improves its strategy as it learns which actions lead to better outcomes. This self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy for the situations in which it finds itself.<br />
<br />
Pluribus uses forms of abstraction to make computations scalable. There are far too many decision points (a decision point in the game is a point where the player chooses to call, fold or check) in no-limit Texas hold 'em poker, so to reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. Pluribus only considers a few different bet sizes at any step; the exact number varies between 1 and 14 depending on the situation. The other form is information abstraction, which drastically reduces the complexity of the game. Here, decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and the revealed board cards) are bucketed together. Information abstraction is applied during offline self-play. During actual play against humans, however, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in.<br />
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker is conveniently represented as a game tree, a computationally favorable structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker (the cards held by the players at a given time) at random and designates one player as the traverser of the game tree. Once the hand is complete, the AI reviews each decision made by the traverser and investigates how much better or worse it would have done by choosing the other actions available at that point, including the hypothetical future decisions that would have followed those actions. A decision is assessed using counterfactual regret, which is the difference between what the traverser would have received for choosing an action and what the traverser actually received, in expectation, on the iteration. The counterfactual regret of each action is updated over the iterations as more scenarios and decision points are encountered.<br />
<br />
[Regret factor equation]<br />
<br />
Thus regret is a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR guarantees that, in all finite games, all counterfactual regrets grow sublinearly in the number of iterations. In two-player zero-sum games, this means the average strategy over all iterations converges to an approximate Nash equilibrium. To speed up learning, regret contributions at iteration T are assigned a weight of T, so that later iterations count for more than the early, near-random ones.<br />
<br />
<br />
Pluribus plays according to this blueprint strategy in the first betting round, where the number of decision points is small enough. Later in the game, it searches the subgame it is in. Instead of assuming that all players play according to a single fixed strategy beyond the leaf nodes of that subgame, Pluribus assumes that each player may choose among k different continuation strategies, specialized to that player, when a leaf node is reached. This leads the searcher to a more balanced strategy, because an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus performs, it was evaluated against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. The players were anonymized: each was given an alias that remained consistent throughout all their games. The aliases let the players keep track of each other's tendencies and playing styles over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and one mbb is a thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, Pluribus won 48 mbb/game on average in the first format, with a standard error of 25 mbb/game. This is a very high win rate for 6-player Texas hold’em poker, especially against elite professionals. The p-value of Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average with a standard error of 15 mbb/game and was determined to be profitable with a p-value of 0.014.<br />
<br />
As Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on what optimal play looks like. Pluribus confirms the conventional wisdom that “limping” is not optimal: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting”, which is conventionally dismissed by players, was adopted by Pluribus much more often than by humans and appears to be profitable.<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45213Superhuman AI for Multiplayer Poker2020-11-17T23:09:54Z<p>Hhalim: /* Challenges of Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman AI systems for games have been limited to two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally switching to a different strategy.<br />
<br />
In poker specifically, AI had previously beaten top humans only in two-player settings. Poker is a long-standing challenge problem in AI and game theory because it captures the difficulty of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the main remaining milestone in this field. It is also a hard one: no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite lacking strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable in the sense that it is guaranteed not to lose in expectation, regardless of what the opponent does.<br />
<br />
A shortcoming of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. For example, consider the game of Rock-Paper-Scissors, where the Nash equilibrium is to pick each option at random with equal probability. Against this strategy, every opponent strategy, including the best one, results in a tie in expectation. Therefore, in this example our player cannot lose in expectation, but cannot win in expectation either.<br />
<br />
Now let's try to combine the Nash equilibrium strategy with opponent exploitation. We can start with the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to exploit us in turn. For example, they may notice that we always play Rock and switch to always playing Paper.<br />
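<br />
The Rock-Paper-Scissors trade-off described above can be checked numerically. The following is a minimal Python sketch, not something taken from the paper; the payoff convention (+1 for a win, 0 for a tie, -1 for a loss) and all function and variable names are assumptions made for illustration.<br />

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")
# BEATS[m] is the move that m defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(ours, theirs):
    """Assumed payoff convention: +1 for a win, 0 for a tie, -1 for a loss."""
    if ours == theirs:
        return 0
    return 1 if BEATS[ours] == theirs else -1


def expected_payoff(our_strategy, their_strategy):
    """Expected payoff when both players sample moves independently."""
    return sum(
        our_strategy[a] * their_strategy[b] * payoff(a, b)
        for a in MOVES
        for b in MOVES
    )


def pure(move):
    """Strategy that always plays the given move."""
    return {m: Fraction(int(m == move)) for m in MOVES}


uniform = {m: Fraction(1, 3) for m in MOVES}

# The equilibrium strategy ties in expectation, even against a predictable opponent.
nash_vs_scissors = expected_payoff(uniform, pure("scissors"))
# Exploiting the opponent wins, but only while the opponent does not adapt.
rock_vs_scissors = expected_payoff(pure("rock"), pure("scissors"))
rock_vs_paper = expected_payoff(pure("rock"), pure("paper"))

print(nash_vs_scissors, rock_vs_scissors, rock_vs_paper)
```

Exact fractions are used so that the tie shows up as exactly zero rather than as a floating-point value close to zero.<br />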
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players, existing algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games.<br />
<br />
Finding a Nash equilibrium in games with three or more players is a hard problem in itself. Even if we could efficiently compute a Nash equilibrium in games with more than two players, it is highly questionable whether playing that equilibrium strategy would be a good choice. Moreover, a game may have infinitely many equilibria, and if each player independently computes and plays their own part of a different equilibrium, the resulting combination of strategies might not be a Nash equilibrium at all.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and the goal of each player is to choose a spot on the ring that is as far as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by colour, each of which would benefit everyone. The right circle illustrates what happens if each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player calculates their own Nash equilibrium, the combination of the chosen spots might not be a Nash equilibrium, and thus the players are not choosing the best possible locations.<br />
<br />
This shows that attempting to find a Nash equilibrium is not necessarily the best approach outside of two-player zero-sum games, and our goal should not be to find a specific game-theoretic solution. Instead, we should aim for an AI that, empirically, consistently defeats human opponents.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play used as input. The AI starts by playing randomly and gradually improves its strategy as it learns which actions lead to better outcomes. This self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy for the situations in which it finds itself.<br />
<br />
Pluribus uses forms of abstraction to make computations scalable. There are far too many decision points (a decision point in the game is a point where the player chooses to call, fold or check) in no-limit Texas hold 'em poker, so to reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. Pluribus only considers a few different bet sizes at any step; the exact number varies between 1 and 14 depending on the situation. The other form is information abstraction, which drastically reduces the complexity of the game. Here, decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and the revealed board cards) are bucketed together. Information abstraction is applied during offline self-play. During actual play against humans, however, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in.<br />
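<br />
To make the two kinds of abstraction concrete, here is a toy Python sketch. It is not Pluribus's actual abstraction: the pot fractions, the hand-strength score in [0, 1], the equal-width buckets, and the function names <code>candidate_bets</code> and <code>strength_bucket</code> are all hypothetical choices made for illustration.<br />

```python
def candidate_bets(pot, stack, pot_fractions):
    """Action abstraction (toy version): only consider bets equal to a few
    fixed fractions of the pot, capped at the chips the player has left."""
    sizes = {min(round(fraction * pot), stack) for fraction in pot_fractions}
    return sorted(size for size in sizes if size > 0)


def strength_bucket(hand_strength, num_buckets):
    """Information abstraction (toy version): map a hand-strength score in
    [0, 1] to one of num_buckets equal-width buckets. Decision points that
    fall in the same bucket are treated as identical."""
    index = int(hand_strength * num_buckets)
    return min(index, num_buckets - 1)  # a score of exactly 1.0 goes in the top bucket


# Hypothetical numbers: a pot of 100 chips, 150 chips left in the stack.
bets = candidate_bets(pot=100, stack=150, pot_fractions=[0.5, 1.0, 2.0])
print(bets)

# Two slightly different hands land in the same bucket and share one strategy.
print(strength_bucket(0.31, 10), strength_bucket(0.34, 10))
```

The design point is the same in both functions: many distinct situations are collapsed into a small number of representatives, so the strategy only has to be computed once per representative.<br />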
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker is conveniently represented as a game tree, a computationally favorable structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker (the cards held by the players at a given time) at random and designates one player as the traverser of the game tree. Once the hand is complete, the AI reviews each decision made by the traverser and investigates how much better or worse it would have done by choosing the other actions available at that point, including the hypothetical future decisions that would have followed those actions. A decision is assessed using counterfactual regret, which is the difference between what the traverser would have received for choosing an action and what the traverser actually received, in expectation, on the iteration. The counterfactual regret of each action is updated over the iterations as more scenarios and decision points are encountered.<br />
<br />
[Regret factor equation]<br />
<br />
Thus regret is a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR guarantees that, in all finite games, all counterfactual regrets grow sublinearly in the number of iterations. In two-player zero-sum games, this means the average strategy over all iterations converges to an approximate Nash equilibrium. To speed up learning, regret contributions at iteration T are assigned a weight of T, so that later iterations count for more than the early, near-random ones.<br />
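<br />
The regret update described above can be sketched on Rock-Paper-Scissors. This is plain regret matching with exact expected values in a one-shot game, not the MCCFR variant used for Pluribus (there is no game tree, no sampling, and no linear weighting); the payoff matrix, the starting regrets, and all names are assumptions made for illustration.<br />

```python
# Payoff to the player choosing the row action against the column action.
# Actions are indexed rock=0, paper=1, scissors=2. The game is symmetric,
# so the same matrix serves both players.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
NUM_ACTIONS = 3


def regret_matching(regrets):
    """Play each action in proportion to its positive accumulated regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0.0:
        return [1.0 / NUM_ACTIONS] * NUM_ACTIONS
    return [p / total for p in positive]


def action_values(opponent_strategy):
    """Expected payoff of each of our actions against the opponent's mix."""
    return [
        sum(PAYOFF[a][b] * opponent_strategy[b] for b in range(NUM_ACTIONS))
        for a in range(NUM_ACTIONS)
    ]


def self_play(iterations, initial_regrets):
    """Run regret matching for both players; return their average strategies."""
    regrets = [list(initial_regrets), [0.0] * NUM_ACTIONS]
    strategy_sums = [[0.0] * NUM_ACTIONS, [0.0] * NUM_ACTIONS]
    for _ in range(iterations):
        strategies = [regret_matching(r) for r in regrets]
        for player in (0, 1):
            values = action_values(strategies[1 - player])
            expected = sum(strategies[player][a] * values[a] for a in range(NUM_ACTIONS))
            for a in range(NUM_ACTIONS):
                # Regret: what the action would have earned minus what we earned.
                regrets[player][a] += values[a] - expected
                strategy_sums[player][a] += strategies[player][a]
    return [[s / iterations for s in sums] for sums in strategy_sums]


# Player 0 starts out biased towards rock (a hypothetical starting point).
average_strategies = self_play(10_000, initial_regrets=[1.0, 0.0, 0.0])
print(average_strategies)
```

The strategy played on any single iteration can swing between moves, but the average strategy of each player should approach the uniform equilibrium as the number of iterations grows, which is why CFR reports the average rather than the final strategy.<br />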
<br />
<br />
Pluribus plays according to this blueprint strategy in the first betting round, where the number of decision points is small enough. Later in the game, it searches the subgame it is in. Instead of assuming that all players play according to a single fixed strategy beyond the leaf nodes of that subgame, Pluribus assumes that each player may choose among k different continuation strategies, specialized to that player, when a leaf node is reached. This leads the searcher to a more balanced strategy, because an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus performs, it was evaluated against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. The players were anonymized: each was given an alias that remained consistent throughout all their games. The aliases let the players keep track of each other's tendencies and playing styles over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and one mbb is a thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, Pluribus won 48 mbb/game on average in the first format, with a standard error of 25 mbb/game. This is a very high win rate for 6-player Texas hold’em poker, especially against elite professionals. The p-value of Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average with a standard error of 15 mbb/game and was determined to be profitable with a p-value of 0.014.<br />
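<br />
As a rough sanity check on the significance claims, the one-tailed p-value can be approximated from a mean and its standard error. The Python sketch below assumes a normal approximation to the t-distribution, which is reasonable with thousands of hands. Because the win rates and standard errors quoted above are rounded, it is not expected to reproduce the reported p-values exactly. The function name is hypothetical.<br />

```python
import math


def one_tailed_p_value(mean, standard_error):
    """P(Z >= mean / standard_error) for a standard normal Z, i.e. the
    probability of seeing a win rate this high if the true win rate were 0."""
    z = mean / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Win rates and standard errors in mbb/game, as quoted in the text (rounded).
p_5h_1ai = one_tailed_p_value(48.0, 25.0)
p_1h_5ai = one_tailed_p_value(32.0, 15.0)

print(f"5H+1AI: p is approximately {p_5h_1ai:.3f}")
print(f"1H+5AI: p is approximately {p_1h_5ai:.3f}")
```

Under this approximation both values fall below 0.05, which agrees with the conclusion that Pluribus was profitable in both formats.<br />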
<br />
As Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on what optimal play looks like. Pluribus confirms the conventional wisdom that “limping” is not optimal: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting”, which is conventionally dismissed by players, was adopted by Pluribus much more often than by humans and appears to be profitable.<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45212Superhuman AI for Multiplayer Poker2020-11-17T23:09:43Z<p>Hhalim: /* Challenges of Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Over the past two decades, most superhuman AI systems for games have been limited to two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally switching to a different strategy.<br />
<br />
In poker specifically, AI had previously beaten top humans only in two-player settings. Poker is a long-standing challenge problem in AI and game theory because it captures the difficulty of hidden information so elegantly. Developing a superhuman AI for multiplayer poker was therefore the main remaining milestone in this field. It is also a hard one: no polynomial-time algorithm is known for finding a Nash equilibrium even in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, called Pluribus, is capable of defeating professional human players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms it uses are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite lacking strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable in the sense that it is guaranteed not to lose in expectation, regardless of what the opponent does.<br />
<br />
A shortcoming of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in opponents. For example, consider the game of Rock-Paper-Scissors, where the Nash equilibrium is to pick each option at random with equal probability. Against this strategy, every opponent strategy, including the best one, results in a tie in expectation. Therefore, in this example our player cannot lose in expectation, but cannot win in expectation either.<br />
<br />
Now let's try to combine the Nash equilibrium strategy with opponent exploitation. We can start with the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to exploit us in turn. For example, they may notice that we always play Rock and switch to always playing Paper.<br />
<br />
Approximating a Nash equilibrium is hard in theory, and in games with more than two players, existing algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games.<br />
<br />
Finding a Nash equilibrium in games with three or more players is a hard problem in itself. Even if we could efficiently compute a Nash equilibrium in games with more than two players, it is highly questionable whether playing that equilibrium strategy would be a good choice. Moreover, a game may have infinitely many equilibria, and if each player independently computes and plays their own part of a different equilibrium, the resulting combination of strategies might not be a Nash equilibrium at all.<br />
<br />
Consider the Lemonade Stand example in Figure 1 below. There are 4 players, and the goal of each player is to choose a spot on the ring that is as far as possible from every other player. This way, each lemonade stand covers as much selling region as possible and generates maximum revenue. The left circle shows three different Nash equilibria, distinguished by colour, each of which would benefit everyone. The right circle illustrates what happens if each player independently computes their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 800px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player calculates their own Nash equilibrium, the combination of the chosen spots might not be a Nash equilibrium, and thus the players are not choosing the best possible locations.<br />
<br />
This shows that attempting to find a Nash equilibrium is not necessarily the best approach outside of two-player zero-sum games, and our goal should not be to find a specific game-theoretic solution. Instead, we should aim for an AI that, empirically, consistently defeats human opponents.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play used as input. The AI starts by playing randomly and gradually improves its strategy as it learns which actions lead to better outcomes. This self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy for the situations in which it finds itself.<br />
<br />
Pluribus uses forms of abstraction to make computations scalable. There are far too many decision points (a decision point in the game is a point where the player chooses to call, fold or check) in no-limit Texas hold 'em poker, so to reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. Pluribus only considers a few different bet sizes at any step; the exact number varies between 1 and 14 depending on the situation. The other form is information abstraction, which drastically reduces the complexity of the game. Here, decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and the revealed board cards) are bucketed together. Information abstraction is applied during offline self-play. During actual play against humans, however, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in.<br />
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker is conveniently represented as a game tree, a computationally favorable structure in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker (the cards held by the players at a given time) at random and designates one player as the traverser of the game tree. Once the hand is complete, the AI reviews each decision made by the traverser and investigates how much better or worse it would have done by choosing the other actions available at that point, including the hypothetical future decisions that would have followed those actions. A decision is assessed using counterfactual regret, which is the difference between what the traverser would have received for choosing an action and what the traverser actually received, in expectation, on the iteration. The counterfactual regret of each action is updated over the iterations as more scenarios and decision points are encountered.<br />
<br />
[Regret factor equation]<br />
<br />
Thus regret is a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR guarantees that, in all finite games, all counterfactual regrets grow sublinearly in the number of iterations. In two-player zero-sum games, this means the average strategy over all iterations converges to an approximate Nash equilibrium. To speed up learning, regret contributions at iteration T are assigned a weight of T, so that later iterations count for more than the early, near-random ones.<br />
<br />
<br />
Pluribus plays according to this blueprint strategy in the first betting round, where the number of decision points is small enough. Later in the game, it searches the subgame it is in. Instead of assuming that all players play according to a single fixed strategy beyond the leaf nodes of that subgame, Pluribus assumes that each player may choose among k different continuation strategies, specialized to that player, when a leaf node is reached. This leads the searcher to a more balanced strategy, because an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus performs, it was evaluated against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. The players were anonymized: each was given an alias that remained consistent throughout all their games. The aliases let the players keep track of each other's tendencies and playing styles over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and one mbb is a thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
Applying AIVAT, it was found that in the first format Pluribus won 48 mbb/game on average with a standard error of 25 mbb/game. This is considered a very high win rate for six-player no-limit Texas hold’em poker, especially against elite professionals. The p-value for Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average with a standard error of 15 mbb/game and was determined to be profitable with a p-value of 0.014.<br />
<br />
As Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an independent perspective on what optimal play looks like. Pluribus confirms the conventional wisdom that “limping” is suboptimal: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting”, which is conventionally dismissed by players, was used by Pluribus much more often than by humans, which suggests that it can be profitable.<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45211Superhuman AI for Multiplayer Poker2020-11-17T23:09:35Z<p>Hhalim: /* Challenges of Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most superhuman AIs have only been able to beat human players in two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their outcome by deviating to a different strategy. In two-player zero-sum games, playing a Nash equilibrium strategy guarantees that a player does not lose in expectation, regardless of what the opponent does.<br />
<br />
More specifically, in the game of poker, AI models had so far only beaten human players in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges of hidden information so elegantly. Developing a superhuman AI for multiplayer poker is therefore the remaining great milestone in this field. It is also a hard one: no polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, called Pluribus, is capable of defeating professional human poker players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms that are used are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. However, Pluribus uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that, despite lacking strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AIs have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable, since it guarantees that the player does not lose in expectation regardless of what the opponent does.<br />
<br />
The shortcoming of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in their opponents. For example, consider the game of Rock-Paper-Scissors, where the Nash equilibrium is to pick each option at random with equal probability. Against this strategy, the best any opponent can do is tie in expectation, but the same holds in the other direction: whatever the opponent plays, the equilibrium player only ties in expectation. Therefore, in this example, our player cannot win in expectation.<br />
<br />
Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, by shifting away from the Nash equilibrium strategy, it opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.<br />
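<br />
The two Rock-Paper-Scissors arguments above can be checked numerically. The following is a minimal sketch, not taken from the paper: the payoff matrix uses the usual +1/0/-1 convention, and the names <code>PAYOFF</code> and <code>expected_payoff</code> are hypothetical helpers introduced here for illustration.<br />

```python
# Payoff to "our" player in Rock-Paper-Scissors: +1 for a win, 0 for a tie, -1 for a loss.
# Rows are our action, columns are the opponent's action, in the order Rock, Paper, Scissors.
PAYOFF = [
    [0, -1, 1],   # we play Rock
    [1, 0, -1],   # we play Paper
    [-1, 1, 0],   # we play Scissors
]


def expected_payoff(our_strategy, opp_strategy):
    """Expected payoff when both players sample actions from mixed strategies."""
    return sum(
        our_strategy[i] * opp_strategy[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


uniform = [1 / 3, 1 / 3, 1 / 3]       # the Nash equilibrium strategy
always_rock = [1.0, 0.0, 0.0]
always_paper = [0.0, 1.0, 0.0]
always_scissors = [0.0, 0.0, 1.0]

# The equilibrium strategy only ties in expectation, even against a very weak opponent.
print(expected_payoff(uniform, always_scissors))
# Exploiting the opponent wins every round ...
print(expected_payoff(always_rock, always_scissors))
# ... but the exploiting strategy is itself exploitable.
print(expected_payoff(always_rock, always_paper))
```

The uniform strategy is safe but earns nothing extra from a predictable opponent, while the exploitative strategy earns the maximum only as long as the opponent does not adapt.<br />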
<br />
Even approximating a Nash equilibrium is hard in theory, and in games with more than two players, existing algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games.<br />
<br />
Finding a Nash equilibrium in games with three or more players is a problem in itself. Even if we could efficiently compute a Nash equilibrium in games with more than two players, it is questionable whether playing that equilibrium strategy would be a good choice. If each player independently computes and plays their own Nash equilibrium, the resulting combination of strategies might not be a Nash equilibrium at all, since a game can have many (even infinitely many) equilibria.<br />
<br />
Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colours which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 200px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibrium, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location.<br />
<br />
This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play used as input. The AI starts by playing randomly and then gradually improves its strategy. This self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy for the situations in which it finds itself during the game.<br />
<br />
Pluribus uses forms of abstraction to make computations scalable. There are far too many decision points (a decision point in the game is a point where the player chooses to call, fold or check) in no-limit Texas hold 'em poker, so to reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. Pluribus only considers a few different bet sizes at any decision point; the exact number varies between 1 and 14 depending on the situation. Information abstraction, the other form, drastically reduces the complexity of the game by bucketing together decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and the revealed board cards). During actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in. Information abstraction is also applied during offline self-play.<br />
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker can be represented as a game tree, in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker using the players’ current strategies, which are initially random, and designates one player as the traverser of the game tree. Once the hand is completed, the AI reviews each decision made by the traverser and investigates how much better or worse it would have done by choosing the other available actions at that point, also taking into account the hypothetical future decisions that would have followed each of those actions. A decision is assessed using counterfactual regret, which is the difference between what the traverser would have received for choosing an action and what the traverser actually received, in expectation, on the iteration. The counterfactual regret of a decision is updated over the iterations as more scenarios or decision points are encountered.<br />
<br />
[Regret factor equation]<br />
<br />
Thus regret is a numeric value: a positive regret for an action indicates that the traverser would have done better by choosing it, a negative regret indicates that the traverser would have done worse, and zero regret indicates indifference. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus also uses Linear CFR in the early iterations, which assigns a weight of T to the regret contributions of iteration T.<br />
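<br />
The update rule described above, in which actions with higher regret are chosen with higher probability, is commonly implemented as regret matching. The sketch below is an illustration under stated assumptions, not Pluribus's implementation: it applies plain regret matching to a one-shot Rock-Paper-Scissors game against a fixed opponent rather than running MCCFR on a poker game tree, and the names <code>regret_matching</code> and <code>train_against</code> are hypothetical.<br />

```python
# Payoff to the learner: rows are its action, columns the opponent's (Rock, Paper, Scissors).
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def regret_matching(cumulative_regrets):
    """Turn cumulative regrets into a strategy: probability proportional to positive regret."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No action has positive regret yet, so fall back to the uniform strategy.
    n = len(cumulative_regrets)
    return [1.0 / n] * n


def train_against(opp_strategy, iterations):
    """Accumulate regrets against a fixed opponent strategy and return the final strategy."""
    regrets = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        strategy = regret_matching(regrets)
        # Expected value of each pure action against the opponent's mixed strategy.
        action_values = [
            sum(opp_strategy[j] * PAYOFF[i][j] for j in range(3)) for i in range(3)
        ]
        # Expected value of the strategy that was actually played on this iteration.
        strategy_value = sum(strategy[i] * action_values[i] for i in range(3))
        # Regret of an action: what it would have earned minus what was earned.
        for i in range(3):
            regrets[i] += action_values[i] - strategy_value
    return regret_matching(regrets)


# Against an opponent who always plays Scissors, all probability moves to Rock.
print(train_against([0.0, 0.0, 1.0], iterations=10))
```

In full CFR the same update is applied at every decision point of the game tree, and it is the average strategy across iterations, not the final one, that carries the convergence guarantee in two-player zero-sum games.<br />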
<br />
<br />
Pluribus plays according to this blueprint strategy only in the first betting round, where the number of decision points is small enough. In later rounds, Pluribus searches in real time within subgames. Instead of assuming that all players play according to a single fixed strategy beyond the leaf nodes of a subgame, Pluribus assumes that each player may choose among k different continuation strategies, specialized to each player, when a leaf node is reached. This results in the searcher finding a more balanced strategy, because an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus performs, it was evaluated against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. Players were anonymized by giving each of them an alias that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and playing styles of each player over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and a milli big blind is one thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at a 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
Applying AIVAT, it was found that in the first format Pluribus won 48 mbb/game on average with a standard error of 25 mbb/game. This is considered a very high win rate for six-player no-limit Texas hold’em poker, especially against elite professionals. The p-value for Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average with a standard error of 15 mbb/game and was determined to be profitable with a p-value of 0.014.<br />
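<br />
As a rough sanity check on these figures, a one-tailed p-value can be approximated from the reported mean and standard error. This is only a sketch: it assumes a normal approximation to the t-distribution (reasonable for 10,000 hands) and uses the rounded values quoted above, so the results will be close to, but not exactly equal to, the reported p-values. The function name <code>one_tailed_p_value</code> is hypothetical.<br />

```python
import math


def one_tailed_p_value(mean, standard_error):
    """Approximate P(Z >= mean / standard_error) for a standard normal Z.

    Tests the null hypothesis "true win rate <= 0" against "true win rate > 0".
    """
    z = mean / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))


# 5H+1AI format: 48 mbb/game with a standard error of 25 mbb/game.
p_first = one_tailed_p_value(48, 25)
# 1H+5AI format: 32 mbb/game with a standard error of 15 mbb/game.
p_second = one_tailed_p_value(32, 15)

print(f"5H+1AI: p is about {p_first:.3f}")
print(f"1H+5AI: p is about {p_second:.3f}")
```

Both approximate p-values fall below 0.05, which agrees with the conclusion that Pluribus was profitable in both formats at the 95% confidence level.<br />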
<br />
As Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an independent perspective on what optimal play looks like. Pluribus confirms the conventional wisdom that “limping” is suboptimal: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting”, which is conventionally dismissed by players, was used by Pluribus much more often than by humans, which suggests that it can be profitable.<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45210Superhuman AI for Multiplayer Poker2020-11-17T23:09:17Z<p>Hhalim: /* Challenges of Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most superhuman AIs have only been able to beat human players in two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their outcome by deviating to a different strategy. In two-player zero-sum games, playing a Nash equilibrium strategy guarantees that a player does not lose in expectation, regardless of what the opponent does.<br />
<br />
More specifically, in the game of poker, AI models had so far only beaten human players in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges of hidden information so elegantly. Developing a superhuman AI for multiplayer poker is therefore the remaining great milestone in this field. It is also a hard one: no polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, called Pluribus, is capable of defeating professional human poker players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms that are used are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. However, Pluribus uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that, despite lacking strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AIs have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable, since it guarantees that the player does not lose in expectation regardless of what the opponent does.<br />
<br />
The shortcoming of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in their opponents. For example, consider the game of Rock-Paper-Scissors, where the Nash equilibrium is to pick each option at random with equal probability. Against this strategy, the best any opponent can do is tie in expectation, but the same holds in the other direction: whatever the opponent plays, the equilibrium player only ties in expectation. Therefore, in this example, our player cannot win in expectation.<br />
<br />
Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, by shifting away from the Nash equilibrium strategy, it opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.<br />
<br />
Even approximating a Nash equilibrium is hard in theory, and in games with more than two players, existing algorithms can only handle games with a handful of possible strategies per player. Existing techniques for exploiting an opponent require far too many samples and are not competitive outside of small games.<br />
<br />
Finding a Nash equilibrium in games with three or more players is a problem in itself. Even if we could efficiently compute a Nash equilibrium in games with more than two players, it is questionable whether playing that equilibrium strategy would be a good choice. If each player independently computes and plays their own Nash equilibrium, the resulting combination of strategies might not be a Nash equilibrium at all, since a game can have many (even infinitely many) equilibria.<br />
<br />
Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colours which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibrium, their own version of the equilibrium might not be a Nash equilibrium and thus they are not choosing the best possible location.<br />
<br />
This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play used as input. The AI starts by playing randomly and then gradually improves its strategy. This self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy for the situations in which it finds itself during the game.<br />
<br />
Pluribus uses forms of abstraction to make computations scalable. There are far too many decision points (a decision point in the game is a point where the player chooses to call, fold or check) in no-limit Texas hold 'em poker, so to reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. Pluribus only considers a few different bet sizes at any decision point; the exact number varies between 1 and 14 depending on the situation. Information abstraction, the other form, drastically reduces the complexity of the game by bucketing together decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and the revealed board cards). During actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in. Information abstraction is also applied during offline self-play.<br />
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker can be represented as a game tree, in which each node represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker using the players’ current strategies, which are initially random, and designates one player as the traverser of the game tree. Once the hand is completed, the AI reviews each decision made by the traverser and investigates how much better or worse it would have done by choosing the other available actions at that point, also taking into account the hypothetical future decisions that would have followed each of those actions. A decision is assessed using counterfactual regret, which is the difference between what the traverser would have received for choosing an action and what the traverser actually received, in expectation, on the iteration. The counterfactual regret of a decision is updated over the iterations as more scenarios or decision points are encountered.<br />
<br />
[Regret factor equation]<br />
<br />
Thus regret is a numeric value: a positive regret for an action indicates that the traverser would have done better by choosing it, a negative regret indicates that the traverser would have done worse, and zero regret indicates indifference. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus also uses Linear CFR in the early iterations, which assigns a weight of T to the regret contributions of iteration T.<br />
<br />
<br />
Pluribus plays according to this blueprint strategy only in the first betting round, where the number of decision points is small enough. In later rounds, Pluribus searches in real time within subgames. Instead of assuming that all players play according to a single fixed strategy beyond the leaf nodes of a subgame, Pluribus assumes that each player may choose among k different continuation strategies, specialized to each player, when a leaf node is reached. This results in the searcher finding a more balanced strategy, because an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus performs, it was evaluated against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days in the 5H+1AI format. Players were anonymized by giving each of them an alias that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and playing styles of each player over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and a milli big blind is one thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at a 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
Applying AIVAT, it was found that in the first format Pluribus won 48 mbb/game on average with a standard error of 25 mbb/game. This is considered a very high win rate for six-player no-limit Texas hold’em poker, especially against elite professionals. The p-value for Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average with a standard error of 15 mbb/game and was determined to be profitable with a p-value of 0.014.<br />
<br />
As Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an independent perspective on what optimal play looks like. Pluribus confirms the conventional wisdom that “limping” is suboptimal: it initially experimented with limping but eliminated it from its strategy over the course of self-play. On the other hand, “donk betting”, which is conventionally dismissed by players, was used by Pluribus much more often than by humans, which suggests that it can be profitable.<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Lemonade_Example.png&diff=45209File:Lemonade Example.png2020-11-17T23:06:42Z<p>Hhalim: </p>
<hr />
<div></div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45208Superhuman AI for Multiplayer Poker2020-11-17T22:58:42Z<p>Hhalim: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most superhuman AIs have only been able to beat human players in two-player zero-sum games. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their outcome by deviating to a different strategy. In two-player zero-sum games, playing a Nash equilibrium strategy guarantees that a player does not lose in expectation, regardless of what the opponent does.<br />
<br />
More specifically, in the game of poker, AI models had so far only beaten human players in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges of hidden information so elegantly. Developing a superhuman AI for multiplayer poker is therefore the remaining great milestone in this field. It is also a hard one: no polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, called Pluribus, is capable of defeating professional human poker players in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. The algorithms that are used are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. However, Pluribus uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that, despite lacking strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AIs have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common approach in those games is to approximate a Nash equilibrium strategy. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their outcome by deviating to a different strategy. A Nash equilibrium has been proven to exist in every finite game, so the challenge is to find one. To summarize, in two-player zero-sum games a Nash equilibrium strategy is unbeatable, since it guarantees that the player does not lose in expectation regardless of what the opponent does.<br />
<br />
The shortcoming of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in their opponents. For example, consider the game of Rock-Paper-Scissors, where the Nash equilibrium is to pick each option at random with equal probability. Against this strategy, the best any opponent can do is tie in expectation, but the same holds in the other direction: whatever the opponent plays, the equilibrium player only ties in expectation. Therefore, in this example, our player cannot win in expectation.<br />
<br />
Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, by shifting away from the Nash equilibrium strategy, it opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play used as input. The AI starts by playing randomly and then gradually improves its strategy. This self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy for the situations in which it finds itself during the game.<br />
<br />
Pluribus uses forms of abstraction to make computations scalable. There are far too many decision points in no-limit Texas hold 'em poker (a decision point is a point in the game where the player chooses to call, fold or check). To reduce the complexity of the game, we eliminate some actions from consideration and bucket similar decision points together, in a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. Pluribus only considers a few different bet sizes at any step; the exact number of bet sizes it considers varies between 1 and 14 depending on the situation. Information abstraction drastically reduces the complexity of the game by bucketing together decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and the revealed board cards). During actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in. Information abstraction is also applied during offline self-play.<br />
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker is naturally represented as a game tree, which is computationally convenient. Each node of the tree represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
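
As a rough illustration of the game tree described above, the sketch below builds a toy tree with one chance node, two decision nodes and three terminal nodes, and computes its expected value. Everything here is hypothetical: the <code>Node</code> class, the edge labels and the payoffs are invented for illustration and are not Pluribus's data structures. For brevity the acting player's strategy is stored directly on the decision node.

```python
# Toy game tree; all names, probabilities and payoffs are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    kind: str                     # "decision", "chance", or "terminal"
    player: Optional[int] = None  # acting player at a decision node
    payoff: float = 0.0           # payoff to player 0 at a terminal node
    # Outgoing edges: action (or chance outcome) label -> child node.
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Probability of each outgoing edge: chance probabilities at a chance
    # node, the acting player's (fixed) strategy at a decision node.
    probs: Dict[str, float] = field(default_factory=dict)


def expected_value(node: Node) -> float:
    """Expected payoff to player 0 when every edge is taken with its probability."""
    if node.kind == "terminal":
        return node.payoff
    return sum(
        node.probs[label] * expected_value(child)
        for label, child in node.children.items()
    )


# With a strong hand, player 0 always bets and wins 2.
strong = Node(
    kind="decision", player=0,
    children={"bet": Node(kind="terminal", payoff=2.0)},
    probs={"bet": 1.0},
)

# With a weak hand, player 0 bluffs 25% of the time (losing 2) and folds otherwise (losing 1).
weak = Node(
    kind="decision", player=0,
    children={
        "bet": Node(kind="terminal", payoff=-2.0),
        "fold": Node(kind="terminal", payoff=-1.0),
    },
    probs={"bet": 0.25, "fold": 0.75},
)

# The root is a chance event: the deal gives a strong or a weak hand equally often.
root = Node(
    kind="chance",
    children={"strong": strong, "weak": weak},
    probs={"strong": 0.5, "weak": 0.5},
)

ev = expected_value(root)  # 0.5 * 2 + 0.5 * (0.25 * -2 + 0.75 * -1) = 0.375
```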
<br />
At the start of each iteration, MCCFR simulates a hand of poker with randomly dealt cards (the cards held by each player) and designates one player as the traverser of the game tree. Once the hand is completed, the AI reviews each decision made by the traverser and investigates how valuable that decision was, by comparing it with the other actions available to the traverser at that point and with the hypothetical future decisions that would have followed those other actions. A decision is assessed using counterfactual regret, which is the difference between what the traverser would have received for choosing an action and what the traverser actually received, in expectation, on the iteration. The counterfactual regret for a decision is updated over the iterations as more scenarios or decision points are encountered.<br />
<br />
[Regret factor equation]<br />
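
The verbal definition of counterfactual regret given above can be written out as follows. This is a sketch in commonly used CFR notation, and the symbols are assumptions introduced here rather than taken from this summary: <math>I</math> is a decision point of the traverser, <math>a</math> an action available there, <math>\sigma^t</math> the strategy on iteration <math>t</math>, and <math>v^t(I, a)</math> the traverser's expected value for choosing <math>a</math> at <math>I</math>.

```latex
% Regret of action a at decision point I on iteration t:
% what the traverser would have received for choosing a,
% minus what it received in expectation under its current strategy.
r^t(I, a) = v^t(I, a) - \sum_{a'} \sigma^t(I, a') \, v^t(I, a')

% Regret accumulated over the first T iterations.
R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)

% Regret matching: actions with higher positive regret are chosen with
% higher probability (any fixed strategy is used if no regret is positive).
\sigma^{T+1}(I, a) = \frac{\max\{R^T(I, a),\, 0\}}{\sum_{a'} \max\{R^T(I, a'),\, 0\}}
```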
<br />
Thus regret is a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. The Linear CFR variant additionally assigns a weight of T to the regret contributions at iteration T, which reduces the influence of early iterations.<br />
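
The update rule described above can be illustrated on Rock-Paper-Scissors. The sketch below uses plain regret matching with exact expected values on a one-shot matrix game; it is not MCCFR and not Pluribus's training code, and all function names are assumptions made for illustration. Both players start from an arbitrary unbalanced strategy (always Rock), and the average strategy over all iterations moves toward the uniform Nash equilibrium.

```python
# Plain regret matching in self-play on Rock-Paper-Scissors.
# Illustrative sketch only: not MCCFR, and all names are assumptions.
# PAYOFF[my_move][their_move]; move order 0 = Rock, 1 = Paper, 2 = Scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def regret_matching(cumulative_regret):
    """Choose each action with probability proportional to its positive regret."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No positive regret yet: any strategy is allowed; start with always Rock.
    return [1.0, 0.0, 0.0]


def action_values(opponent_strategy):
    """Expected payoff of each of my moves against the opponent's mixed strategy."""
    return [
        sum(PAYOFF[a][b] * opponent_strategy[b] for b in range(3))
        for a in range(3)
    ]


def self_play(iterations):
    regrets = [[0.0] * 3, [0.0] * 3]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        # Both players pick their strategy from their own accumulated regrets.
        strategies = [regret_matching(r) for r in regrets]
        for p in (0, 1):
            values = action_values(strategies[1 - p])
            achieved = sum(strategies[p][a] * values[a] for a in range(3))
            for a in range(3):
                # Regret: value of always playing a, minus the value actually achieved.
                regrets[p][a] += values[a] - achieved
                strategy_sums[p][a] += strategies[p][a]
    # The average strategy, not the final one, is what approaches equilibrium.
    return [[s / iterations for s in sums] for sums in strategy_sums]


average = self_play(50000)
```

Note that the strategy on any single iteration keeps cycling between moves; only the average over iterations settles near one third for each move, which is why CFR reports the average strategy.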
<br />
<br />
Pluribus plays according to this blueprint strategy in the first betting round, where the number of decision points is small enough. Later in the game, when searching a subgame, Pluribus does not assume that all players play according to a single strategy. Instead, it considers that each player may choose between k different continuation strategies, specialized to that player, when a leaf node is reached. This results in the searcher finding a more balanced strategy, because choosing an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).<br />
<br />
== Experimental Results ==<br />
Pluribus was tested against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. In this format, 10,000 hands of poker were played over 12 days. The players were anonymized with aliases that remained consistent throughout all their games, which helped the players keep track of the tendencies and styles of play of each player over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), which is the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and a milli big blind is one thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, Pluribus won 48 mbb/game on average in the first format, with a standard error of 25 mbb/game. This is well above what is typically expected as a win rate in 6-player Texas hold’em poker. The p-value for Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average with a standard error of 15 mbb/game, and was determined to be profitable with a p-value of 0.014.<br />
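
As a rough sanity check of the reported figures, the sketch below recomputes one-tailed p-values from the rounded means and standard errors quoted above, using a normal approximation to the t-test. This is an assumption-laden check rather than the paper's procedure: the helper name <code>one_tailed_p</code> is invented here, and because the inputs are rounded the results only approximately match the reported p-values (about 0.027 and 0.016 here, against the reported 0.028 and 0.014).

```python
# Sanity check with a normal approximation; not the paper's exact computation.
import math


def one_tailed_p(mean, standard_error):
    """P(Z >= mean / standard_error) for a standard normal Z."""
    z = mean / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))


# Win rates and standard errors in mbb/game, as quoted in the text.
p_5h_1ai = one_tailed_p(48, 25)   # roughly 0.027
p_1h_5ai = one_tailed_p(32, 15)   # roughly 0.016

# 1 mbb is one thousandth of a big blind, so 48 mbb/game
# corresponds to 4.8 big blinds per 100 hands.
bb_per_100_hands = 48 / 1000 * 100
```

Both values fall below 0.05, which is consistent with the conclusion that Pluribus was profitable in both formats.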
<br />
As Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained. The conventional wisdom that “limping” is not optimal in poker is confirmed by Pluribus: it initially experimented with limping but eliminated it from its strategy over its games of self-play. On the other hand, “donk betting”, which is conventionally dismissed by players, was used by Pluribus much more often than by humans, and was found to be profitable.<br />
<br />
</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=45202Superhuman AI for Multiplayer Poker2020-11-17T22:27:26Z<p>Hhalim: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most superhuman AI systems could only beat human players in two-player zero-sum games. The most common approach these systems use is to approximate a Nash equilibrium: a set of strategies, one per player, in which no player can improve their expected outcome by unilaterally changing their own strategy.<br />
<br />
More specifically, in poker the only AI models that could beat human players did so in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges of hidden information so elegantly. Developing a superhuman AI for multiplayer poker is therefore the remaining great milestone in this field. It is hard in part because no polynomial-time algorithm is known for finding a Nash equilibrium in two-player non-zero-sum games, and the existence of one would have surprising implications in computational complexity theory.<br />
<br />
In this paper, the AI, which we call Pluribus, is capable of defeating professional human players in six-player Texas hold'em poker, the most commonly played poker format in the world. The algorithms used are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, Pluribus plays a strong strategy that consistently defeats elite human professionals. This shows that, despite the lack of strong theoretical guarantees on performance, these algorithms can produce superhuman strategies in a wider class of games.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI systems have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. These systems typically work by approximating a Nash equilibrium, in which no player can gain by unilaterally changing their strategy. A Nash equilibrium has been proven to exist in every finite game; the challenge is to find one. In two-player zero-sum games, a Nash equilibrium strategy is unbeatable, since it is guaranteed not to lose in expectation regardless of what the opponent does.<br />
<br />
A limitation of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in their opponents. For example, in Rock-Paper-Scissors the Nash equilibrium is to pick each option at random with equal probability. Against this strategy every opponent strategy ties in expectation, so our player cannot lose in expectation but cannot win in expectation either, even against a weak opponent.<br />
<br />
Now let's try to combine the Nash equilibrium strategy with opponent exploitation. We can start with the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to exploit us in turn. For example, they may notice that we always play Rock and switch to always playing Paper.<br />
<br />
== Theoretical Analysis ==<br />
<br />
The core of Pluribus’s strategy was computed through self-play, in which the AI plays against copies of itself, without any data from human or prior AI play used as input. The AI starts by playing randomly and then gradually improves its strategy. This self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. During actual play against opponents, Pluribus improves upon the blueprint strategy by searching in real time for a better strategy in the situations in which it finds itself during the game.<br />
<br />
Pluribus uses forms of abstraction to make computations scalable. There are far too many decision points in no-limit Texas hold 'em poker (a decision point is a point in the game where the player chooses to call, fold or check). To reduce the complexity of the game, we eliminate some actions from consideration and bucket similar decision points together, in a process called abstraction. After abstraction, the bucketed decision points are treated as identical.<br />
<br />
We use two kinds of abstraction in Pluribus: action abstraction and information abstraction. Action abstraction reduces the number of different actions the AI needs to consider. Pluribus only considers a few different bet sizes at any step; the exact number of bet sizes it considers varies between 1 and 14 depending on the situation. Information abstraction drastically reduces the complexity of the game by bucketing together decision points that are similar in terms of what information has been revealed (in poker, the player’s cards and the revealed board cards). During actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round that it is actually in. Information abstraction is also applied during offline self-play.<br />
<br />
'''Blueprint Strategy and CFR'''<br />
<br />
The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR). CFR is commonly used in AI for imperfect-information games: the AI is trained by repeatedly playing against itself and gradually improves by beating earlier versions of itself. Poker is naturally represented as a game tree, which is computationally convenient. Each node of the tree represents a player’s decision, a chance event, or a terminal outcome, and each edge represents an action taken.<br />
<br />
[PICTURE OF GAME TREE]<br />
<br />
At the start of each iteration, MCCFR simulates a hand of poker with randomly dealt cards (the cards held by each player) and designates one player as the traverser of the game tree. Once the hand is completed, the AI reviews each decision made by the traverser and investigates how valuable that decision was, by comparing it with the other actions available to the traverser at that point and with the hypothetical future decisions that would have followed those other actions. A decision is assessed using counterfactual regret, which is the difference between what the traverser would have received for choosing an action and what the traverser actually received, in expectation, on the iteration. The counterfactual regret for a decision is updated over the iterations as more scenarios or decision points are encountered.<br />
<br />
[Regret factor equation]<br />
<br />
Thus regret is a numeric value: a positive regret indicates that you regret your decision, a negative regret indicates that you are happy with your decision, and zero regret indicates that you are indifferent. At the end of each iteration, the traverser’s strategy is updated so that actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations, and in two-player zero-sum games the average strategy over all iterations converges to an approximate Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. The Linear CFR variant additionally assigns a weight of T to the regret contributions at iteration T, which reduces the influence of early iterations.<br />
<br />
<br />
Pluribus plays according to this blueprint strategy in the first betting round, where the number of decision points is small enough. Later in the game, when searching a subgame, Pluribus does not assume that all players play according to a single strategy. Instead, it considers that each player may choose between k different continuation strategies, specialized to that player, when a leaf node is reached. This results in the searcher finding a more balanced strategy, because choosing an unbalanced strategy (e.g., always playing Rock in Rock-Paper-Scissors) would be punished by an opponent shifting to one of the other continuation strategies (e.g., always playing Paper).<br />
<br />
== Experimental Results ==<br />
Pluribus was tested against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who had each won more than $1M playing professionally, and they were given cash incentives to play their best. In this format, 10,000 hands of poker were played over 12 days. The players were anonymized with aliases that remained consistent throughout all their games, which helped the players keep track of the tendencies and styles of play of each player over the 10,000 hands.<br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format.<br />
Performance was measured in milli big blinds per game (mbb/game), which is the standard measure in the AI field. A big blind is the initial amount of money the second player has to put in the pot, and a milli big blind is one thousandth of a big blind. Additionally, AIVAT was used as a variance reduction technique to control for luck in the games, and one-tailed t-tests at the 95% confidence level were used to check whether Pluribus was profitable.<br />
<br />
After applying AIVAT, Pluribus won 48 mbb/game on average in the first format, with a standard error of 25 mbb/game. This is well above what is typically expected as a win rate in 6-player Texas hold’em poker. The p-value for Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average with a standard error of 15 mbb/game, and was determined to be profitable with a p-value of 0.014.<br />
<br />
As Pluribus’s strategy was developed without any human data and was trained by self-play only, it offers an unbiased and different perspective on how optimal play can be attained. The conventional wisdom that “limping” is not optimal in poker is confirmed by Pluribus: it initially experimented with limping but eliminated it from its strategy over its games of self-play. On the other hand, “donk betting”, which is conventionally dismissed by players, was used by Pluribus much more often than by humans, and was found to be profitable.<br />
<br />
</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=44109Superhuman AI for Multiplayer Poker2020-11-14T17:26:06Z<p>Hhalim: /* Challenges of Multiplayer Games */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most of the superhuman AI that were built can only beat human players in two-player zero-sum games. More specifically, in the game of poker we only have AI models that can beat them in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field. In this paper, the AI whom we call Pluribus, is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common strategy that the AI use to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is the best possible choice that a player can take, regardless of what their opponent is going to choose. Nash equilibrium has been proven to always exists in all finite games, and the challenge is to find the equilibrium. To summarize, Nash equilibrium is the best possible strategy and is unbeatable in two-player zero-sum games, since it guarantees to not lose in expectation regardless what the opponent is doing.<br />
<br />
The insufficiency with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example our player cannot win in expectation.<br />
<br />
Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, by shifting away from the Nash equilibrium strategy, it opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=44107Superhuman AI for Multiplayer Poker2020-11-14T17:17:42Z<p>Hhalim: /* Layer for Processing Missing Data */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most of the superhuman AI that were built can only beat human players in two-player zero-sum games. More specifically, in the game of poker we only have AI models that can beat them in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field. In this paper, the AI whom we call Pluribus, is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common strategy that the AI use to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is the best possible choice that a player can take, regardless of what their opponent is going to choose. Nash equilibrium has been proven to always exists in all finite games, and the challenge is to find the equilibrium. To summarize, Nash equilibrium is the best possible strategy and is unbeatable in two-player zero-sum games, since it guarantees to not lose in expectation regardless what the opponent is doing.<br />
<br />
The problem with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example our player cannot win in expectation.<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=44106Superhuman AI for Multiplayer Poker2020-11-14T17:16:22Z<p>Hhalim: /* Previous Work */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most of the superhuman AI that were built can only beat human players in two-player zero-sum games. More specifically, in the game of poker we only have AI models that can beat them in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field. In this paper, the AI whom we call Pluribus, is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world.<br />
<br />
== Challenges of Multiplayer Games ==<br />
<br />
Many AI have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common strategy that the AI use to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is the best possible choice that a player can take, regardless of what their opponent is going to choose. Nash equilibrium has been proven to always exists in all finite games, and the challenge is to find the equilibrium. To summarize, Nash equilibrium is the best possible strategy and is unbeatable in two-player zero-sum games, since it guarantees to not lose in expectation regardless what the opponent is doing.<br />
<br />
The problem with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example our player cannot win in expectation.<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=44105Superhuman AI for Multiplayer Poker2020-11-14T17:08:03Z<p>Hhalim: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
In the past two decades, most of the superhuman AI that were built can only beat human players in two-player zero-sum games. More specifically, in the game of poker we only have AI models that can beat them in two-player settings. Poker is a great challenge in AI and game theory because it captures the challenges in hidden information so elegantly. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field. In this paper, the AI whom we call Pluribus, is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world.<br />
<br />
== Previous Work ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=44103Superhuman AI for Multiplayer Poker2020-11-14T16:58:31Z<p>Hhalim: /* Related Work */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
For many years, most of the superhuman AI that were built can only beat human players in two-player zero-sum games. These games include checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common strategy that the AI use to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is the best possible choice that a player can take, regardless of what their opponent is going to choose. Nash equilibrium has been proven to always exists in all finite games, and the challenge is to find the equilibrium. To summarize, Nash equilibrium is the best possible strategy and is unbeatable in two-player zero-sum games, since it guarantees to not lose in expectation regardless what the opponent is doing.<br />
<br />
This means that developing a superhuman AI in a multiplayer setting is the remaining great milestone in this field. In this paper, the AI whom we call Pluribus, is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world.<br />
<br />
== Previous Work ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=44102Superhuman AI for Multiplayer Poker2020-11-14T16:56:34Z<p>Hhalim: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
For many years, most of the superhuman AI that were built can only beat human players in two-player zero-sum games. These games include checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common strategy that the AI use to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is the best possible choice that a player can take, regardless of what their opponent is going to choose. Nash equilibrium has been proven to always exists in all finite games, and the challenge is to find the equilibrium. To summarize, Nash equilibrium is the best possible strategy and is unbeatable in two-player zero-sum games, since it guarantees to not lose in expectation regardless what the opponent is doing.<br />
<br />
This means that developing a superhuman AI in a multiplayer setting is the remaining great milestone in this field. In this paper, the AI whom we call Pluribus, is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world.<br />
<br />
== Related Work ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=44099Superhuman AI for Multiplayer Poker2020-11-14T16:49:44Z<p>Hhalim: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
For many years, most superhuman AI systems could only beat human players in two-player zero-sum games, such as checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common approach these systems use is to approximate a Nash equilibrium. A Nash equilibrium is a set of strategies, one per player, in which no player can improve their expected payoff by unilaterally changing their own strategy. A Nash equilibrium has been proven to exist in every finite game; the challenge is to find one. To summarize, in two-player zero-sum games a Nash equilibrium strategy is unbeatable, since it is guaranteed not to lose in expectation regardless of what the opponent does.<br />
<br />
== Related Work ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_Poker&diff=43894Superhuman AI for Multiplayer Poker2020-11-13T18:30:36Z<p>Hhalim: Created page with "== Presented by == Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty == Introduction == Lorem Ipsum Bla bla bla == Related Work == Lorem Ipsum Bla bla..."</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Related Work ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Layer for Processing Missing Data ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Theoretical Analysis ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Experimental Results ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Discussion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Conclusion ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== Critiques ==<br />
<br />
Lorem Ipsum Bla bla bla<br />
<br />
== References ==<br />
[1] Lorem Ipsum Bla bla bla<br />
[2] Lorem Ipsum Bla bla bla</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=42791F21-STAT 441/841 CM 763-Proposal2020-10-21T20:43:12Z<p>Hhalim: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-age adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to use machine learning techniques to automatically screen images for the disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman, Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will use the Prudential Life Insurance data set that the authors used and have shared with us. We will pre-process the data to replace missing values, apply feature selection using CFS and feature reduction using PCA, and then use the processed data to perform classification via four algorithms: Neural Networks, Random Tree, REPTree, and Multiple Linear Regression. We will compare the performance of these algorithms using the MAE and RMSE metrics and produce visualizations that explain the results clearly, even to a non-quantitative audience. <br />
<br />
Our goal in this project is to learn to apply the algorithms from our class to an industry dataset and to produce results that can aid better, data-driven decision making.<br />
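<br />
The MAE and RMSE metrics named in the description above can be sketched in a few lines. This is a minimal illustration, not the evaluation code from the paper; the function names and the toy values in <code>y_true</code> and <code>y_pred</code> are assumptions made up for the example.<br />

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Toy risk levels (made-up values for illustration only).
y_true = [1, 4, 8, 2]
y_pred = [2, 4, 5, 2]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
```

RMSE squares each error before averaging, so the single large miss in the toy data weighs more heavily in RMSE than in MAE. Reporting both gives a sense of whether errors are uniformly small or dominated by a few large ones.<br />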
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model that, given a set of objects on the road (pedestrians, other cars, etc.), predicts the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] on arXiv. The goal of this paper is to combine CNN and RNN architectures in a way that merges the outputs of both more flexibly than simple concatenation, through the use of a “neural tensor layer”, in order to improve text classification. In particular, the paper claims that the novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to reproduce this paper by working in two pairs, one implementing the CNN pipeline and the other implementing the RNN pipeline. We will work with TensorFlow 2 and Google Colab, and will reproduce the paper’s experimental results by training on the same 6 publicly available datasets used in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in the Kaggle research challenge "Mechanisms of Action (MoA) Prediction". The Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Fund's Library of Integrated Network-Based Cellular Signatures (LINCS) present this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from the IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that recognizes fraudulent transactions with higher accuracy and speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature, and our goal is to learn the various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plate data. First, we plan to obtain data from 4 to 6 participants performing squats or weighted squats, with each participant doing at least 50 to 100 reps, and rate them on a fatigue scale. We will collect data with EMG, IMU, force plates, and Vicon. While the participants are squatting, we will ask them about their fatigue level and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being unable to continue). Once the data is collected, we will classify the motion capture and force plate data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities; that is, we cannot tell which galaxy in a group of possible candidates is actually responsible for the submillimetre emission. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on other available data. With this newly labelled dataset from ALMA, it is possible to develop and test new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), who tested a set of standard classification algorithms on the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centred perspective. Next, we hope to extend their work in one or more of the following directions: (1) Incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate. (2) Taking measurement error into account in the classification algorithms, possibly with a Bayesian approach. (All features in astronomy datasets come from actual physical measurements, which come with an error bar. However, it is not clear how to incorporate this error into the classification task.) (3) Combining some traditional astronomy approaches with algorithms from ML.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States has one of the most actively traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that, using variables such as transaction date, security type, and transaction amount, we can predict the role code for a new transaction. We chose this prediction target because the role of the insider gives investors signals of potential internal activities and private information. Detecting these signals from insider trading activities is crucial for investors who want to benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to help to develop an algorithm to predict a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our execution plan is to apply concepts and algorithms learned in STAT441 and apply multi-label classification. Through the process, our team will learn biological knowledge necessary to complete and enhance our classification thought-process. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition that revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions produced by gas turbine engines. Implementing such a system relies on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hope of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we will apply a number of machine learning algorithms and discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in the paper to measure the effectiveness of our solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies each whale's class based on its tail. There are over 3000 classes and 25361 training images in total. The challenge is that each class has only about 8 training images on average. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
Our team's final project is the ongoing Kaggle competition Mechanisms of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA prediction helps scientists find more targeted drug molecules based on the biological mechanism of a disease, which would substantially shorten the drug development cycle. Here, MoA is studied by applying different drugs to human cells and analyzing the corresponding reactions; the dataset provides simultaneous measurements of 100 types of human cells and 5000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we will try a selection of ML algorithms, such as logistic regression, tree-based methods, and SVM, and find the one that best completes the task. Depending on how we perform, we might use other techniques such as model ensembling or stacking.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to the description of the biological activity of a given molecule, and scientists have a specific interest in the MoA of molecules as it pertains to the advancement of drugs. This is because, under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoA.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.<br />
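<br />
As a rough illustration of the logarithmic loss mentioned above, the sketch below computes binary log loss for a single target. The function name <code>log_loss</code>, the clipping constant <code>eps</code>, and the toy inputs are assumptions for this example. How Kaggle aggregates the loss across the multiple MoA targets is not specified here and is left out.<br />

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Toy labels with three sets of predicted probabilities (made-up values).
labels = [1, 0, 1, 0]
uninformative = log_loss(labels, [0.5, 0.5, 0.5, 0.5])
confident_right = log_loss(labels, [0.9, 0.1, 0.9, 0.1])
confident_wrong = log_loss(labels, [0.1, 0.9, 0.1, 0.9])
```

Log loss penalizes confident mistakes much more heavily than uncertain ones, which is why model selection under this metric tends to favour well-calibrated probabilities over hard class predictions.<br />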
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
Budaraju, Sai Arvind<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model and use area under ROC curve to evaluate our model.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for the self-driving car using our machine learning knowledge and the provided training and testing data sets. The motion prediction model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not yet sure whether we have to classify the agents into the three categories (cars, cyclists, pedestrians) ourselves. If so, we will start with the single-shot detector algorithm and improve on it.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing result in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
As our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu to improve the iteration complexity of power iteration. In doing so, we aim to achieve a final rate of 𝒪(1/sqrt(Δ)) in our reproduction. If time permits, we also hope to explore and discuss the potential for applying such an acceleration method to other non-convex optimization problems, as mentioned in the original paper. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
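<br />
For orientation, here is a minimal deterministic sketch of power iteration with a momentum term, using the update w_next = A w - beta * w_prev. This is our reading of the basic building block behind the accelerated method, not the authors' stochastic algorithm. The function names, the small test matrix, and the choice of <code>beta</code> (one quarter of the squared second eigenvalue, which is known here only because the matrix is tiny) are assumptions for illustration.<br />

```python
import math


def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def power_iteration_momentum(A, beta, num_iters=200):
    """Estimate the top eigenvalue of symmetric A; beta = 0 gives plain power iteration."""
    n = len(A)
    w_prev = [0.0] * n
    w = [1.0 / math.sqrt(n)] * n
    for _ in range(num_iters):
        Aw = matvec(A, w)
        w_next = [a - beta * p for a, p in zip(Aw, w_prev)]
        scale = norm(w_next)
        # Rescale both iterates by the same factor so the recurrence is preserved.
        w_prev = [x / scale for x in w]
        w = [x / scale for x in w_next]
    Aw = matvec(A, w)
    # Rayleigh quotient of the final iterate as the eigenvalue estimate.
    eigenvalue = sum(a * x for a, x in zip(Aw, w)) / sum(x * x for x in w)
    return eigenvalue, w


# Small symmetric example with eigenvalues (5 + sqrt(5)) / 2 and (5 - sqrt(5)) / 2.
A = [[2.0, 1.0], [1.0, 3.0]]
second_eigenvalue = (5.0 - math.sqrt(5.0)) / 2.0
beta = second_eigenvalue ** 2 / 4.0

accelerated, _ = power_iteration_momentum(A, beta)
plain, _ = power_iteration_momentum(A, 0.0)
```

In practice the second eigenvalue is unknown, so <code>beta</code> would have to be tuned or estimated; the sketch only shows the shape of the update.<br />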
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to be involved in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a motion prediction model for the self-driving car by machine learning with the datasets they provided.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. Mechanism of Action, MoA for short, describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response to a drug based on its similarities to other known drugs. <br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend Romdhane et al.’s 2020 paper “Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss”. In this paper, the authors develop a deep convolutional neural network that exploits a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested against two ECG datasets, MIT-BIH and INCART, and returned a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate. <br />
Interestingly, focal loss was implemented to prevent bias towards larger classes (normal heartbeats) without needing to augment the smaller class data (diseased heartbeats); however, the authors did not report which of the two approaches, focal loss or data augmentation, actually performs better. Therefore, we hope to extend their work by answering this question in this project.<br />
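To make the role of focal loss concrete, here is a minimal sketch of the per-example focal loss in its commonly cited form, FL = -alpha * (1 - p_t)^gamma * log(p_t). The function name, the default values of gamma and alpha, and the example probabilities are assumptions for illustration and are not taken from the paper being replicated.<br />

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single example.

    probs  : predicted class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to weighted cross-entropy
    alpha  : class weight (hypothetical default of 1.0)
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clip to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Five classes, e.g. N, S, V, Q, F (illustrative probabilities only).
easy = focal_loss([0.90, 0.04, 0.03, 0.02, 0.01], target=0)  # confident, correct
hard = focal_loss([0.20, 0.60, 0.10, 0.05, 0.05], target=0)  # poorly classified
print(easy, hard)
```

The factor (1 - p_t)^gamma shrinks the loss of well-classified examples, so the abundant, easy class contributes less to training than it would under plain cross-entropy.<br />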
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to dig deeper into single-hidden-layer neural networks, building on what we have learned in class. We will focus on Gaussian-distributed data and weights so that we can provide an expression for the spectrum in the limit of infinite width. We believe that we can improve the efficiency of first-order optimization methods by using this spectrum. <br />
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Convolution Neural Network for Rainy day Prediction<br />
<br />
'''Description:'''<br />
<br />
Our project applies a convolutional neural network to rainy day prediction. The goal is to predict whether tomorrow will be a rainy day using historical data from the past week and indicators such as temperature. We plan to obtain past weather data through the Yahoo web API.<br />
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 31 Group members:''' <br />
<br />
Tompkins, Grace<br />
<br />
Krikella, Tatiana<br />
<br />
'''Title:''' An application of Adapting Neural Networks for the Estimation of Treatment Effects (Shi, Blei, and Veitch 2019)<br />
'''Description:'''<br />
We will be using the methodology presented in "Adapting Neural Networks for the Estimation of Treatment Effects" by Claudia Shi, David M. Blei, and Victor Veitch and applying it to a new dataset and simulated data. This method is used to estimate treatment effects from observational data via an architecture called "Dragonnet", which uses propensity scoring for estimation adjustment and targeted regularization. This method has been shown to outperform existing methods on benchmark datasets, and we will apply it to a new dataset (TBD) and simulated data to evaluate its performance for classification and prediction.<br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 32 Group members:''' <br />
<br />
Taohao Wang<br />
Zeren Shen<br />
Zihao Guo<br />
Rui Chen<br />
<br />
'''Title:''' Google Landmark Recognition 2020<br />
<br />
'''Description:'''<br />
Our team decided to try the "Google Landmark Recognition 2020" competition on Kaggle,<br />
in which competitors are asked to build a model that detects any existing landmarks within the provided test images.<br />
This competition is challenging in its own way: its data contains more than 81K classes, on which a traditional CNN would very<br />
likely fail (too many parameters to train, especially when taking convolutional layers into account). We would like to implement several <br />
algorithms/frameworks that can utilize a large amount of data with noisy labels, apply them to the provided dataset, and compare their performance (training time, <br />
number of parameters trained, multiple metrics for accuracy/loss evaluation, etc.) for our report.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 33 Group members:''' <br />
<br />
Hansa Halim<br />
<br />
Sanjana Rajendra Naik<br />
<br />
Samka Marfua<br />
<br />
Shawrupa Proshasty<br />
<br />
'''Title:''' Superhuman AI for multiplayer poker (Brown and Sandholm 2019)<br />
<br />
'''Description:'''<br />
Our team aims to recreate the paper “Superhuman AI for multiplayer poker” by Noam Brown and Tuomas Sandholm. The paper describes the algorithm the authors used to train an AI to play poker, which is primarily Monte Carlo counterfactual regret minimization (CFR). Poker is a great example for training AI with incomplete information. Furthermore, since it is a multiplayer game, it presents additional complications when training the AI. The authors use both information abstraction and action abstraction to reduce the number of different actions the AI must consider.<br />
We aim to replicate this algorithm for at least 2 players to begin with.<br />
<br />
Link to paper: [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper]</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_441/841_CM_763-Proposal&diff=42790F21-STAT 441/841 CM 763-Proposal2020-10-21T20:42:20Z<p>Hhalim: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
--------------------------------------------------------------------<br />
<br />
'''Project # 1 Group members:'''<br />
<br />
Song, Quinn<br />
<br />
Loh, William<br />
<br />
Bai, Junyue<br />
<br />
Choi, Phoebe<br />
<br />
'''Title:''' APTOS 2019 Blindness Detection<br />
<br />
'''Description:'''<br />
<br />
Our team chose the APTOS 2019 Blindness Detection Challenge from Kaggle. The goal of this challenge is to build a machine learning model that detects diabetic retinopathy by screening retina images.<br />
<br />
Millions of people suffer from diabetic retinopathy, the leading cause of blindness among working-aged adults. It is caused by damage to the blood vessels of the light-sensitive tissue at the back of the eye (retina). In rural areas where medical screening is difficult to conduct, it is challenging to detect the disease efficiently. Aravind Eye Hospital hopes to utilize machine learning techniques to gain the ability to automatically screen images for disease and provide information on how severe the condition may be.<br />
<br />
Our team plans to solve this problem by applying our knowledge in image processing and classification.<br />
<br />
<br />
----<br />
<br />
'''Project # 2 Group members:'''<br />
<br />
Li, Dylan<br />
<br />
Li, Mingdao<br />
<br />
Lu, Leonie<br />
<br />
Sharman, Bharat<br />
<br />
'''Title:''' Risk prediction in life insurance industry using supervised learning algorithms<br />
<br />
'''Description:'''<br />
<br />
In this project, we aim to replicate and possibly improve upon the work of Jayabalan et al. in their paper “Risk prediction in life insurance industry using supervised learning algorithms”. We will use the Prudential Life Insurance data set that the authors used and have shared with us. We will pre-process the data to replace missing values, apply feature selection using CFS and feature reduction using PCA, and then use the processed data to perform classification via four algorithms: Neural Networks, Random Tree, REPTree and Multiple Linear Regression. We will compare the performance of these algorithms using the MAE and RMSE metrics and produce visualizations that explain the results clearly, even to a non-quantitative audience. <br />
<br />
Our goal in this project is to learn to apply the algorithms that we learned in class to an industry dataset and to produce results that can aid better, data-driven decision making.<br />
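<br />
For reference, the two comparison metrics named in this proposal can be sketched in a few lines. This is a minimal illustration that assumes numeric risk scores; the function names and sample values are hypothetical and not taken from the paper.<br />

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |truth - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical risk levels and model predictions.
truth = [1, 2, 3]
preds = [1, 2, 5]
print(mae(truth, preds), rmse(truth, preds))
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a fuller picture of how each model's errors are distributed.<br />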
<br />
----<br />
<br />
'''Project # 3 Group members:'''<br />
<br />
Parco, Russel<br />
<br />
Sun, Scholar<br />
<br />
Yao, Jacky<br />
<br />
Zhang, Daniel<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Lyft Motion Prediction for Autonomous Vehicles Kaggle competition. The aim of this competition is to build a model that, given a set of objects on the road (pedestrians, other cars, etc.), predicts the future movement of these objects.<br />
<br />
Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Our aim is to apply classification techniques learned in class to optimally predict how these objects move.<br />
<br />
----<br />
<br />
'''Project # 4 Group members:'''<br />
<br />
Chow, Jonathan<br />
<br />
Dharani, Nyle<br />
<br />
Nasirov, Ildar<br />
<br />
'''Title:''' Classification with Abstinence<br />
<br />
'''Description:''' <br />
<br />
We seek to implement the algorithm described in [https://papers.nips.cc/paper/9247-deep-gamblers-learning-to-abstain-with-portfolio-theory.pdf Deep Gamblers: Learning to Abstain with Portfolio Theory]. The paper describes augmenting classification problems to include the option of abstaining from making a prediction when confidence is low.<br />
<br />
Medical imaging diagnostics is a field in which classification could assist professionals and improve life expectancy for patients through increased accuracy. However, there are also severe consequences to incorrect predictions. As such, we also hope to apply the algorithm implemented to the classification of medical images, specifically instances of normal and pneumonia [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia? chest x-rays]. <br />
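<br />
The idea of abstaining when confidence is low can be illustrated with a simple confidence threshold. Note that this is a deliberately simplified stand-in and not the portfolio-theory loss from the Deep Gamblers paper; the function name and the threshold value are assumptions for illustration.<br />

```python
def predict_or_abstain(probs, threshold=0.8):
    """Return the predicted class index, or None to abstain.

    probs     : predicted class probabilities for one input
    threshold : minimum confidence required to commit (hypothetical value)
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None

# Classes: 0 = normal, 1 = pneumonia (illustrative probabilities only).
print(predict_or_abstain([0.95, 0.05]))  # confident: predicts class 0
print(predict_or_abstain([0.55, 0.45]))  # uncertain: abstains
```

In a diagnostic setting, the abstained cases would be the ones referred to a professional for review.<br />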
<br />
----<br />
<br />
'''Project # 5 Group members:'''<br />
<br />
Jones, Hayden<br />
<br />
Leung, Michael<br />
<br />
Haque, Bushra<br />
<br />
Mustatea, Cristian<br />
<br />
'''Title:''' Combine Convolution with Recurrent Networks for Text Classification<br />
<br />
'''Description:''' <br />
<br />
Our team chose to reproduce the paper [https://arxiv.org/pdf/2006.15795.pdf Combine Convolution with Recurrent Networks for Text Classification] from arXiv. The goal of this paper is to improve text classification by combining CNN and RNN architectures through a “neural tensor layer”, which merges the outputs of both architectures more flexibly than simple concatenation. In particular, the paper claims that the novel architecture excels at the following types of text classification: sentiment analysis, news categorization, and topical classification. Our team plans to recreate this paper by working in pairs, with one pair implementing the CNN pipeline and the other implementing the RNN pipeline. We will be working with TensorFlow 2 and Google Colab, and will reproduce the paper’s experimental results by training on the same 6 publicly available datasets used in the paper.<br />
<br />
----<br />
<br />
'''Project # 6 Group members:'''<br />
<br />
Chin, Ruixian<br />
<br />
Ong, Jason<br />
<br />
Chiew, Wen Cheen<br />
<br />
Tan, Yan Kai<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team chose to participate in the Kaggle research challenge "Mechanisms of Action (MoA) Prediction". This competition is presented by the Broad Institute of MIT and Harvard, the Laboratory for Innovation Science at Harvard (LISH), and the NIH Common Funds Library of Integrated Network-Based Cellular Signatures (LINCS), with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
----<br />
<br />
'''Project # 7 Group members:'''<br />
<br />
Ren, Haotian <br />
<br />
Cheung, Ian Long Yat<br />
<br />
Hussain, Swaleh <br />
<br />
Zahid, Bin, Haris <br />
<br />
'''Title:''' Transaction Fraud Detection <br />
<br />
'''Description:''' <br />
<br />
Protecting people from fraudulent transactions is an important topic for all banks and internet security companies. This Kaggle project is based on the dataset from the IEEE Computational Intelligence Society (IEEE-CIS). Our objective is to build a more efficient model that recognizes fraudulent transactions with higher accuracy and speed.<br />
----<br />
<br />
'''Project # 8 Group members:'''<br />
<br />
ZiJie, Jiang<br />
<br />
Yawen, Wang<br />
<br />
DanMeng, Cui<br />
<br />
MingKang, Jiang<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles <br />
<br />
'''Description:'''<br />
<br />
Our team chose to participate in the Kaggle Challenge "Lyft Motion Prediction for Autonomous Vehicles". We will apply our science skills to build motion prediction models for self-driving vehicles. The model will be able to predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians. The goal of this competition is to predict the trajectories of other traffic participants.<br />
<br />
----------------------------------------------------------------------<br />
<br />
<br />
'''Project # 9 Group members:'''<br />
<br />
Banno, Dion <br />
<br />
Battista, Joseph<br />
<br />
Kahn, Solomon <br />
<br />
'''Title:''' Increasing Spotify user engagement through predictive personalization<br />
<br />
'''Description:''' <br />
<br />
Our project is an application of classification to the domain of predictive personalization. The goal of the project is to increase Spotify user engagement through data-driven methods. Given a set of users’ demographic data, listening preferences and behaviour, our goal is to build a recommendation system that suggests new songs to users. From a potential pool of songs to suggest, the final song recommendations will be driven by a classification algorithm that measures a given user’s propensity to like a song. We plan on leveraging the Spotify Web API to gather data about songs and collecting user data from consenting peers.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 10 Group members:'''<br />
<br />
Qing, Guo <br />
<br />
Wang, Yuanxin<br />
<br />
James, Ni<br />
<br />
Xueguang, Ma<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:''' <br />
<br />
Our team has decided to participate in the Mechanisms of Action (MoA) Prediction Kaggle competition. This is a challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.<br />
Our team plans to develop an algorithm to predict a compound’s MoA given its cellular signature, and our goal is to learn the various algorithms taught in this course.<br />
<br />
<br />
-----------------------------------------------------------------------<br />
<br />
'''Project # 11 Group members:'''<br />
<br />
Yang, Jiwon <br />
<br />
Mahdi, Anas<br />
<br />
Thibault, Will<br />
<br />
Lau, Jan<br />
<br />
'''Title:''' Application of classification in human fatigue analysis<br />
<br />
'''Description:''' <br />
<br />
The goal of this project is to classify different levels of fatigue based on motion capture (Vicon) and force plates data. First, we plan to obtain data from 4 to 6 participants performing squats or squats with weights and rate them on a fatigue scale, with each participant doing at least 50 to 100 reps. We will collect data with EMG, IMU, force plates, and Vicon. When the participants are squatting, we will ask them about their fatigue level, and compare their feedback against the fatigue level recorded on EMG. The fatigue level will be on a scale of 1 to 10 (1 being not fatigued at all and 10 being cannot continue anymore). Once data is collected, we will classify the motion capture and force plates data into the different levels of fatigue.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 12 Group members:'''<br />
<br />
Xiaolan Xu, <br />
<br />
Robin Wen, <br />
<br />
Yue Weng, <br />
<br />
Beizhen Chang<br />
<br />
'''Title:''' Identification (Classification) of Submillimetre Galaxies Based on Multiwavelength Data in Astronomy<br />
<br />
'''Description:''' <br />
<br />
Identifying the counterparts of submillimetre galaxies (SMGs) in multiwavelength images is important to the study of galaxy evolution in astronomy. However, obtaining a statistically significant sample of robust associations is very challenging because of the poor angular resolution of single-dish submm facilities; that is, we cannot tell which galaxy among a group of possible candidates is actually responsible for the submillimetre emission. Recently, a labelled dataset was obtained from ALMA, a millimetre/submillimetre telescope array with sufficient resolution to pin down the exact source of submillimetre emission. However, applying such an array to a large fraction of the sky is not feasible, so it is of practical interest to develop algorithms that identify SMGs based on other available data. With this newly labelled dataset from ALMA, it is possible to develop and test new algorithms and apply them to unlabelled data to detect submillimetre galaxies.<br />
<br />
In our work, we primarily build on the work of Liu et al. (https://arxiv.org/abs/1901.09594), which applied a set of standard classification algorithms to the dataset. We aim to first reproduce their work and test other classification algorithms from a more statistics-centred perspective. Next, we hope to extend their work in one or more of the following directions: (1) Incorporating other relevant features to augment the dimensions of the available dataset for a better classification rate. (2) Taking measurement error into account in the classification algorithms, possibly through a Bayesian approach. (All features in astronomy datasets come from actual physical measurements, which come with error bars. However, it is not clear how to incorporate this error into the classification task.) (3) Combining some traditional astronomy approaches with algorithms from ML.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 13 Group members:'''<br />
<br />
<br />
Zihui (Betty) Qin,<br />
<br />
Wenqi (Maggie) Zhao,<br />
<br />
Muyuan Yang,<br />
<br />
Amartya (Marty) Mukherjee,<br />
<br />
'''Title:''' Insider Trading Roles Classification Prediction on United States conventional stock or non-derivative transaction<br />
<br />
'''Description:'''<br />
<br />
Background (why we were interested in classifying based on insiders): <br />
The United States has one of the most actively traded financial markets in the world. The dataset captures all insider activities as reported on SEC (U.S. Securities and Exchange Commission) forms 3, 4, 5, and 144. We believe that, using variables such as transaction date, security type, and transaction amount, we can predict the role code for a new transaction. We chose this prediction target because the role of the insider gives investors signals of potential internal activities and private information. Detecting these signals from insider trading activities is crucial for investors who want to benefit from the market. <br />
<br />
Goal: To classify the role of an insider in a company based on the data of their trades.<br />
<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 14 Group members:'''<br />
<br />
Jung, Kyle<br />
<br />
Kim, Dae Hyun<br />
<br />
Lee, Stan<br />
<br />
Lim, Seokho<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction Competition<br />
<br />
'''Description:''' The main objective of this Kaggle competition is to help to develop an algorithm to predict a compound's MoA given its cellular signature, helping scientists advance the drug discovery process. Our execution plan is to apply concepts and algorithms learned in STAT441 and apply multi-label classification. Through the process, our team will learn biological knowledge necessary to complete and enhance our classification thought-process. https://www.kaggle.com/c/lish-moa<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 15 Group Members:'''<br />
<br />
Li, Evan<br />
<br />
Abuaisha, Karam<br />
<br />
Vadivelu, Nicholas<br />
<br />
Pu, Jason<br />
<br />
'''Title:''' Predict Students Answering Ability Kaggle Competition<br />
<br />
'''Description:'''<br />
<br />
https://www.kaggle.com/c/riiid-test-answer-prediction<br />
We plan on tackling this Kaggle competition that revolves around classifying whether students are able to answer their next questions correctly. The data provided consists of the student’s historic performance, the performance of other students on the same question, metadata about the question itself, and more. The theme of the competition is to tailor education to a student’s ability as an AI tutor.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 16 Group members:'''<br />
<br />
Hall, Matthew<br />
<br />
Chalaturnyk, Johnathan<br />
<br />
'''Title:''' Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS<br />
<br />
'''Description:'''<br />
<br />
Predictive emission monitoring systems (PEMS) are used in conjunction with measurement instruments to predict the amount of emissions produced by gas turbine engines. The implementation of such a system relies on the availability of proper measurements and ecological data points. We will attempt to adjust the novel PEMS implementation from this paper in the hope of improving the prediction of CO and NOx emission levels from the turbines. Using data points collected over the previous five years, we will apply a number of machine learning algorithms and discuss possible future research areas. Finally, we will compare our methods against the benchmark presented in this paper in order to measure the effectiveness of our solutions.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 17 Group members:'''<br />
<br />
Yang, Junyi<br />
<br />
Wang, Jill Yu Chieh<br />
<br />
Wu, Yu Min<br />
<br />
Li, Calvin<br />
<br />
'''Title:''' Humpback Whale Identification<br />
<br />
'''Description:'''<br />
<br />
Our team will participate in the Kaggle challenge Humpback Whale Identification. The main objective is to build a multi-class classification model that identifies each whale's class based on its tail. There are over 3000 classes and 25361 training images in total. The challenge is that each class has, on average, only about 8 training images. <br />
<br />
------------------------------------------------------------------------<br />
'''Project # 18 Group members:''' <br />
<br />
Lian, Jinjiang <br />
<br />
Zhu, Yisheng <br />
<br />
Huang, Mingzhe <br />
<br />
Hou, Jiawen <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction <br />
<br />
'''Description:''' <br />
<br />
The final project of our team is the ongoing Kaggle competition Mechanisms of Action (MoA) Prediction. The goal is to improve MoA prediction algorithms to assist and advance drug development. MoA algorithms help scientists find more targeted drug molecules based on the biological mechanism of a disease, which would substantially shorten the drug development cycle. Here, MoA data are obtained by applying different drugs to human cells and analyzing the corresponding reactions; the dataset provides simultaneous measurements of 100 types of human cells and 5000 drugs. <br />
<br />
To tackle this competition, after data cleaning and feature engineering, we are going to try a selection of ML algorithms such as logistic regression, tree-based methods, SVM, etc., and find the one that best completes the task. Depending on how we perform, we might utilize other techniques such as model ensembling or stacking.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 19 Group members:''' <br />
<br />
Fagan, Daniel <br />
<br />
Brooke, Cooper <br />
<br />
Perelman, Maya <br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction (https://www.kaggle.com/c/lish-moa/overview/description)<br />
<br />
'''Description:''' <br />
<br />
For our final project, we will be competing in the Mechanisms of Action (MoA) Prediction Research Challenge on Kaggle. MoA refers to the description of the biological activity of a given molecule, and scientists have specific interest in the MoA of molecules as it pertains to the advancement of drugs. This is because under new frameworks, scientists are looking to develop molecules that can modulate protein targets associated with given diseases. Our task will be to analyze a dataset containing human cellular responses to more than 5,000 drugs and to classify these responses with one or more MoAs.<br />
<br />
For this competition, we plan to use various classification algorithms taught in STAT 441 followed by model validation techniques to ultimately select the most accurate model based on the logarithmic loss function which was specified by Kaggle.<br />
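<br />
As a rough illustration of the evaluation metric mentioned above, the sketch below computes a logarithmic loss averaged over every (sample, target) pair in a multi-label problem. The function name, the clipping constant, and the toy data are assumptions for illustration; the competition's exact scoring rules should be taken from Kaggle.<br />

```python
import math

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over all samples and all target columns.

    y_true : list of rows of 0/1 labels (one column per MoA target)
    y_pred : list of rows of predicted probabilities, same shape
    eps    : clipping constant to avoid log(0) (assumed value)
    """
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# Two hypothetical drugs, three hypothetical MoA targets.
labels = [[1, 0, 0], [0, 1, 1]]
preds = [[0.8, 0.1, 0.2], [0.3, 0.7, 0.6]]
print(mean_columnwise_log_loss(labels, preds))
```

Lower values are better, and confident wrong predictions are penalized heavily, which is why model validation on held-out data matters when selecting the final model.<br />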
<br />
------------------------------------------------------------------------<br />
'''Project # 20 Group members:''' <br />
Cheng, Leyan<br />
<br />
Dai, Mingyan<br />
<br />
Jiang, Daniel <br />
<br />
Huang, Jerry<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
We will be competing in the Riiid! Kaggle Challenge. The goal of this challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.<br />
<br />
We plan on using the classification techniques and model validation techniques learned in the course in order to design an algorithm that can accurately predict the actions of students.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 21 Group members:''' <br />
<br />
Carson, Emilee<br />
<br />
Ellmen, Isaac<br />
<br />
Mohammadrezaei, Dorsa<br />
<br />
Budaraju, Sai Arvind<br />
<br />
<br />
'''Title:''' Classifying SARS-CoV-2 region of origin based on DNA/RNA sequence<br />
<br />
'''Description:'''<br />
<br />
Determining the location of origin for a viral sequence is an important tool for epidemiological tracking. Knowing where a virus comes from allows epidemiologists to track how a virus is spreading. There are significant efforts to track the spread of SARS-CoV-2. As an RNA virus, SARS-CoV-2 mutates frequently. Most of these mutations carry negligible changes to the function of the virus but act as “barcodes” for specific strains. As the virus spreads in a region, it picks up mutations which allow researchers to identify which sequences correspond to which regions.<br />
<br />
The standard method for classifying viruses based on location is to:<br />
<br />
- Perform a multiple sequence alignment (MSA)<br />
<br />
- Build a phylogenetic tree of the MSA<br />
<br />
- Empirically determine which regions have which sections of the tree<br />
<br />
Phylogenetic trees are an excellent tool for tracking evolutionary changes over time but we wonder if there are better methods for classifying the region of origin for a virus using machine learning techniques.<br />
<br />
Our plan is to perform PCA on the MSA which is available through GISAID. We will determine an appropriate encoding for sequence alignments to vectors and map the aligned sequences onto a much lower dimensional space. We will then use LDA or QDA to classify points based on region (continent). We will also examine if the same technique works well for classifying sequences based on state of origin for samples from the United States. We may try other classification techniques such as logistic regression or neural nets. Finally, we know that projecting data to a small number of principal components and then projecting back to the original space can reduce noise in certain datasets. In the case of mutations, this might correspond to removing insignificant mutations. It is possible that there are certain mutations which induce functional changes in the virus which would be of greater medical interest. Our hope is that we could detect these using PCA.<br />
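<br />
One possible encoding of aligned sequences as vectors, the first step of the plan above, is a per-position one-hot encoding. The sketch below assumes a five-symbol alphabet (A, C, G, T and the gap character) and maps any other symbol to an all-zero block; both choices are illustrative assumptions, not a settled design. The resulting vectors are what PCA would then project onto a lower-dimensional space.<br />

```python
ALPHABET = "ACGT-"  # assumed symbols; "-" marks an alignment gap

def one_hot_encode(seq, alphabet=ALPHABET):
    """Map an aligned sequence to a flat 0/1 vector.

    Each position becomes len(alphabet) entries with a single 1.0 at the
    matching symbol; unknown symbols (e.g. "N") become all zeros.
    """
    vec = []
    for ch in seq.upper():
        vec.extend(1.0 if ch == a else 0.0 for a in alphabet)
    return vec

# Toy aligned fragments of equal length (not real SARS-CoV-2 data).
alignment = ["ACG-T", "ACGAT", "TCG-T"]
vectors = [one_hot_encode(s) for s in alignment]
print(len(vectors[0]))  # 5 positions x 5 symbols
```

Because every sequence in a multiple sequence alignment has the same length, all encoded vectors share the same dimension and can be stacked into a single data matrix.<br />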
<br />
------------------------------------------------------------------------<br />
'''Project # 22 Group members:''' <br />
<br />
Chang, Luwen<br />
<br />
Yu, Qingyang<br />
<br />
Kong, Tao <br />
<br />
Sun, Tianrong<br />
<br />
'''Title:''' Riiid! Answer Correctness Prediction<br />
<br />
'''Description:'''<br />
<br />
For the final project, we chose the featured Kaggle Competition named Riiid! Answer Correctness Prediction. The purpose of this challenge is to build a machine learning model to predict the students' interaction performance. (https://www.kaggle.com/c/riiid-test-answer-prediction)<br />
<br />
We plan to use classification and regression techniques learned in this course to build the model and use area under ROC curve to evaluate our model.<br />
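<br />
The area under the ROC curve can be computed directly from its probabilistic interpretation: the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a minimal, quadratic-time illustration; the function name and the toy data are assumptions for illustration.<br />

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels : 0/1 ground truth (1 = answered correctly)
    scores : predicted probabilities or scores, same length
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the metric does not depend on choosing a classification threshold.<br />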
<br />
------------------------------------------------------------------------<br />
'''Project # 23 Group members:''' <br />
<br />
Han, Jihoon<br />
<br />
Vera De Casey<br />
<br />
Jawad Solaiman<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We are planning to compete in the Lyft Motion Prediction for Autonomous Vehicles Challenge on Kaggle. Our goal is to build a motion prediction model for the self-driving car by using our machine learning knowledge and the provided training and testing data sets. The motion prediction model will predict the motion of traffic agents around the car, such as cars, cyclists, and pedestrians. We are not sure if we have to classify the agents into three categories (cars, cyclists, pedestrians) ourselves. If so, we will start with the single-shot detector algorithm and improve on it.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 24 Group members:''' <br />
<br />
Guanting Pan<br />
<br />
Haocheng Chang <br />
<br />
Zaiwei Zhang<br />
<br />
'''Title:''' Reproducing Results in Accelerated Stochastic Power Iteration<br />
<br />
'''Description:'''<br />
<br />
As our final project, we will reproduce the stochastic PCA algorithm designed by De Sa, He, Mitliagkas, Ré, and Xu to improve the iteration complexity of power iteration. In doing so, we aim to achieve a final rate of 𝒪(1/sqrt(Δ)) in our reproduction. If time permits, we also hope to explore and discuss the potential for applying such an acceleration method to other non-convex optimization problems, as mentioned in the original paper. Link to the paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6557638/pdf/nihms-993807.pdf<br />
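<br />
The core idea can be sketched in its deterministic form: power iteration with a momentum term, w_{t+1} = A w_t − β w_{t−1}, with both iterates rescaled by the norm of the new one. The code below is a minimal illustration under assumptions: the 2×2 matrix, the fixed starting vector, and the helper names are hypothetical, and β = 0.25 is chosen as λ₂²/4 for this toy matrix (eigenvalues 3 and 1). It does not implement the stochastic, variance-reduced variant studied in the paper.<br />

```python
import math


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def power_iteration_momentum(A, beta, num_iters=100):
    """Power iteration with momentum: w_next = A w - beta * w_prev.

    Setting beta = 0 recovers plain power iteration. Both w and w_prev are
    divided by the same norm so the linear recurrence is preserved up to scale.
    """
    n = len(A)
    w_prev = [0.0] * n
    w = [1.0] + [0.0] * (n - 1)  # fixed start; assumed not orthogonal to the top eigenvector
    for _ in range(num_iters):
        Aw = matvec(A, w)
        w_next = [a - beta * p for a, p in zip(Aw, w_prev)]
        norm = math.sqrt(sum(v * v for v in w_next))
        w_prev = [v / norm for v in w]
        w = [v / norm for v in w_next]
    return w


# Toy symmetric matrix with eigenvalues 3 and 1; the top eigenvector is (1, 1) / sqrt(2).
A = [[2.0, 1.0], [1.0, 2.0]]
dominant = power_iteration_momentum(A, beta=0.25)
plain = power_iteration_momentum(A, beta=0.0)
```

On such a well-separated toy matrix both variants converge quickly; the benefit of momentum is expected to show when the eigengap Δ is small.<br />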
<br />
------------------------------------------------------------------------<br />
'''Project # 25 Group members:''' <br />
<br />
Haoran Dong<br />
<br />
Mushi Wang<br />
<br />
Siyuan Qiu<br />
<br />
Yan Yu<br />
<br />
'''Title:''' Lyft Motion Prediction for Autonomous Vehicles<br />
<br />
'''Description:'''<br />
<br />
We want to take part in the Kaggle challenge "Lyft Motion Prediction for Autonomous Vehicles". The goal is to build a machine learning model that predicts motion for the self-driving car, using the datasets provided by the competition.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 26 Group members:''' <br />
<br />
Sangeeth Kalaichanthiran<br />
<br />
Evan Peters<br />
<br />
Cynthia Mou<br />
<br />
Yuxin Wang<br />
<br />
'''Title:''' Mechanisms of Action (MoA) Prediction<br />
<br />
'''Description:'''<br />
<br />
Our team chose the "Mechanisms of Action (MoA) Prediction" challenge on Kaggle. A mechanism of action, MoA for short, describes the biological response of human cells to a particular molecule (the drug). The goal is to develop an algorithm that can predict the biological response to a drug based on its similarities to other known drugs.<br />
<br />
Our team hopes to develop a superior algorithm by using our knowledge of supervised learning methods.<br />
<br />
------------------------------------------------------------------------<br />
'''Project # 27 Group members:''' <br />
<br />
Delaney Smith<br />
<br />
Mohammad Assem Mahmoud<br />
<br />
'''Title:''' Replicating "Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss"<br />
<br />
'''Description:'''<br />
<br />
For our project, we intend to replicate and, hopefully, extend the work of Romdhane et al.’s 2020 paper “Electrocardiogram heartbeat classification based on a deep convolutional neural network and focal loss”. In this paper, the authors develop a deep convolutional neural network that uses a novel loss function, focal loss, to classify heartbeats into five arrhythmia categories (N, S, V, Q and F) based on the AAMI standard. The network was trained and tested on two ECG datasets, MIT-BIH and INCART, and returned a 98.41% overall accuracy, a 98.38% overall F1-score, a 98.37% overall precision and a 98.41% overall recall, which we intend to replicate.<br />
Interestingly, focal loss was implemented to prevent bias towards larger classes (normal heartbeats) without needing to augment the smaller class data (diseased heartbeats); however, the authors did not report which of the two approaches actually performs better. Therefore, we hope to extend their work by answering this question in this project.<br />
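<br />
To make the role of focal loss concrete, the sketch below uses its standard per-example form, FL(p_t) = −α (1 − p_t)^γ log(p_t), where p_t is the probability the model assigns to the true class. The default values γ = 2 and α = 1 and the function name <code>focal_loss</code> are assumptions for illustration; they are not taken from Romdhane et al.<br />

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example, given the predicted probability of its true class.

    With gamma = 0 and alpha = 1 this reduces to the usual cross-entropy, -log(p_true).
    Larger gamma shrinks the loss of well-classified examples (p_true close to 1).
    """
    p = min(max(p_true, 1e-12), 1.0)  # clip to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy = focal_loss(0.9)                           # confidently correct example
hard = focal_loss(0.1)                           # badly misclassified example
cross_entropy_easy = focal_loss(0.9, gamma=0.0)  # plain cross-entropy for comparison
```

The easy example is down-weighted by a factor of (1 − 0.9)² relative to cross-entropy, while the hard example keeps most of its loss. This is how the loss shifts attention away from the abundant, easily classified normal beats.<br />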
------------------------------------------------------------------------<br />
'''Project # 28 Group members:''' <br />
<br />
Fang Yuqin<br />
<br />
Fu Rao<br />
<br />
Li Siqi<br />
<br />
Zhou Zeping<br />
<br />
'''Title:''' The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network<br />
<br />
'''Description:'''<br />
Our group aims to investigate single-hidden-layer neural networks further, building on what we have learned in class. We will focus on Gaussian-distributed data and weights, so that we can derive an expression for the spectrum in the limit of infinite width. We believe that we can improve the efficiency of first-order optimization by exploiting the spectrum.<br />
------------------------------------------------------------------------<br />
'''Project # 29 Group members:''' <br />
<br />
Rui Gong<br />
<br />
Xuetong Wang<br />
<br />
Xinqi Ling<br />
<br />
Di Ma<br />
<br />
'''Title:''' Convolutional Neural Network for Rainy Day Prediction<br />
<br />
'''Description:'''<br />
<br />
Our project applies a convolutional neural network to rainy-day prediction. The goal is to predict whether tomorrow will be rainy, using historical data from the past week and indicators such as temperature. We plan to obtain past weather data through the Yahoo web API.<br />
------------------------------------------------------------------------<br />
'''Project # 30 Group members:''' <br />
<br />
Jiabao Dong<br />
<br />
Jiaxiang Liu<br />
<br />
Siyuan Xia<br />
<br />
Yipeng Du<br />
<br />
'''Title:''' Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation<br />
<br />
'''Description:'''<br />
We aim to replicate the work demonstrated in [https://papers.nips.cc/paper/8632-privacy-preserving-classification-of-personal-text-messages-with-secure-multi-party-computation.pdf Privacy-Preserving Classification of Personal Text Messages with Secure Multi-Party Computation]. <br />
<br />
Personal text classification has many useful applications such as mental health care and security surveillance, but also raises concerns about personal privacy. The method proposed in this paper is based on Secure Multiparty Computation (SMC) and avoids (un)intentional privacy violations. The method then extracts features from texts and classifies with logistic regression and tree ensembles. This paper claims to have proposed the first privacy-preserving (PP) solution for text classification that is provably secure.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 31 Group members:''' <br />
<br />
Tompkins, Grace<br />
<br />
Krikella, Tatiana<br />
<br />
'''Title:''' An application of Adapting Neural Networks for the Estimation of Treatment Effects (Shi, Blei, and Veitch 2019)<br />
'''Description:'''<br />
We will apply the methodology presented in "Adapting Neural Networks for the Estimation of Treatment Effects" by Claudia Shi, David M. Blei, and Victor Veitch to a new dataset and to simulated data. This method estimates treatment effects from observational data via an architecture called "Dragonnet", which uses propensity scoring for estimation adjustment, together with targeted regularization. It has been shown to outperform existing methods on benchmark datasets, and we will apply it to a new dataset (TBD) and simulated data to evaluate its performance for classification and prediction.<br />
<br />
We will use R for analysis.<br />
<br />
Link to paper: [http://papers.nips.cc/paper/8520-adapting-neural-networks-for-the-estimation-of-treatment-effects]<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 32 Group members:''' <br />
<br />
Taohao Wang<br />
Zeren Shen<br />
Zihao Guo<br />
Rui Chen<br />
<br />
'''Title:''' Google Landmark Recognition 2020<br />
<br />
'''Description:'''<br />
Our team decided to try the "Google Landmark Recognition 2020" competition on Kaggle, in which competitors are asked to build a model that detects any existing landmarks within the provided test images. This competition is challenging in its own way: its data contains more than 81K classes, a setting in which a traditional CNN would very likely fail (too many parameters to train, especially when taking convolutional layers into account). We would like to implement several algorithms/frameworks that can utilize a large amount of data with noisy labels, apply them to the provided dataset, and compare their performance (training time, number of parameters trained, multiple metrics for accuracy/loss evaluation, etc.) for our report.<br />
<br />
------------------------------------------------------------------------<br />
<br />
'''Project # 33 Group members:''' <br />
<br />
Hansa Halim<br />
<br />
Sanjana Rajendra Naik<br />
<br />
Samka Marfua<br />
<br />
Shawrupa Proshasty<br />
<br />
'''Title:''' Superhuman AI for multiplayer poker (Brown and Sandholm 2019)<br />
'''Description:'''<br />
Our team aims to reproduce the paper “Superhuman AI for multiplayer poker” by Noam Brown and Tuomas Sandholm. The paper describes the algorithm the authors used to train an AI to play poker, which relies primarily on Monte Carlo counterfactual regret minimization (Monte Carlo CFR). Poker is a good example for training AI with incomplete information. Furthermore, because it is a multiplayer game, training the AI becomes more complicated. The authors use both information abstraction and action abstraction to reduce the number of different actions to be considered by the AI.<br />
We aim to replicate this algorithm for at least 2 players to begin with.<br />
<br />
Link to paper: [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf]</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=42605stat441F212020-10-07T13:19:43Z<p>Hhalim: /* Paper presentation */</p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="250pt"|Name <br />
|width="15pt"|Paper number <br />
|width="700pt"|Title<br />
|width="15pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 ||Sharman Bharat, Li Dylan, Lu Leonie, Li Mingdao || 1|| Risk prediction in life insurance industry using supervised learning algorithms || [https://rdcu.be/b780J Paper] ||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Bsharman Summary] ||<br />
|-<br />
|Week of Nov 16 || Delaney Smith, Mohammad Assem Mahmoud || 2|| Influenza Forecasting Framework based on Gaussian Processes || [https://proceedings.icml.cc/static/paper_files/icml/2020/1239-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Tatianna Krikella, Swaleh Hussain, Grace Tompkins || 3|| Processing of Missing Data by Neural Networks || [http://papers.nips.cc/paper/7537-processing-of-missing-data-by-neural-networks Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Jonathan Chow, Nyle Dharani, Ildar Nasirov ||4 ||Streaming Bayesian Inference for Crowdsourced Classification ||[https://papers.nips.cc/paper/9439-streaming-bayesian-inference-for-crowdsourced-classification.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || || 5|| || || ||<br />
|-<br />
|Week of Nov 16 || || 6|| || || ||<br />
|-<br />
|Week of Nov 16 || || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || || 8|| || || ||<br />
|-<br />
|Week of Nov 16 || || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || || 10|| || || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || Bushra Haque, Hayden Jones, Michael Leung, Cristian Mustatea || 12|| Combine Convolution with Recurrent Networks for Text Classification || [https://arxiv.org/pdf/2006.15795.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 13|| || || ||<br />
|-<br />
|Week of Nov 23 || Qianlin Song, William Loh, Junyue Bai, Phoebe Choi || 14|| Task Understanding from Confusing Multi-task Data || [https://proceedings.icml.cc/static/paper_files/icml/2020/578-Paper.pdf paper] || ||<br />
|-<br />
|Week of Nov 23 || || 15|| || || ||<br />
|-<br />
|Week of Nov 23 || Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang || 16|| || || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Superhuman AI for multiplayer poker || [https://www.cs.cmu.edu/~noamb/papers/19-Science-Superhuman.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Guanting Pan, Haocheng Chang, Zaiwei Zhang || 18|| Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence || [https://arxiv.org/pdf/1809.10770.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Jerry Huang, Daniel Jiang, Minyan Dai, Leyan Cheng || 19|| Neural Speed Reading Via Skim-RNN ||[https://arxiv.org/pdf/1711.02085.pdf?fbclid=IwAR3EeFsKM_b5p9Ox7X9mH-1oI3U3oOKPBy3xUOBN0XvJa7QW2ZeJJ9ypQVo Paper] || ||<br />
|-<br />
|Week of Nov 23 ||Ruixian Chin, Yan Kai Tan, Jason Ong, Wen Cheen Chiew || 20|| DivideMix: Learning with Noisy Labels as Semi-supervised Learning || [https://openreview.net/pdf?id=HJgExaVtwr Paper] || ||<br />
|-<br />
|Week of Nov 30 || || 21|| || || ||<br />
|-<br />
|Week of Nov 30 || || 22|| || || ||<br />
|-<br />
|Week of Nov 30 || || 23|| || || ||<br />
|-<br />
|Week of Nov 30 || || 24|| || || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi, Will Thibault, Jan Lau, Jiwon Yang || 25|| Loss Function Search for Face Recognition || [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || || 26|| || || ||<br />
|-<br />
|Week of Nov 30 || || 27|| || || ||<br />
|-<br />
|Week of Nov 30 || || 28|| || || ||<br />
|-<br />
|Week of Nov 30 || || 29|| || || ||<br />
|-<br />
|Week of Nov 30 || Bertrand Sodjahin, Junyi Yang, Jill Yu Chieh Wang, Yu Min Wu, Calvin Li || 30|| Research paper classification systems based on TF‑IDF and LDA schemes || [https://hcis-journal.springeropen.com/articles/10.1186/s13673-019-0192-7?fbclid=IwAR3swO-eFrEbj1BUQfmomJazxxeFR6SPgr6gKayhs38Y7aBG-zX1G3XWYRM Paper] || ||<br />
|-</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat441F21&diff=42513stat441F212020-10-06T16:01:02Z<p>Hhalim: </p>
<hr />
<div><br />
<br />
== [[F20-STAT 441/841 CM 763-Proposal| Project Proposal ]] ==<br />
<br />
<!--[https://goo.gl/forms/apurag4dr9kSR76X2 Your feedback on presentations]--><br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/10CHiJpAylR6kB9QLqN7lZHN79D9YEEW6CDTH27eAhbQ/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 16 || || 1|| || || ||<br />
|-<br />
|Week of Nov 16 || || 2|| || || ||<br />
|-<br />
|Week of Nov 16 || || 3|| || || ||<br />
|-<br />
|Week of Nov 16 || || 4|| || || ||<br />
|-<br />
|Week of Nov 16 || || 5|| || || ||<br />
|-<br />
|Week of Nov 16 || || 6|| || || ||<br />
|-<br />
|Week of Nov 16 || || 7|| || || ||<br />
|-<br />
|Week of Nov 16 || || 8|| || || ||<br />
|-<br />
|Week of Nov 16 || || 9|| || || ||<br />
|-<br />
|Week of Nov 16 || || 10|| || || ||<br />
|-<br />
|Week of Nov 23 ||Jinjiang Lian, Jiawen Hou, Yisheng Zhu, Mingzhe Huang || 11|| DROCC: Deep Robust One-Class Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/6556-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 12|| || || ||<br />
|-<br />
|Week of Nov 23 || || 13|| || || ||<br />
|-<br />
|Week of Nov 23 || || 14|| || || ||<br />
|-<br />
|Week of Nov 23 || || 15|| || || ||<br />
|-<br />
|Week of Nov 23 || || 16|| || || ||<br />
|-<br />
|Week of Nov 23 ||Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty || 17|| Emergent Tool Use From Multi-Agent Autocurricula || [https://arxiv.org/pdf/1909.07528.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || || 18|| || || ||<br />
|-<br />
|Week of Nov 23 || || 19|| || || ||<br />
|-<br />
|Week of Nov 23 || || 20|| || || ||<br />
|-<br />
|Week of Nov 30 || || 21|| || || ||<br />
|-<br />
|Week of Nov 30 || || 22|| || || ||<br />
|-<br />
|Week of Nov 30 || || 23|| || || ||<br />
|-<br />
|Week of Nov 30 || || 24|| || || ||<br />
|-<br />
|Week of Nov 30 || Anas Mahdi || 25|| Loss Function Search for Face Recognition || [https://proceedings.icml.cc/static/paper_files/icml/2020/245-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || || 26|| || || ||<br />
|-<br />
|Week of Nov 30 || || 27|| || || ||<br />
|-<br />
|Week of Nov 30 || || 28|| || || ||<br />
|-<br />
|Week of Nov 30 || || 29|| || || ||<br />
|-<br />
|Week of Nov 30 || || 30|| || || ||<br />
|-</div>Hhalim