http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Special:NewPages&feed=atom&hideredirs=1&limit=50&offset=&namespace=0&username=&tagfilter=statwiki - New pages [US]2021-05-07T19:33:53ZFrom statwikiMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=This_Looks_Like_That:_Deep_Learning_for_Interpretable_Image_RecognitionThis Looks Like That: Deep Learning for Interpretable Image Recognition2020-12-06T00:21:53Z<p>A2chanan: /* Results */</p>
<hr />
<div>== Presented by ==<br />
Nouha Chatti <br />
<br />
== Introduction ==<br />
The motivation behind this paper is to introduce a new deep learning network architecture capable of reasoning in an interpretable manner during classification tasks. The goal of the algorithm is to utilize human-understandable reasoning to perform image classification tasks.<br />
<br />
The idea is to perform these tasks by defining a form of interpretability when processing the images. The method suggested in this paper consists of dissecting parts of the input images and comparing them to prototypical parts of training images of a given class, thus the expression "this looks like that". Interpretability is crucial in many problems that require understanding how a model makes a particular prediction. Neural networks are typically seen as some of the least interpretable models in machine learning [4]. Medical imaging is one instance where interpretability is critical, where diagnosis using X-ray scans is based on comparing to other prototypical scans. [1]<br />
<br />
== Previous Work ==<br />
Interpretability in deep networks has been a long-sought goal and is a growing field of research. There already exists post-hoc interpretability methods that analyze the performance of a trained CNN, such as [2], but this type of analysis does not explain the reasoning process of how a network actually makes its decisions during classification but are rather created after this phase. There are also attention-based models that determine parts of the input they are looking at but without associating them to prototypical samples [3].<br />
<br />
== Network Architecture ==<br />
The figure below represents ProtoPNet architecture. The first layers of this network consists of commonly used convolutional layers <math>f</math>, whose parameters are denoted <math>w_{conv}</math>. The layers used in this study are from the following known models '''VGG-16, VGG-19, ResNet-34, ResNet-152, DenseNet-121, and DenseNet-161''' previously pre-trained on ImageNet which are followed by two additional 1 × 1 convolutional layers. A layer called prototype <math>g_p</math> is a fully connected layer h with weight <math>w_h</math> and no bias that returns the output prediction using a softmax function, unlike all the rest of the layers that use ReLU as the activation function. This network takes in an image <math>x</math> which is propagated through the convolutional layers (<math>f</math> of shape <math>H x W x D</math>) where features are extracted and learns the prototypes P of shape (<math>1 x 1 x D</math>). The number of prototypes <math>m_k</math> is pre-defined for each class <math>k</math> (10 per class in this study). Each prototype will be used to represent a pattern in a patch of the conv output, corresponding to some prototypical image patch in the original pixel space. So given an output <math>z = f(x)</math>, the j-th prototype unit <math>g_{p_{j}}</math> in the prototype layer <math>g_p</math> computes the squared L2 distances between the j-th prototype <math>p_j</math> and all patches of <math>z</math> that have the same shape as <math>p_j</math> and returns the similarity scores. These score values indicate the presence of the prototypical part in the image, while preserving the spatial relation of <math>z</math>. It is possible to up-sample it to the original size in order to obtain a heatmap with different parts that are most similar to the compared prototypes. The scores given by each unit are produced using max-pooling to obtain a single score of how strong a prototypical pattern is present in the specific patch of the input, which are then multiplied by the weight matrix <math>w_h</math> in <math>h</math> to produce the output logits as shown in Figure 1.<br />
[[File:netarch.jpg|1200px|center]]<br />
<div align="center">Figure 1 : Prototypical Part Network Architecture</div><br />
<br />
== Training Algorithm ==<br />
The network is trained in three stages. First, stochastic gradient descent (SGD) of all but the last layers, followed by projection of the prototypes, and lastly convex optimization. In the initial stage the model identifies the most significant patches for the classification task and distinguishes between the prototypes of the images' true classes and those that are from different classes. SGD is used to optimize the parameters from the convolution layers and the prototypes of the prototype layer while fixing the weights of the fully connected layer in order to make the network learn to decrease the predicted probability when a part of an image of a given class is similar to a prototype from a different class. As for the second stage the aim is to visualize and associate each prototype with the most similar training image patch using the following update for every prototype of a class k:<br />
<math> P_j = \underset{z\ in Z_j}{\operatorname{arg\,min}} \lVert{z -p_j}\rVert_2 \quad\textrm{where}\quad Z_j = \{z:z \in \quad\textrm{patches} (f(x_i)) \forall i \quad\textrm{s.t}\quad y_i=k \} </math><br />
During this stage, associating a patch of the training image x to its corresponding prototype p is done as a result of the activation. The patch of x that is selected is the one that p activates the most given the activation map of x by p.<br />
In the last training stage, convex optimization is applied on the last layer while fixing parameters of previous layers, to improve accuracy by adding sparsity to the model. In other words, it prevents the model from classifying an image to a particular class because it does not have prototypes from other classes.<br />
The optimization problem that they try to solve is:<br />
[[File:CaptureDL.PNG|600px|center]]<br />
<br />
<br />
== Datasets ==<br />
The datasets that were used in this study are CUB-200-2011 representing images of 200 bird species as well as the Stanford Cars dataset with 196 car models. Data augmentation techniques were applied to enlarge both training datasets. The following are two examples of the classification task process of images from both datasets and the process of decision making.<br />
<br />
'''Examples of reasoning process:''' <br />
As it is shown in the figure below, given the testing image, the model first compares it to all learned prototypes (from all classes), looking to find proof to the image belonging to a certain class k by using the prototypes of class k. The comparison returns the similarity scores with each prototype pi and looks for the part of the image that is the most activated by pi. These scores are weighted and summed to correctly classify the testing image.<br />
[[File:exp1.jpg|1200px|center]]<br />
<div align="center">Figure 2 : Classifying an image of specific car model </div><br />
[[File:exp2.jpg|1200px|center]]<br />
<div align="center">Figure 3 : Predicting the specie of a bird </div><br />
<br />
== Results ==<br />
The results obtained using ProtoPNet on bird images as well as the car models are compared to the baseline models as well as attention-based deep models that were trained on the same datasets that ProtoPNet was trained on. ProtoPNet accuracy results are very close and as good as the non-interpretable baselines as shown in the tables below. <br />
[[File:table1protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 4 : Accuracy comparison of ProtoPNet with baseline models and other deep models on bird species dataset </div><br><br />
<br />
[[File:table2protoPNet.jpg|800px|center]]<br />
<div align="center">Figure 5 : Accuracy comparison of ProtoPNet with baseline models on car dataset </div><br><br />
<br />
Another experience of combining many protoPNet models shows an improvement of the accuracy while preserving the transparency of the decision making process. The paper implemented the model with similar architecture as ALL-CNN-V network and obtains a prediction rate 89.30% in cifar-10 dataset.<br />
<br />
== Conclusion ==<br />
The aim of constructing the ProtoPNet network was to introduce the interpretability property to neural networks. It is able to dissect images to find prototypical parts. The predictions of an image are made based on a comparison of parts of this image and learned prototypes of each class. One of the greatest advantages of ProtoPNet is that it allows the user to observe the process of how the model is making predictions and therefore understands the reasoning in case of misclassification errors. However, one disadvantage of this network is the addition of another hyperparameter in the form of the number of prototypes.<br />
<br />
== Critique ==<br />
I think that this is a really interesting approach to provide insights as to why a neural network made a certain prediction. Intuitively, based on the architecture, it seems that each convolutional layer learns a certain "aspect" of the image (ie. wheel of a car, the beak of the bird, etc). It would be interesting to see how much further one can take this idea, especially in classifying images of things that appear very similar to the human eye (i.e. various insects).<br />
<br />
== Source code ==<br />
The code for this paper is available at [https://github.com/cfchen-duke/ProtoPNet https://github.com/cfchen-duke/ProtoPNet]<br />
<br />
== Refrences == <br />
[1] C. Chen, O. Li, A. Barnett, J. Su, C. Rudin, This looks like that: deep<br />
learning for interpretable image recognition, arXiv preprint,<br />
arXiv:1806.10574, 2018.<br />
<br />
[1] A. Holt, I. Bichindaritz, R. Schmidt, and P. Perner. Medical applications in case-based reasoning. The<br />
Knowledge Engineering Review, 20:289–292, 09 2005.<br />
<br />
[2] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image<br />
Classification Models and Saliency Maps. In Workshop at the 2nd International Conference on Learning<br />
Representations (ICLR Workshop), 2014<br />
<br />
[3] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative<br />
Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),<br />
pages 2921–2929. IEEE, 2016<br />
<br />
[4] Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models Explainable", 2019. https://christophm.github.io/interpretable-ml-book/</div>N2chattihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_NetworksTime-series Generative Adversarial Networks2020-12-02T01:07:31Z<p>P2torabi: </p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling improving the temporal dynamics of data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but it is not very promising in a generative setup. The GAN approach when applied on time-series directly simply tries to learn <math>p(X|t)</math> using generator and discriminator setup but this fails to leverage the prior probabilities like in the case of the autoregressive models. GANs are being used to generate time-series data in different sectors such as healthcare, energy and finance. Researchers have conditioned them for various use cases, from handling irregular sampling <sup>[13]</sup> to building probabilistic forecasting models for univariate time-series.<sup>[14]</sup><br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches, unsupervised GANs and supervised autoregressive, that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs along with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (ground truth conditioned) and open-loop inference (the previous guess conditioned), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this including Scheduled Sampling <sup>[1]</sup>, based on curriculum learning <sup>[2]</sup>, where the models are trained to output based on a combination of ground truth and previous outputs. Another method inspired by adversarial domain adaptation is training an auxiliary discriminator that helps separate free-running and teacher-forced hidden states accelerating convergence<sup>[3][4]</sup>. Approach based on Actor-critic methods <sup>[5]</sup> have also been proposed that condition on target outputs estimating the next-token value that nudges the actor’s free-running predictions <sup>[11]</sup>. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN <sup>[6]</sup> try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics.<br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain constant over the entire time-series, or for a long period of time) and temporal features (variables that changes with respect to time). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, inputs to the model can be thought of as a tuple of <math>(S, X_{1:t})</math> that has a joint distribution <math>p</math>. The objective of a generative model is to learn from training data, an approximation of the original distribution <math>p(S, X)</math> i.e. <math>\hat{p}(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition of <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math> as well. This gives the following two objective functions.<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how the information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of <math>S</math> and <math>X</math> features in the original space. Therefore, the embedding function has the following form:<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form:<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions have been implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These implementation choices are of course subject to parametrization using any architecture.<br />
<br />
<div align="center"> [[File:Archs.PNG]] </div><br />
<br />
Figure 2 shows the TimeGAN architecture and where each of the loss terms are derived. It also shows the architectures of C-RNN-GAN and RCGAN, whose performance will be compared to TimeGAN in the results.<br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only gets the noise vector Z as input but it also gets in autoregressive fashion, its previous output i.e. <math>h_s</math> and <math>h_{1:t}</math> as input as well. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients when computed are used to decreasing the likelihood at the generator and increasing it at the discriminator to provide the correct classification of the produced synthetic output. This is the second objective function in the unsupervised loss form.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely on only the binary feedback from GANs adversarial component i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions into the fold. To ensure that the two segments of TimeGAN interact with each other, the generator is alternatively fed embeddings of actual data instead of its own previous synthetical produced embedding. Maximizing the likelihood of this produces the third objective i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the Supervised loss and Recovery loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following as the optimization problem for these two components:<br />
Formula. <div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda</math> >= 0 is used to regularize (or balance) the two losses. <br />
The other components of TimeGAN i.e. generator and discriminator are trained to minimize the Supervised loss along with Unsupervised loss. This optimization problem is formulated as below:<br />
Formula. <div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta >= 0 </math> is used to regularize the two losses.<br />
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (healthy distribution of sample covering real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should have the same predictive purposes as real data). The authors' results show that TimeGAN is insensitive to the two parameters <math>\lambda </math> and <math> \eta </math> so they decided to set <math> \lambda = 1 </math> and <math> \eta=10 </math> for all experiments. Compared to normal GAN, TimeGAN does not increase the difficulty in training.<br />
<br />
The following methods are used for benchmarking and evaluation:<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian data defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma 1 + (1-\sigma)I)</math>. Table 1 has the results of this experiment performed by different models. The results clearly show how TimeGAN outperforms other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper has experimented on different types of Time Series Data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to testify for its ability to generalize over time-series data. The paper uses multiple datasets in different areas like Sines, Stocks, Energy, and Events with different methods(TimeGAN, RCGANM CRANNGANM, and etc), and visualize their performance between original data and synthetic. <br />
<br />
===Sines===<br />
They simulated multivariate sinusoidal sequences of different frequencies η and phases θ, providing continuous-valued, periodic, multivariate data where each feature is independent of others.<br />
<br />
===Stocks===<br />
By contrast, sequences of stock prices are continuous-valued but aperiodic; furthermore, features are correlated with each other. They use the daily historical Google stocks data from 2004 to 2019, including as features the volume and high, low, opening, closing, and adjusted closing prices.<br />
<br />
===Energy===<br />
They consider a dataset characterized by noisy periodicity, higher dimensionality, and correlated features. The UCI Appliances energy prediction dataset consists of multivariate, continuous-valued measurements including numerous temporal features measured at close intervals.<br />
<br />
===Events===<br />
Finally, they considered a dataset characterized by discrete values and irregular time stamps. They used a large private lung cancer pathways dataset consisting of sequences of events and their times, and model both the one-hot encoded sequence of event types and the event timings.<br />
<br />
Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for different methods across different datasets. And TimeGAN outperforms other methods in both scores indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
We could use traditional models like ARMA, ARIMA, and etc to analyze time-series type data. Also, the longitudinal data could be predicted by the generated additive model, linear mixed affected model to do both the feature selection and independent variable predicting. The author of this paper introduced an advanced architecture to generate new time-series data that combines the flexibility of unsupervised learning, and the control of the quality of the generated data by supervised learning.<br />
<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce real-time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is truly a novel one. The idea of enhancing the unsupervised components of a GAN with some supervised element has shown significant jumps in certain evaluations. I think the methods of evaluation used in this paper namely, t-SNE/PCA analysis (visualization), discriminative score, and predictive score; are very appropriate for this sort of analysis where the focus is on multiple things (generative accuracy and conditional dependence) both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea of the synthesized time-series being useful in terms of its predictive ability is good, especially in practice. But I think when the authors set out to create a model that can learn the temporal dynamics between time-series data then there could have been some additional metric that could better evaluate if the underlying temporal relations have been captured by the model or not. I feel the addition of some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of traditional GAN by simply adding an extra loss function to the mix is quite elegant. TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of possibilities in this area as many direct and indirect variations of TimeGAN or other architectures inspired by TimeGAN can be developed in a very straightforward manner of hyper-parameterizing the building blocks. <br />
<br />
TimeGAN benefits from merging supervised and unsupervised learning to create their generations while other methods in the literature benefit from learning their conditional input to create its generations. I believe after even considering the supervised and unsupervised learning, the way that the authors introduced temporal embeddings to assist network training is not designed well for anomaly detection (outlier detection) as it is only designed for time series representation learning as discussed in [10].<br />
<br />
The paper certainly proposes a novel approach to analyzing time series data, but there are concerns about the way the model is tested in practice. First, if the data is generated from a <math>VAR(1)</math> model, why would the authors would not use a multi-dimensional auto-ARIMA procedure, or a Box-Jenkins approach, to fit a model to their synthetic dataset. Moreover, as has been studied in the M4 competitions (see e.g. https://www.sciencedirect.com/science/article/pii/S0169207019301128), the ability of complex ML models or deep learning models to beat linear models, in general, is questionable. The theoretical reason for this empirical finding is that the Wold decomposition theorem says that a stationary process can be decomposed into the sum of a deterministic process and linear process, which gives a lot of credence to the ARIMA model. It would be highly beneficial if the authors included the Box-Jenkins benchmark in their experiments as well as testing their model against real data to see if it actually performs well.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, Baoren Xiao. Conditional Sig-Wasserstein GANs for Time Series Generation, 2020<br />
<br />
[10] Geiger, Alexander et al. “TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks.” ArXiv abs/2009.07769 (2020): n. pag.<br />
<br />
[11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.<br />
<br />
[12] Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: 100,000 time series and 61 forecasting methods." International Journal of Forecasting 36.1 (2020): 54-74.<br />
<br />
[13] Ramponi, Giorgia and Protopapas, Pavlos and Brambilla, Marco and Janssen, Ryan. (2018) “T-cgan: Conditional generative adversarial network for data augmentation in noisy time series with irregular sampling.” arXiv preprint arXiv:1811.08295.<br />
<br />
[14] Koochali, A., Schichtel, P., Dengel, A., Ahmed, S.: Probabilistic forecasting of sensory data with generative adversarial networks forgan. IEEE Access 7, 63868–63880 (2019)</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=time-series-using-GANtime-series-using-GAN2020-12-02T01:02:53Z<p>G45sharm: Presented By</p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Evaluating_Machine_Accuracy_on_ImageNetEvaluating Machine Accuracy on ImageNet2020-11-30T05:23:58Z<p>J27ni: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Siyuan Xia, Jiaxiang Liu, Jiabao Dong, Yipeng Du<br />
<br />
== Introduction == <br />
ImageNet is one of the most influential dataset in machine learning with a million human annotated images and corresponding labels for over 1000 classes. This paper intends to explore the causes for performance differences between human experts and machine learning models, more specifically, CNN, on ImageNet. <br />
<br />
Firstly, some images could belong to multiple classes. As a result, it is possible to underestimate the performance if we assign each image with only one label, which is what is being done in the top-1 metric. On the other hand, the top-5 metric looks at the top five predictions by the model for an image and checks if the target label is within those five predictions (Krizhevsky, Sutskever, & Hinton). Therefore, we adopt both top-1 and top-5 metrics where the performances of models, unlike human labelers, are linearly correlated in both cases.<br />
<br />
Secondly, in contrast to the uniform performance of models in classes, humans tend to achieve better performances on inanimate objects. Human labelers achieve similar overall accuracies as the models, which indicates spaces of improvements on specific classes for machines.<br />
<br />
Lastly, the setup of drawing training and test sets from the same distribution may favor models over human labelers. That is, the accuracy of multi-class prediction from models drops when the testing set is drawn from a different distribution than the training set, ImageNetV2. But this shift in distribution does not cause a problem for human labelers.<br />
<br />
== Experiment Setup ==<br />
=== Overview ===<br />
There are four main phases to the experiment, which are (i) initial multilabel annotation, (ii) human labeler training, (iii) human labeler evaluation, and (iv) final annotation overview. The five authors of the paper are the participants in the experiments. <br />
<br />
A brief overview of the four phases is as follows:<br />
[[File:Experiment Set Up.png |800px| center]]<br />
<br />
=== Initial multi-label annotation ===<br />
Three labelers A, B, and C provided multi-label annotations for a subset of size 20,000 from the ImageNet validation set, and all 20,683 images from the ImageNetV2 test sets. These experiences give A, B, and C extensive experience with the ImageNet dataset.<br />
<br />
=== Human Labeler Training === <br />
All five labelers trained on labeling a subset of the remaining ImageNet images. "Training" the human labelers consisted of teaching the humans the distinctions between very similar classes in the training set. For example, there are 118 classes of "dog" within ImageNet and typical human participants will not have working knowledge of the names of each breed of dog seen even if they can recognize and distinguish that breed from others. Local members of the American Kennel Club were even contacted to help with dog breed classification. To do this labelers were trained on class-specific tasks for groups like dogs, insects, monkeys beaver and others. They were also given immediate feedback on whether they were correct and then were asked where they thought they needed more training to improve. Unlike the two annotators in (Russakovsky et al., 2015), who had insufficient training data, the labelers in this experiment had up to 100 training images per class while labeling. This allowed the labelers to really understand the finer details of each class.<br />
<br />
=== Human Labeler Evaluation ===<br />
Class-balanced random samples, which contain 1,000 images from the 20,000 annotated images are generated from both the ImageNet validation set and ImageNetV2. Five participants labeled these images over 28 days.<br />
<br />
=== Final annotation Review ===<br />
All labelers reviewed the additional annotations generated in the human labeler evaluation phase.<br />
<br />
== Multi-label annotations==<br />
[[File:Categories Multilabel.png|800px|center]]<br />
<div align="center">Figure 3</div><br />
<br />
===Top-1 accuracy===<br />
With Top-1 accuracy being the standard accuracy measure used in classification studies, it measures the proportions of examples for which the predicted label matches the single target label. As many images often contain more than one object for classification, for example, Figure 3a contains a desk, laptop, keyboard, space bar, and more. With Figure 3b showing a centered prominent figure yet labeled otherwise (people vs picket fence), it can be seen how a single target label is inaccurate for such a task since identifying the main objects in the image does not suffice due to its overly stringent and punishes predictions that are the main image yet does not match its label.<br />
===Top-5 accuracy===<br />
With Top-5 considers a classification correct if the object label is in the top 5 predicted labels. Although it partially resolves the problem with Top-1 labeling, it is still not ideal since it can trivialize class distinctions. For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations.<br />
<br />
===Multi-label accuracy===<br />
The paper then proposes that for every image, the image shall have a set of target labels and a prediction; if such prediction matches one of the labels, it will be considered as correct labeling. Due to the above-discussed limitations of Top-1 and Top-5 metrics, the paper claims it is necessary for rigorous accuracy evaluation on the dataset. <br />
<br />
===Types of Multi-label annotations===<br />
====Multiple objects or organisms====<br />
For the images containing more than one object or organism that corresponds to ImageNet, the paper proposed to add an additional target label for each entity in the image. With the discussed image in Figure 3b, the class groom, bow tie, suit, gown, and hoopskirt are all present in the foreground which is then subsequently added to the set of labels.<br />
====Synonym or subset relations====<br />
For similar classes, the paper considers them as under the same bigger class, that is, for two similarly labeled images, classification is considered correct if the produced label matches either one of the labels. For instance, warthog, African elephant, and Indian element all have prominent tusks, they will be considered subclasses of the tusker, Figure 3c shows a modification of labels to contain tusker as a correct label.<br />
====Unclear Image====<br />
In certain cases such as Figure 3d, there is a distinctive difficulty to determine whether a label was correct due to ambiguities in the class hierarchy, such as differentiating between a lakeshore or a seashore.<br />
<br />
===Collecting multi-label annotations===<br />
Participants reviewed all predictions made by the models on the dataset ImageNet and ImageNet-V2, the participants then categorized every unique prediction made by the models on the dataset into correct and incorrect labels in order to allow all images to have multiple correct labels to satisfy the above-listed method.<br />
===The multi-label accuracy metric===<br />
One prediction is only correct if and only if it was marked correct by the expert reviewers during the annotation stage. As discussed in the experiment setup section, after human labelers have completed labeling, a second annotation stage is conducted. In Figure 4, a comparison of Top-1, Top-5, and multi-label accuracies showed higher Top-1 and Top-5 accuracy corresponds with higher multi-label accuracy as expected. With multi-label accuracies measures consistently higher than Top-1 yet lower than Top-5 which shows a high correlation between the three metrics, the paper concludes that multi-label metrics measures a semantically more meaningful notion of accuracy compared to its counterparts.<br />
<br />
== Human Accuracy Measurement Process ==<br />
=== Bias Control ===<br />
Since three participants participated in the initial round of annotation, they did not look at the data for six months to forget about the information that they might have unintentionally momorized; also, two additional annotators are introduced in the final evaluation phase to ensure fairness of the experiment. <br />
<br />
=== Human Labeler Training ===<br />
The three main difficulties encountered during human labeler training are fine-grained distinctions, class unawareness, and insufficient training images. Thus, three training regimens are provided to address the problems listed above, respectively. First, labelers will be assigned extra training tasks with immediate feedbacks on similar classes. Second, labelers will be provided access to search for specific classes during labeling. Finally, the training set will contain a reasonable amount of images for each class. There are difficult class distinctions for humans especially with the fine-grained distinction. To help humans perform well on these class distinction, the training tasks only contained images from certain animal families.<br />
<br />
=== Labeling Guide ===<br />
A labeling guide is constructed to distill class analysis learned during training into discriminative traits that could be used as a reference during the final labeling evaluation.<br />
<br />
=== Final Evaluation and Review ===<br />
Two samples, each containing 1000 images, are sampled from ImageNet and ImageNetV2, respectively, They are sampled in a class-balanced manner and shuffled together. Over 28 days, all five participants labeled all images. They spent a median of 26 seconds per image. After labeling is completed, an additional multi-label annotation session was conducted, in which human predictions for all images are manually reviewed. Comparing to the initial round of labeling, 37% of the labels changes due to participants' greater familiarity with the classes.<br />
<br />
== Main Results ==<br />
[[File:Evaluating Machine Accuracy on ImageNet Figure 1.png | center]]<br />
<br />
<div align="center">Figure 1</div><br />
<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
From Figure 1, we can see that the difference in accuracies between the datasets is within 1% for all human participants. As hypothesized, human testers indeed performed better than the automated models on both datasets. It's worth noticing that labelers D and E, who did not participate in the initial annotation period, actually performed better than the best automated model.<br />
===Comparison of Human and Machine Accuracies on Image Net===<br />
Based on the results shown in Figure 1, we can see that the confidence interval of the best 4 human participants and 4 best model overlap; however, with a p-value of 0.037 using the McNemar's paired test, it rejects the hypothesis that the FixResNeXt model and Human E labeler have the same accuracy with respect to the ImageNet validation dataset. Figure 1 also shows that the confidence intervals of the labeling accuracies for human labelers C, D, E do not overlap with the confidence interval of the best model with respect to ImageNet-V2 and with the McNemar's test yielding a p-value of <math>2\times 10^{-4}</math>, it is clear that the hypothesis human and machined models have same robustness to model distribution shifts ought to be rejected.<br />
<br />
== Other Observations ==<br />
<br />
[[File: Results_Summary_Table.png| 800px|center]]<br />
<br />
=== Difficult Images ===<br />
<br />
The experiment also shed some light on images that are difficult to label. 10 images were misclassified by all of the human labelers. Among those 10 images, there was 1 image of a monkey and 9 of dogs. In addition, 27 images, with 19 in object classes and 8 in organism classes, were misclassified by all 72 machine learning models in this experiment. Only 2 images were labeled wrong by all human labelers and models. Both images contained dogs. Researchers also noted that difficult images for models are mostly images of objects and exclusively images of animals for human labelers.<br />
<br />
=== Accuracies without dogs ===<br />
<br />
As previously discussed in the paper, machine learning models tend to outperform human labelers when classifying the 118 dog classes. To better understand to what extent models outperform human labelers, researchers computed the accuracies again by excluding all the dog classes. Results showed a 0.6% increase in accuracy on the ImageNet images using the best model and a 1.1% increase on the ImageNet V2 images. In comparison, the mean increases in accuracy for human labelers are 1.9% and 1.8% on the ImageNet and ImageNet V2 images respectively. Researchers also conducted a simulation to demonstrate that the increase in human labeling accuracy on non-dog images is significant. This simulation was done by bootstrapping to estimate the changes in accuracy when only using data for the non-dog classes, and simulation results show smaller increases than in the experiment. <br />
<br />
In conclusion, it's more difficult for human labelers to classify images with dogs than it is for machine learning models.<br />
<br />
=== Accuracies on objects ===<br />
Researchers also computed machine and human labelers' accuracies on a subset of data with only objects, as opposed to organisms, to better illustrate the differences in performance. This test involved 590 object classes. As shown in the table above, there is a 3.3% and 3.4% increase in mean accuracies for human labelers on the ImageNet and ImageNet V2 images. In contrast, there is a 0.5% decrease in accuracy for the best model on both ImageNet and ImageNet V2. This indicates that human labelers are much better at classifying objects than these models are.<br />
<br />
=== Accuracies on fast images ===<br />
Unlike the CNN models, human labelers spent different amounts of time on different images, spanning from several seconds to 40 minutes. To further analyze the images that take human labelers less time to classify, researchers took a subset of images with median labeling time spent by human labelers of at most 60 seconds. These images were referred to as "fast images". There are 756 and 714 fast images from ImageNet and ImageNet V2 respectively, out of the total 2000 images used for evaluation. Accuracies of models and humans on the fast images increased significantly, especially for humans. <br />
<br />
This result suggests that human labelers know when an image is difficult to label and would spend more time on it. It also shows that the models are more likely to correctly label images that human labelers can label relatively quickly.<br />
<br />
== Related Work ==<br />
<br />
=== Human accuracy on ImageNet ===<br />
<br />
Russakovsky et al. (2015) studied two trained human labelers' accuracies on 1500 and 258 images in the context of the ImageNet challenge. The top-5 accuracy of the labeler who labeled 1500 images was the well-known human baseline on ImageNet. <br />
<br />
As introduced before, the researchers went beyond by using multi-label accuracy, using more labelers, and focusing on robustness to small distribution shifts. Although the researchers had some different findings, some results are also consistent with results from (Russakovsky et al., 2015). An example is that both experiments indicated that it takes human labelers around one minute to label an image. The time distribution also has a long tail, due to the difficult images as mentioned before.<br />
<br />
=== Human performance in computer vision broadly ===<br />
There are many examples of recent studies about humans in the area of computer vision, such as investigating human robustness to synthetic distribution change (Geirhos et al., 2017) and studying what characteristics do humans use to recognize objects (Geirhos et al., 2018). Other examples include the adversarial examples constructed to fool both machines and time-limited humans (Elsayed et al., 2018) and illustrating foreground/background objects' effects on human and machine performance (Zhu et al., 2016). <br />
<br />
=== Multi-label annotations ===<br />
Stock & Cissé (2017) also studied ImageNet's multi-label nature, which aligns with the researchers' study in this paper. According to Stock & Cissé (2017), the top-1 accuracy measure could underestimate multi-label by up to 13.2%. The authors suggest that releasing these labelled data to the public will allow for more robust models in the future.<br />
<br />
=== ImageNet inconsistencies and label error ===<br />
Researches have found and recorded some incorrectly labeled images from ImageNet and ImageNet V2 during this study. Earlier studies (Van Horn et al., 2015) also shown that at least 4% of the birds in ImageNet are misclassified. This work also noted that the inconsistent taxonomic structure in birds' classes could lead to weak class boundaries. Researchers also noted that the majority of the fine-grained organism classes also had similar taxonomic issues.<br />
<br />
=== Distribution shift ===<br />
There has been an increasing amount of studies in this area. One focus of the studies is distributionally robust optimization (DRO), which finds the model that has the smallest worst-case expected error over a set of probability distributions. Another focus is on finding the model with the lowest error rates on adversarial examples. Work in both areas has been productive, but none was shown to resolve the drop in accuracies between ImageNet and ImageNet V2. A recent [https://papers.nips.cc/paper/2019/file/8558cb408c1d76621371888657d2eb1d-Paper.pdf paper] also discusses quantifying uncertainty under a distribution shift, in other words whether the output of probabilistic deep learning models should or should not be trusted.<br />
<br />
== Conclusion and Future Work ==<br />
<br />
=== Conclusion ===<br />
Researchers noted that in order to achieve truly reliable machine learning, they need a deeper understanding of the range of parameters where the model still remain robust. Techniques from Combinatorics and sensitivity analysis, in particular, might yield fruitful results. This study has provided valuable insights into the desired robustness properties by comparing model performance to human performance. This is especially evident given the results of the experiment which show humans drastically outperforming machine learning in many cases and proposes the question of how much accuracy one is willing to give up in exchange for efficiency. The results have shown that current performance benchmarks are not addressing the robustness to small and natural distribution shifts, which are easily handled by humans.<br />
<br />
=== Future work ===<br />
Other than improving the robustness of models, researchers should consider investigating if less-trained human labelers can achieve a similar level of robustness to distributional shifts. Since the accuracy of untrained labeling machines may be lower in the system, this will further illustrate that human robustness is a more stable attribute than direct accuracy measurement. In addition, researchers can study the robustness to temporal changes, which is another form of natural distribution shift (Gu et al., 2019; Shankar et al., 2019). Also, Convolutional Neural Network can be a candidate to improve the accuracy of classifying images.<br />
<br />
== Critiques ==<br />
# The method of using human to classify Imagenet is fully circular, since the label of imagenet itself is originally annotated by human beings. In fact, the classification scheme itself is intrinsically human construction. It is not logical to test human performance with human performance. This circular contsruction completely violates scientific principles.<br />
# Table 1 simply showed a difference in ImageNet multi-label accuracy yet does not give an explicit reason as to why such a difference is present. Although the paper suggested the distribution shift has caused the difference, it does not give other factors to concretely explain why the distribution shift was the cause.<br />
# With the recommendation to future machine evaluations, the paper proposed to "Report performances on dogs, other animals, and inanimate objects separately.". Despite its intentions, it is narrowly specific and requires further generalization for it to be convincing. <br />
# With choosing human subjects as samplers, no further information was given as to how they are chosen nor there are any background information was given. As it is a classification problem involving many classes as specific to species, a biology student would give far more accurate results than a computer science student or a math student. To make this study more plausible, more human labelers should be sampled.<br />
# As explaining the importance of multi-label metrics using comparison to Top-5 metric, the turtle example falls within the overall similarity (simony) classification of the multi-label evaluation metric, as such, if the Top-5 evaluation suggests any one of the turtle species were selected, the algorithm is considered to produce a correct prediction which is the intention. The example does not convey the necessity of changing to the proposed metric over the Top-5 metric. <br />
# With the definition in the paper regarding multi-label metrics, it is hard to see why expanding the label set is different from a traditional Top-5 metric or rather necessary, ergo does not yield the claim which the proposed metric is necessary for rigorous accuracy evaluation on ImageNet.<br />
# When discussing the main results, the paper discusses the hypothesis on distribution shift having no effects on human and machine model accuracies; the presentation is poor at best with no clear centric to what they are trying to convey to how (in detail) they resulted in such claims.<br />
# In the experiment setup of the presentation, there are a lot of key terms without detailed description. For example, Human labeler training using a subset of the remaining 30,000 unannotated images in the ImageNet validation set, labelers A, B, C, D, and E underwent extensive training to understand the intricacies of fine-grained class distinctions in the ImageNet class hierarchy. Authors should clarify each key term in the presentation otherwise readers are hard to follow.<br />
# Not sure how the human samplers were determined and simply picking several people will have really high bias because the sample is too small and they have different background which will definitely affect the results a lot. Also, it will be better if there are more comparisons between the model introduced and other models.<br />
# Given the low amount of human participants, it is hard to take the results seriously (there is too much variance). Also it's not exactly clear how the authors determined that the multi-label accuracy metric measures a semantically more meaningful notion of accuracy compared to its counterparts. For example, one of the issues with top-5 accuracy that they mention is: "For instance, within the dataset, five turtle classes are given which is difficult to distinguish under such classification evaluations." But it's not clear how multi-label accuracy would be better in this instance.<br />
# It is unclear how well the human labeler can perform labeling after training. So the final result is not that trust-worthy.<br />
# In this experiment set up, label annotators are the same as participants of the experiments. Even if there's a break between the annotating and evaluating human labeler evaluation, the impact of the break in reducing bias is not clear. One potential human labeling data is google's "I'm not a robot" verification test. One variation of the verification test asks users to select all the photos from 9 images that are related to a certain keyword. This allows for a more accurate measurement of human performance vs ImageNet performance. In addition, it's going to reduce the biases from the small number of experiment participants.<br />
# Following Table 2, the authors appear to try and claim that the model is better than the human labelers, simply because the model experienced a better increase in classification following the removal of dog photos then the human labeler did, however, a quick look at the table shows that most human labelers still performed better than the best model. The authors should be making the claim that human labelers are better at labeling dogs than the modal, but are still better overall after removing the dogs dataset.<br />
# The reason why human labeler outperforms CNN could be human had much more training. It would be more convincing if the paper could provide a metric in order to measure human labelers' training data set size.<br />
# Actually, in the multi-label case, it is vague to determine whether the machine learning model or the human labellers were giving the correct label. The structure of the dataset is pretty essential in training a network, in which data with uncertain label (even determined by human) should be avoided.<br />
# The authors mentioned that untrained labelers will likely be in lower accuracy, they can give a standard or definition about a well-trained labeler.<br />
# I believe the authors needed to include more information about how they determined the samples such as human samplers, and also more details on how to define unclear images.<br />
# It would be more convincing if the author could provide the criteria of being human samplers and unclear images, and the accuracy of the human labeler.<br />
# The summary only explains some model components but does not thoroughly goes through the big picture of the model; data-preprocessing, training, and prediction procedures. It would be nice to know the details as well.<br />
# It seems the core problem is more about the dataset itself and not the evaluation procedure. We would not have issues with top 1 and top 5 if Imagenet contained discernable classes with good labels. Of course, this is very expensive, and imagenet is an _excellent_ dataset given these constraints. It does not seem like their proposed solution, multiple labels per image, addresses their concerns properly, as other critiques have already mentioned. Furthermore, having multiple labels per image does not translate to real-life value the same way that the top 5 or top 1 metric does, as in the common case, there is one right answer for a classification problem.<br />
# The paper could provide details on ways to improve the accuracy and robustness of the model. Since the paper mentions CNN, it could provide details of the model and why CNN is a good candidate.<br />
# The accuracy of the model is directly correlated with how the images are labelled. In all multi-label annotations, the authors describe a predicted label as correct if it is within a set of "correct labels" where each image has a different number of correct labels. Perhaps it would yield better results if the model were to first identify the number of objects in the image first and then by using some form of criteria, it labels those identified objects in order of importance (i.e. objects that are closer are labelled first). The authors also never specified what criteria the model uses to "pick out" which object it will label in the image.<br />
# The paper mentions difficult images and fast images. It would be better if the paper had generalized the type of images that constitute difficult images (i.e., the paper mentions 118 dog classes, what are some general characteristics of difficult images?) In addition, it would be interesting to compare the performance between human and machine accuracy on non-fast images.<br />
# The paper meaingfully and correctly points out the problem that the current evaluation of ML algorithm by only using the accuracy on Imagnet as the bechmark is simplistic and probelmatic. However, the idea of comparing human performance with ML models is problematic, since it is hard or even impossible to control the various variables that can drastically change human performance: training time, domain knowledge, cognitive function, amount of work-load, and various environmental factors. In order to compare different experimental methods, the most important step is to carefully control the confounding variables to reach any meaningful conclusion.<br />
# The original question would be how ImageNet was created? The images can only be labeled by human labelers. There might be some mistakes and missing details. The combination of human evaluation and machine evaluation might be helpful to resolve this problem.<br />
# Regarding the difficult problems, CatsandDogs is a very popular dataset that in this day it is trivial for a non-deep CNN to classify with high accuracies. The author would do well to test this CNN with other classes, since its inconceivable that a CNN cannot learn the characteristics of a dog with relation to the other classes in Imagenet. Why would it be able to differentiate the breeds of dogs, but not other organisms?<br />
# The model is limited by the representativeness of small samples and human participants. For the evaluation performance of the human group, the randomness of the sampling of participants should be confirmed first. Or the author need to formulate a rule to compare the accuracy of the human group and the machine group under this framework, to make sure the result is robustness.<br />
# This is rather late, but I just want to point out that the relationship between top-1 accuracy and multi-label accuracy(same with top-5 accuracy and multi-label accuracy) are not illustrated very clear. Also, with ImageNet test for all the models, it should matter how multi-label accuracy is highly correlated with top-1 and top-5. How they change with respect to others can be explained a little more.<br />
<br />
== Reference ==<br />
[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020). Evaluating Machine Accuracy on ImageNet. ICML 2020.<br />
<br />
[2] Krizhevsky, A., Sutskever, I., & Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. 2. Retrieved from http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf</div>J52donghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Being_Bayesian_about_Categorical_ProbabilityBeing Bayesian about Categorical Probability2020-11-30T00:36:54Z<p>J27ni: /* Conclusion and Critiques */</p>
<hr />
<div>== Presented By ==<br />
Evan Li, Jason Pu, Karam Abuaisha, Nicholas Vadivelu<br />
<br />
== Introduction ==<br />
<br />
Since the outputs of neural networks are not probabilities, Softmax (Bridle, 1990) is a staple for neural network’s performing classification -- it exponentiates each logit then normalizes by the sum, giving a distribution over the target classes. Logit is a raw output/prediction of the model which is hard for humans to interpret, thus we transform/normalize these raw values into categories or meaningful numbers for interpretability. However, networks with softmax outputs give no information about uncertainty (Blundell et al., 2015; Gal & Ghahramani, 2016), and the resulting distribution over classes is poorly calibrated (Guo et al., 2017), often giving overconfident predictions even when the classification is wrong. In addition, softmax also raises concerns about overfitting NNs due to its confident predictive behaviors (Xie et al., 2016; Pereyra et al., 2017). To achieve performance with better generalization, some more effective regularization techniques might be required. <br />
<br />
Bayesian Neural Networks (BNNs; MacKay, 1992) can alleviate these issues, but the resulting posteriors over the parameters are often intractable. Approximations such as variational inference (Graves, 2011; Blundell et al., 2015) and Monte Carlo Dropout (Gal & Ghahramani, 2016) can still be expensive or give poor estimates for the posteriors. This work proposes a Bayesian treatment of the output logits of the neural network, treating the targets as a categorical random variable instead of a fixed label. This technique gives a computationally cheap way of being Bayesian to get well-calibrated uncertainty estimates on neural network classifications.<br />
<br />
== Related Work ==<br />
<br />
Using Bayesian Neural Networks is the dominant way of applying Bayesian techniques to neural networks. A Bayesian neural network is a stochastic artificial neural network that is trained using Bayesian inference. Bayesian neural networks usually have better calibration than classical neural networks, which indicates that their predicted uncertainty is more consistent with the observed errors. Bayesian networks are data-efficient and can learn with small datasets without overfitting (Jospin, Buntine, Boussaid, Laga, & Bennamoun, 2020). Many techniques have been developed to make posterior approximation more accurate and scalable, despite these, BNNs do not scale to the state of the art techniques or large data sets. There are techniques to explicitly avoid modeling the full weight posterior that is more scalable, such as with Monte Carlo Dropout (Gal & Ghahramani, 2016) or tracking mean/covariance of the posterior during training (Mandt et al., 2017; Zhang et al., 2018; Maddox et al., 2019; Osawa et al., 2019). Non-Bayesian uncertainty estimation techniques such as deep ensembles (Lakshminarayanan et al., 2017) and temperature scaling (Guo et al., 2017; Neumann et al., 2018).<br />
<br />
== Preliminaries ==<br />
=== Definitions ===<br />
Let's formalize our classification problem and define some notations for the rest of this summary:<br />
<br />
::Dataset:<br />
$$ \mathcal D = \{(x_i,y_i)\} \in (\mathcal X \times \mathcal Y)^N $$<br />
::General classification model<br />
$$ f^W: \mathcal X \to \mathbb R^K $$<br />
::Softmax function: <br />
$$ \phi(x): \mathbb R^K \to [0,1]^K \;\;|\;\; \phi_k(X) = \frac{\exp(f_k^W(x))}{\sum_{k \in K} \exp(f_k^W(x))} $$<br />
::Softmax activated NN:<br />
$$ \phi \;\circ\; f^W: \chi \to \Delta^{K-1} $$<br />
::NN as a true classifier:<br />
$$ arg\max_i \;\circ\; \phi_i \;\circ\; f^W \;:\; \mathcal X \to \mathcal Y $$<br />
<br />
We'll also define the '''count function''' - a <math>K</math>-vector valued function that outputs the occurences of each class coincident with <math>x</math>:<br />
$$ c^{\mathcal D}(x) = \sum_{(x',y') \in \mathcal D} \mathbb y' I(x' = x) $$<br />
<br />
=== Classification With a Neural Network ===<br />
A typical loss function used in classification is cross-entropy, which is defined by<br />
<br />
$$ l_{\rm CE}(\tilde{y},\phi(f^{W}(x)))=-\sum_k \tilde{y_k} \log \phi_k(f^{W}(x))) $$<br />
<br />
,here <math>y_k</math> and <math>\phi_k</math> refers to the actual and predicted categorical distribution for each class. It's well known that optimizing <math>f^W</math> for <math>l_{CE}</math> is equivalent to optimizing for <math>l_{KL}</math>, the <math>KL</math> divergence between the true distribution and the distribution modeled by NN, that is:<br />
$$ l_{KL}(W) = KL(\text{true distribution} \;|\; \text{distribution encoded by }NN(W)) $$<br />
Let's introduce notations for the underlying (true) distributions of our problem. Let <math>(x_0,y_0) \sim (\mathcal X \times \mathcal Y)</math>:<br />
$$ \text{Full Distribution} = F(x,y) = P(x_0 = x,y_0 = y) $$<br />
$$ \text{Marginal Distribution} = P(x) = F(x_0 = x) $$<br />
$$ \text{Point Class Distribution} = P(y_0 = y \;|\; x_0 = x) = F_x(y) $$<br />
Then we have the following factorization:<br />
$$ F(x,y) = P(x,y) = P(y|x)P(x) = F_x(y)F(x) $$<br />
Substitute this into the definition of KL divergence:<br />
$$ = \sum_{(x,y) \in \mathcal X \times \mathcal Y} F(x,y) \log\left(\frac{F(x,y)}{\phi_y(f^W(x))}\right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F(y|x) \log\left( \frac{F(y|x)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) \sum_{y \in \mathcal Y} F_x(y) \log\left( \frac{F_x(y)}{\phi_y(f^W(x))} \right) $$<br />
$$ = \sum_{x \in \mathcal X} F(x) KL(F_x \;||\; \phi\left( f^W(x) \right)) $$<br />
As usual, we don't have an analytic form for <math>l</math> (if we did, this would imply we know <math>F_X</math> meaning we knew the distribution in the first place). Instead, estimate from <math>\mathcal D</math>:<br />
$$ F(x) \approx \hat F(x) = \frac{||c^{\mathcal D}(x)||_1}{N} $$<br />
$$ F_x(y) \approx \hat F_x(y) = \frac{c^{\mathcal D}(x)}{|| c^{\mathcal D}(x) ||_1}$$<br />
$$ \to l_{KL}(W) = \sum_{x \in \mathcal D} \frac{||c^{\mathcal D}(x)||_1}{N} KL \left( \frac{c^{\mathcal D}(x)}{||c^{\mathcal D}(x)||_1} \;||\; \phi(f^W(x)) \right) $$<br />
The approximations <math>\hat F, \hat F_X</math> are often not very good though: consider a typical classification such as MNIST, we would never expect two handwritten digits to produce the exact same image. Hence <math>c^{\mathcal D}(x)</math> is (almost) always going to have a single index 1 and the rest 0. This has implications for our approximations:<br />
$$ \hat F(x) \text{ is uniform for all } x \in \mathcal D $$<br />
$$ \hat F_x(y) \text{ is degenerate for all } x \in \mathcal D $$<br />
This clearly has implications for overfitting: to minimize the KL term in <math>l_{KL}(W)</math> we want <math>\phi(f^W(x))</math> to be very close to <math>\hat F_x(y)</math> at each point - this means that the loss function is in fact encouraging the neural network to output near degenerate distributions! <br />
<br />
'''Label Smoothing'''<br />
<br />
One form of regularization to help this problem is called label smoothing. Instead of using the degenerate $$F_x(y)$$ as a target function, let's "smooth" it (by adding a scaled uniform distribution to it) so it's no longer degenerate:<br />
$$ F'_x(y) = (1-\lambda)\hat F_x(y) + \frac \lambda K \vec 1 $$<br />
<br />
'''BNNs'''<br />
<br />
BBNs balances the complexity of the model and the distance to target distribution without choosing a single beset configuration (one-hot encoding). Specifically, BNNs with the Gaussian Weight prior $$F_x(y) = N (0,T^{-1} I)$$ has score of configuration <math>W</math> measured by the posterior density $$p_W(W|D) = p(D|W)p_W(W), \log(p_W(W)) = T||W||^2_2$$<br />
Here <math>||W||^2_2</math> could be a poor proxy to penalized for the model complexity due to its linear nature.<br />
<br />
== Method ==<br />
The main technical proposal of the paper is a Bayesian framework to estimate the (former) target distribution <math>F_x(y)</math>. That is, we construct a posterior distribution of <math> F_x(y) </math> and use that as our new target distribution. We call it the ''belief matching'' (BM) framework.<br />
<br />
=== Constructing Target Distribution ===<br />
Recall that <math>F_x(y)</math> is a k-categorical probability distribution - its PMF can be fully characterized by k numbers that sum to 1. Hence we can encode any such <math>F_x</math> as a point in <math>\Delta^{k-1}</math>. We'll do exactly that - let's call this vector <math>z</math>:<br />
$$ z \in \Delta^{k-1} $$<br />
$$ \text{prior} = p_{z|x}(z) $$<br />
$$ \text{conditional} = p_{y|z,x}(y) $$<br />
$$ \text{posterior} = p_{z|x,y}(z) $$<br />
Then if we perform inference:<br />
$$ p_{z|x,y}(z) \propto p_{z|x}(z)p_{y|z,x}(y) $$<br />
The distribution chosen to model prior was <math>dir_K(\beta)</math>:<br />
$$ p_{z|x}(z) = \frac{\Gamma(||\beta||_1)}{\prod_{k=1}^K \Gamma(\beta_k)} \prod_{k=1}^K z_k^{\beta_k - 1} $$<br />
Note that by definition of <math>z</math>: <math> p_{y|x,z} = z_y </math>. Since the Dirichlet is a conjugate prior to categorical distributions we have a convenient form for the mean of the posterior:<br />
$$ \bar{p_{z|x,y}}(z) = \frac{\beta + c^{\mathcal D}(x)}{||\beta + c^{\mathcal D}(x)||_1} \propto \beta + c^{\mathcal D}(x) $$<br />
This is in fact a generalization of (uniform) label smoothing (label smoothing is a special case where <math>\beta = \frac 1 K \vec{1} </math>).<br />
<br />
=== Representing Approximate Distribution ===<br />
Our new target distribution is <math>p_{z|x,y}(z)</math> (as opposed to <math>F_x(y)</math>). That is, we want to construct an interpretation of our neural network weights to construct a distribution with support in <math> \Delta^{K-1} </math> - the NN can then be trained so this encoded distribution closely approximates <math>p_{z|x,y}</math>. Let's denote the PMF of this encoded distribution <math>q_{z|x}^W</math>. This is how the BM framework defines it:<br />
$$ \alpha^W(x) := \exp(f^W(x)) $$<br />
$$ q_{z|x}^W(z) = \frac{\Gamma(||\alpha^W(x)||_1)}{\sum_{k=1}^K \Gamma(\alpha_k^W(x))} \prod_{k=1}^K z_{k}^{\alpha_k^W(x) - 1} $$<br />
$$ \to Z^W_x \sim dir(\alpha^W(x)) $$<br />
Apply <math>\log</math> then <math>\exp</math> to <math>q_{z|x}^W</math>:<br />
$$ q^W_{z|x}(z) \propto \exp \left( \sum_k (\alpha_k^W(x) \log(z_k)) - \sum_k \log(z_k) \right) $$<br />
$$ \propto -l_{CE}(\phi(f^W(x)),z) + \frac{K}{||\alpha^W(x)||}KL(\mathcal U_k \;||\; z) $$<br />
It can actually be shown that the mean of <math>Z_x^W</math> is identical to <math>\phi(f^W(x))</math> - in other words, if we output the mean of the encoded distribution of our neural network under the BM framework, it is theoretically identical to a traditional neural network.<br />
<br />
In the limit of <math> q^W_{z|x}(z) \rightarrow p_{z|x}(z)</math>, mean of the target posterior becomes a virtual label, for which individual z ought to match. Hence, the penalty for ambiguous configuration is determined by the number of observations. Therefore, the distribution matching in BM can be thought of as '''learning to score a categorical probability''' based on the closeness of the posterior mean, in which exploitation on the closeness of information is automatically controlled by the data.<br />
<br />
=== Distribution Matching ===<br />
<br />
We now need a way to fit our approximate distribution from our neural network <math>q_{\mathbf{z | x}}^{\mathbf{W}}</math> to our target distribution <math>p_{\mathbf{z|x},y}</math>. The authors achieve this by maximizing the evidence lower bound (ELBO):<br />
<br />
$$l_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) $$<br />
<br />
Each term can be computed analytically:<br />
<br />
$$\mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf W }} \left[\log z_y \right] = \psi(\alpha_y^{\mathbf W} ( \mathbf x )) - \psi(\alpha_0^{\mathbf W} ( \mathbf x )) $$<br />
<br />
Where <math>\psi(\cdot)</math> represents the digamma function (logarithmic derivative of gamma function). Intuitively, we maximize the probability of the correct label. For the KL term:<br />
<br />
$$KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; p_{\mathbf{z|x}}) = \log \frac{\Gamma(a_0^{\mathbf W}(\mathbf x)) \prod_k \Gamma(\beta_k)}{\prod_k \Gamma(\alpha_k^{\mathbf W}(x)) \Gamma (\beta_0)} + \sum_k (\alpha_k^{\mathbf W}(x)-\beta_k)(\psi(\alpha_k^{\mathbf W}(\mathbf x)) - \psi(\alpha_0^{\mathbf W}(\mathbf x)) $$<br />
<br />
In the first term, for intuition, we can ignore <math>\alpha_0</math> and <math>\beta_0</math> since those just calibrate the distributions. Otherwise, we want the ratio of the products to be as close to 1 as possible to minimize the KL. In the second term, we want to minimize the difference between each individual <math>\alpha_k</math> and <math>\beta_k</math>, scaled by the normalized output of the neural network. <br />
<br />
This loss function can be used as a drop-in replacement for the standard softmax cross-entropy, as it has an analytic form and the same time complexity as typical softmax-cross entropy with respect to the number of classes (<math>O(K)</math>).<br />
<br />
=== On Prior Distributions ===<br />
<br />
We must choose our concentration parameter, <math>\beta</math>, for our dirichlet prior. We see our prior essentially disappears as <math>\beta_0 \to 0</math> and becomes stronger as <math>\beta_0 \to \infty</math>. Thus, we want a small <math>\beta_0</math> so the posterior isn't dominated by the prior. But, the authors claim that a small <math>\beta_0</math> makes <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small, which causes <math>\psi (\alpha_0^{\mathbf W}(\mathbf x))</math> to be large, which is problematic for gradient based optimization. In practice, many neural network techniques aim to make <math>\mathbb E [f^{\mathbf W} (\mathbf x)] \approx \mathbf 0</math> and thus <math>\mathbb E [\alpha^{\mathbf W} (\mathbf x)] \approx \mathbf 1</math>, which means making <math>\alpha_0^{\mathbf W}(\mathbf x)</math> small can be counterproductive.<br />
<br />
So, the authors set <math>\beta = \mathbf 1</math> and introduce a new hyperparameter <math>\lambda</math> which is multiplied with the KL term in the ELBO:<br />
<br />
$$l^\lambda_{EB}(\mathbf y, \alpha^{\mathbf W}(\mathbf x)) = \mathbb E_{q_{\mathbf{z | x}}^{\mathbf{W}}} \left[\log p(\mathbf {y | x, z})\right] - \lambda KL (q_{\mathbf{z | x}}^{\mathbf W} \; || \; \mathcal P^D (\mathbf 1)) $$<br />
<br />
This stabilizes the optimization, as we can tell from the gradients:<br />
<br />
$$\frac{\partial l_{E B}\left(\mathbf{y}, \alpha^{\mathbf W}(\mathbf{x})\right)}{\partial \alpha_{k}^{\mathbf W}(\mathbf {x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\alpha_{k}^{\mathbf W}(\mathbf{x})-\beta_{k}\right)\right) \psi^{\prime}\left(\alpha_{k}^{\mathbf{W}}(\boldsymbol{x})\right)<br />
-\left(1-\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})-\beta_{0}\right)\right) \psi^{\prime}\left(\alpha_{0}^{\boldsymbol{W}}(\boldsymbol{x})\right)$$<br />
<br />
$$\frac{\partial l_{E B}^{\lambda}\left(\mathbf{y}, \alpha^{\mathbf{W}}(\mathbf{x})\right)}{\partial \alpha_{k}^{W}(\mathbf{x})}=\left(\tilde{\mathbf{y}}_{k}-\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})-\lambda\right)\right) \frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}<br />
-\left(1-\left(\tilde{\alpha}_{0}^{W}(\mathbf{x})-\lambda K\right)\right)$$<br />
<br />
As we can see, the first expression is affected by the magnitude of <math>\alpha^{\boldsymbol{W}}(\boldsymbol{x})</math>, whereas the second expression is not due to the <math>\frac{\psi^{\prime}\left(\tilde{\alpha}_{k}^{\mathbf W}(\mathbf{x})\right)}{\psi^{\prime}\left(\tilde{\alpha}_{0}^{\mathbf W}(\mathbf{x})\right)}</math> ratio.<br />
<br />
== Experiments ==<br />
<br />
Throughout the experiments in this paper, the authors employ various models based on residual connections (He et al., 2016 [1]) which are the models used for benchmarking in practice. We will first demonstrate improvements provided by BM, then we will show versatility in other applications. For fairness of comparisons, all configurations in the reference implementation will be fixed. The only additions in the experiments are initial learning rate warm-up and gradient clipping which are extremely helpful for stable training of BM. <br />
<br />
=== Generalization performance === <br />
The paper compares the generalization performance of BM with softmax and MC dropout on CIFAR-10 and CIFAR-100 benchmarks.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T1.png]]<br />
<br />
The next comparison was performed between BM and softmax on the ImageNet benchmark. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_T2.png]]<br />
<br />
For both datasets and In all configurations, BM achieves the best generalization and outperforms softmax and MC dropout.<br />
<br />
===== Regularization effect of prior =====<br />
<br />
In theory, BM has 2 regularization effects:<br />
The prior distribution, which smooths the target posterior<br />
Averaging all of the possible categorical probabilities to compute the distribution matching loss<br />
The authors perform an ablation study to examine the 2 effects separately - removing the KL term in the ELBO removes the effect of the prior distribution.<br />
For ResNet-50 on CIFAR-100 and CIFAR-10 the resulting test error rates were 24.69% and 5.68% respectively. <br />
<br />
This demonstrates that both regularization effects are significant since just having one of them improves the generalization performance compared to the softmax baseline, and having both improves the performance even more.<br />
<br />
===== Impact of <math>\beta</math> =====<br />
<br />
The effect of β on generalization performance is studied by training ResNet-18 on CIFAR-10 by tuning the value of β on its own, as well as jointly with λ. It was found that robust generalization performance is obtained for β ∈ [<math>e^{−1}, e^4</math>] when tuning β on its own; and β ∈ [<math>e^{−4}, e^{8}</math>] when tuning β jointly with λ. The figure below shows a plot of the error rate with varying β.<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F3.png]]<br />
<br />
=== Uncertainty Representation ===<br />
<br />
One of the big advantages of BM is the ability to represent uncertainty about the prediction. The authors evaluate the uncertainty representation on in-distribution (ID) and out-of-distribution (OOD) samples. <br />
<br />
===== ID uncertainty =====<br />
<br />
For ID (in-distribution) samples, calibration performance is measured, which is a measure of how well the model’s confidence matches its actual accuracy. This measure can be visualized using reliability plots and quantified using a metric called expected calibration error (ECE). ECE is calculated by grouping predictions into M groups based on their confidence score and then finding the absolute difference between the average accuracy and average confidence for each group. We can define the ECE of <math>f^W </math> on <math>D </math> with <math>M</math> groups as <br />
<br />
<center><br />
<math>ECE_M(f^W, D) = \sum^M_{i=1} \frac{|G_i|}{|D|}|acc(G_i) - conf(G_i)|</math><br />
</center><br />
Where <math>G_i</math> is a set of samples int the i-th group defined as <math>G_i = \{j:i/M < max_k\phi_k(f^Wx^{(j)}) \leq (1+i)/M\}</math>, <math>acc(G_i)</math> is an average accuracy in the i-th group and <math>conf(G_i)</math> is an average confidence in the i-th group.<br />
<br />
The figure below is a reliability plot of ResNet-50 on CIFAR-10 and CIFAR-100 with 15 groups. It shows that BM has a significantly better calibration performance than softmax since the confidence matches the accuracy more closely (this is also reflected in the lower ECE).<br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F4.png]]<br />
<br />
===== OOD uncertainty =====<br />
<br />
Here, the authors quantify uncertainty using predictive entropy - the larger the predictive entropy, the larger the uncertainty about a prediction. <br />
<br />
The figure below is a density plot of the predictive entropy of ResNet-50 on CIFAR-10. It shows that BM provides significantly better uncertainty estimation compared to other methods since BM is the only method that has a clear peak of high predictive entropy for OOD samples which should have high uncertainty. <br />
<br />
[[File:Being_Bayesian_about_Categorical_Probability_F5.png]]<br />
<br />
=== Transfer learning ===<br />
<br />
Belief matching applies the Bayesian principle outside the neural network, which means it can easily be applied to already trained models. Thus, belief matching can be employed in transfer learning scenarios. The authors downloaded the ImageNet pre-trained ResNet-50 weights and fine-tuned the weights of the last linear layer for 100 epochs using an Adam optimizer.<br />
<br />
This table shows the test error rates from transfer learning on CIFAR-10, Food-101, and Cars datasets. Belief matching consistently performs better than softmax. <br />
<br />
[[File:being_bayesian_about_categorical_probability_transfer_learning.png]]<br />
<br />
Belief matching was also tested for the predictive uncertainty for out of dataset samples based on CIFAR-10 as the in distribution sample. Looking at the figure below, it is observed that belief matching significantly improves the uncertainty representation of pre-trained models by only fine-tuning the last layer’s weights. Note that belief matching confidently predicts examples in Cars since CIFAR-10 contains the object category automobiles. In comparison, softmax produces confident predictions on all datasets. Thus, belief matching could also be used to enhance the uncertainty representation ability of pre-trained models without sacrificing their generalization performance.<br />
<br />
[[File: being_bayesian_about_categorical_probability_transfer_learning_uncertainty.png]]<br />
<br />
=== Semi-Supervised Learning ===<br />
<br />
Belief matching’s ability to allow neural networks to represent rich information in their predictions can be exploited to aid consistency based loss function for semi-supervised learning. Consistency-based loss functions use unlabelled samples to determine where to promote the robustness of predictions based on stochastic perturbations. This can be done by perturbing the inputs (which is the VAT model) or the networks (which is the pi-model). Both methods minimize the divergence between two categorical probabilities under some perturbations, thus belief matching can be used by the following replacements in the loss functions. The hope is that belief matching can provide better prediction consistencies using its Dirichlet distributions.<br />
<br />
[[File: being_bayesian_about_categorical_probability_semi_supervised_equation.png]]<br />
<br />
The results of training on ResNet28-2 with consistency based loss functions on CIFAR-10 are shown in this table. Belief matching does have lower classification error rates compared to using a softmax.<br />
<br />
[[File:being_bayesian_about_categorical_probability_semi_supervised_table.png]]<br />
<br />
== Conclusion and Critiques ==<br />
<br />
* Bayesian principles can be used to construct the target distribution by using the categorical probability as a random variable rather than a training label. This can be applied to neural network models by replacing only the softmax and cross-entropy loss while improving the generalization performance, uncertainty estimation and well-calibrated behavior. <br />
<br />
* In the future, the authors would like to allow for more expressive distributions in the belief matching framework, such as logistic normal distributions to capture strong semantic similarities among class labels. Furthermore, using input dependent priors would allow for interesting properties that would aid imbalanced datasets and multi-domain learning.<br />
<br />
* Overall I think this summary is very good. The Method(Algorithm) section is described clearly, and the Results section is detailed, with many diagrams illustrating the main points. I just have one technical suggestion: the difference in performance for SOFTMAX and BM differs by model. For example, for RESNEXT-50 model, the difference in top1 is 0.2, whereas, for the RESNEXT-100 model, the difference in the top one is 0.5, which is significantly higher. It's true that BM method generally outperforms SOFTMAX. But seeing the relation between the choice of model and the magnitude of a performance increase could definitely strengthen the paper even further.<br />
<br />
* The summary is good and the topic is interesting. Bayesian is a well know probabilistic model but did not know that it can be used as a neural network. Comparison between softmax and bayesian was interesting and more details would be great.<br />
<br />
* It would be better if there is a future work section to discuss the current shortage and potential improvement. One thing would be that the theoretical part is complex in the process. In addition, optimizing a function is relatively hard if the structure is complex. Is it possible to have a good approximation without having too complex a calculation?<br />
<br />
* Both experiments dealt with image data, however, softmax is used within classification neural networks that range from image to textual data. It would be interesting to see the performance of BM on textual data for text classification problems in addition to image classification.<br />
<br />
* It would be better to briefly explain Bayesian treatment in the introduction part(i.e., considering the categorical probability as a random variable, construct the target distribution by means of the Bayesian inference), and to analyze the importance of considering the categorical probability as a random variable (for example explain it can be adapted to existing deep learning building blocks without huge modifications).<br />
<br />
* Interesting topic that goes close to our lectures. Since this is a summary of the paper, it would be better if trim the explanation on Neural Network a little like getting rid of the substitution lines.<br />
<br />
* I really liked the presentation and actually really appreciate the steps of the detailed derivation that were presented in this summary. In the introduction the researchers mentioned that BM is a computationally cheap method, however, I was wondering how much faster it is computationally as opposed to the other models to train. Additionally, the training data that was used to benchmark the classification performance seemed to all be image classifications (CIFAR-10, CIFAR-100, ResNet-50, ResNet-101), thus it would have been nice to see classification be applied in other multi-class contexts as well to see how well this new method performs there.<br />
<br />
* It would be more clear if the metric used in evaluating the models is briefly explained!<br />
<br />
* The author's work by applying a Bayesian treatment of the output logits of the neural network to reach a more effective regularization technique is impressive. However, the author lacks a comparison of the calculation efficiency difference between the results. When dealing with large data sets, it is difficult to be convincing if there is no advantage in computational efficiency.<br />
<br />
* The paper was well written with few flaws, data was meticulously presented, yet with ensemble networks, it is important to notify the computational complexity increase that comes with training multiple models. It would also be nice to see their loss and entropy with regards to a standard training process (same epoch, train test split etc).<br />
<br />
* In the original paper it contains a figure 2 of Penultimate layer’s activations of examples belonging to one of three classes (beaver, dolphin, and otter; indexed by 0,1,2 in CIFAR-100). The figure vividly shows how successful the result is, so it would be better contained in the summary.<br />
<br />
== Citations ==<br />
<br />
[1] Bridle, J. S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pp. 227–236. Springer, 1990.<br />
<br />
[2] Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In International Conference on Machine Learning, 2015.<br />
<br />
[3] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.<br />
<br />
[4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, 2017. <br />
<br />
[5] MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448– 472, 1992.<br />
<br />
[6] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, 2011. <br />
<br />
[7] Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(1):4873–4907, 2017.<br />
<br />
[8] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. In International Conference of Machine Learning, 2018.<br />
<br />
[9] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[10] Osawa, K., Swaroop, S., Jain, A., Eschenhagen, R., Turner, R. E., Yokota, R., and Khan, M. E. Practical deep learning with Bayesian principles. In Advances in Neural Information Processing Systems, 2019.<br />
<br />
[11] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.<br />
<br />
[12] Neumann, L., Zisserman, A., and Vedaldi, A. Relaxed softmax: Efficient confidence auto-calibration for safe pedestrian detection. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2018.<br />
<br />
[13] Xie, L., Wang, J., Wei, Z., Wang, M., and Tian, Q. Disturblabel: Regularizing cnn on the loss layer. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.<br />
<br />
[14] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.<br />
<br />
[15] Jospin, L. V., Buntine, W. V., Boussaid, F. V., Laga, H. V., & Bennamoun, M. V. (2020). Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users. Association for Computing Machiner, 3-7. doi:arXiv:2007.06823</div>Nbvadivehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=RobertaRoberta2020-11-29T08:10:58Z<p>P2torabi: </p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the Natural Language Processing (NLP) domain like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements in a variety of NLP tasks. However, it is difficult to determine which parts of the methods contribute most to their success. This paper proposed Roberta, a model which replicates BERT pretraining, which investigates the effects of hyperparameters tuning and training set size. In summary, the main contributions of this paper are:<br />
<br />
(1) Introducing the alternatives in design choices and training schemes of BERT, leading to better downstream task performance.<br />
<br />
(2) Introducing a new dataset and validating the positive effect of using more data for pretraining on the performance of downstream tasks.<br />
<br />
These 2 modification categories improve performance on downstream tasks.<br />
<br />
== Background ==<br />
In this section, they gave an overview of BERT since they used this architecture in their model. The architecture of BERT is similar to the transformer encoder architecture, and the tuning processes contain an unsupervised feature-based approach and unsupervised fine-tuning approach. Very briefly, the transformer architecture defines attention over the embeddings in a layer such that the feedforward weights are a function of the embeddings at any given layer - ie are dynamic. BERT takes concatenated sequences of tokens as input. The two sequences are of length M (x1, x2,....xn) and N (y1, y2,....yn) where M + N < T (max length of input sequence during training). These sequences are concatenated along with special tokens [CLS] to specify the start of the new sequence, [SEP] specifying the separation of sequence, and [EOS] specifying the end of the input sequence. BERT uses transformer architecture[6] with two training objectives: they use masks language modeling (MLM) and next sentence prediction (NSP) as their objectives. The MLM objective randomly selects 80% of the tokens in the input sequence and replaces them with the special token [MASK] to prevent the model from cheating. MLM also replaces 10% of the remaining tokens with some random tokens from the vocabulary leaving the remaining ones unchanged. Then they try to predict these tokens based on the surrounding information. NSP employs a binary classification loss for predicting whether the two sentences are adjacent to each other. Positive sequences are selected by taking consecutive sentences from the corpus text. It is noteworthy that while selecting positive next sentences is trivial, generating negative ones is often much more difficult. Originally, they used a random next sentence as a negative example. They used Adam optimizer with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE, and showed their performance on those downstream tasks.<br />
<br />
== Experimental Setup ==<br />
=== Implementation ===<br />
They primarily follow the original BERT optimization hyperparameters, except for the peak learning rate and a number of warmup steps, which are tuned separately for each setting. They found the training to be very sensitive to the Adam epsilon term, and in some cases, they obtained better performance or improved stability after tuning it. They also set β2 = 0.98, a hyperparameter from the AdamW optimizer [10], to improve stability when training with large batch sizes.<br />
<br />
=== Data ===<br />
They consider five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text: BOOKCORPUS(Zhu et al., 2015) plus English WIKIPEDIA, which is the original data used to train BERT (16GB); CC-NEWS, which they collect from the English portion of the CommonCrawl News dataset (Nagel, 2016), containing 63 million English news articles crawled between September 2016 and February 2019 (76GB after filtering);OPEN-WEBTEXT(Gokaslan & Cohen, 2019), an open-source recreation of the WebText corpus described in Radford et al. (2019), containing web content extracted from URLs shared on Reddit with at least three upvotes (38GB);5(5) STORIES, a dataset introduced in Trinh & Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas (31GB).<br />
<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. The author compared four factors: different masks - pre-train methods, batch sizes, and tokenization methods - to optimize the RoBERTa model.<br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As mentioned in the previous section, the masked language modeling objective in BERT pre-training masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps. To extend this single static mask, the authors duplicated training data 10 times so that each sequence was masked in 10 different ways. The model was trained on those data for 40 epochs, with each sequence with the same mask being used 4 times.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
Next, they tried to investigate the necessity of the next sentence prediction objective. They tried different settings to show it would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document chosen at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: This is the same as the segment-pair representation but with pairs of sentences instead of segments. However, the total length of sequences here would be a lot less than 512. Hence, larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, with the difference that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. However, the authors chose to use FULL-SENTENCES for convenience sake, since the DOC-SENTENCES results in variable batch sizes.<br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes. The result suggested the original BERT batch size was too small. The authors used 8k batch size in the remainder of their experiments. <br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE. BPE is a hybrid between character and word level modeling based on sub-word units and repeating characters. The authors also used a vocabulary size of 50k rather than 30k as the original BERT implementation did, thus increasing the total parameters by approximately 15M to 20M for the BERT based and BERT large respectively. This change actually results in slight degradation of end-task performance in some cases, however, the authors preferred having the ability to universally encode text without introducing the "unknown" token.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all the aforementioned modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
In conjunction with the modifications of dynamic masking, full-sentences whiteout an NLP loss, large mini-batches, and a larger byte-level BPE.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
<div align="center">'''Table 5:''' RoBERTa's developement set results during pretraining over more data (160GB from 16GB) and longer duration (100K to 300K to 500K steps). Results are accumulated from each row. </div><br />
<br />
RoBERTa has outperformed state-of-the-art algorithms in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the below table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
[[File:CaptureRoberta.PNG|400px|center]]<br />
<br />
<br />
Table 3 presents the results of the experiments. RoBERTa offers major improvements over BERT (Large). Three additional datasets are added with the original dataset with the original step numbers (100K). In total, 160GB is used for pretraining. Finally, RoBERTa is pretrained for far more steps. Steps were increased from 100K to 300K and then to 500K. Improvements were observed across all downstream tasks. With the 500K steps, XLNet(large) is also outperformed across most tasks.<br />
<br />
== Conclusion ==<br />
The results confirmed that employing large batches over more data along with longer training time improves the performance. In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will achieve the same performances as RoBERTa.<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT. Another question that we should know about Roberta model or any language model can handle is, have the language models in general successfully acquired commonsense reasoning, or are we overestimating the true capabilities of machine commonsense?. The WINOGRANDE [9] is a large-scale dataset the contains 44k problems, inspired by Winograd Schema Challenge (WSC) design. The main key steps of constructing this dataset are the crowdsourcing procedure followed by systematic bias reduction that generalizes human-detectable word associations to machine-detectable embedding associations. I am curious to test this dataset on Roberta and BERT to see how much improvement they have done.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa]. The original repository for RoBERTa is in PyTorch. In case you are a TensorFlow user, you would be interested in [https://github.com/Yaozeng/roberta-tf PyTorch to TensorFlow] repository as well.<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In the North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In the North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
Suleiman Khan<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]<br />
<br />
[9] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale. ArXiv, abs/1907.10641.<br />
<br />
[10] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.</div>Dmalekihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_RepresentationsSelf-Supervised Learning of Pretext-Invariant Representations2020-11-28T20:13:29Z<p>P2torabi: </p>
<hr />
<div>==Authors==<br />
<br />
Ishan Misra, Laurens van der Maaten<br />
<br />
== Presented by == <br />
Sina Farsangi<br />
<br />
== Introduction == <br />
<br />
Modern image recognition and object detection systems find image representations by using a large number of data points with pre-defined semantic annotations. Examples of these annotations include class labels [1] and bounding boxes [2], as shown in Figure 1. There is a need for a large amount of labeled data, which is often very difficult to obtain. Also, these systems usually learn specific features for a particular type of class and not necessarily semantically meaningful features that can help generalize to other domains and classes. In other words, '''pre-defined semantic annotations scale poorly to the long tail of visual concepts'''[3]. Therefore, there has been a big interest in the community for learning image representations that are more visually meaningful and can help in a variety of tasks, such as image recognition and object detection. One of the fast-growing areas of research that tries to address this problem is '''self-supervised Learning'''. Self-Supervised Learning tries to learn meaningful semantics by just using the inputs themselves rather than using pre-defined semantic annotated data. As will show, the self-supervised learning paradigm removes the need for using human-provided class labels or bounding boxes for classification and object detection tasks, respectively. <br />
<br />
[[File: SSL_1.JPG | 800px | center]]<br />
<div align="center">'''Figure 1:''' Semantic Annotations used for finding image representations: a) Class labels and b) Bounding Boxes </div><br />
<br />
Self-Supervised Learning is often done using a set of tasks called '''pretext tasks'''. During these tasks, a transformation <math> \tau </math> is applied to unlabeled images <math> I </math> to obtain a set of transformed images, <math> I^{t} </math>. Then, a deep neural network, <math> \phi(\theta) </math>, is trained to predict some characteristic of the transformation from the transformed image. Several pretext tasks exist based on the type of transformation used. For example, if a neural network can accurately determine if an image is upside down or not, then perhaps it has learned some semantically meaningful representation of the image. This pre-empts the need for human-provided labels. Two of the most common pretext tasks used are rotations and jigsaw puzzle [4,5,6]. As shown in Figure 2, in the rotation task, unlabeled images, <math> </math> are rotated by random degrees (0,90,180,270) and the deep network learns to predict the rotation degree. The jigsaw task is more complicated than the rotation prediction task; first unlabeled images are cropped into 9 patches, then the image is perturbed by randomly permuting the nine patches. The unlabeled original image is referred to as the anchor data point (Figure 3-a), the reshuffled image that we get by permuting patches will be our positive sample (Figure 3-b) and the rest of the images in the dataset will be considered as negative samples. Each permutation falls into one of the 35 classes according to a formula given by the authors. A deep network is then trained to predict the class of the permutation of the patches in the perturbed image. Some other tasks include colorization, where the model tries to revert the colors of a colored image turned to grayscale, and image reconstruction where a square chunk of the image is deleted and the model tries to reconstruct that part. <br />
<br />
[[File: SSL_2.JPG |1000px | center]]<br />
<div align="center">'''Figure 2:''' Self-Supervised Learning using Rotation and Jigsaw Pretext Tasks </div><br />
<br />
[[File:figure3.jpg |600px | center]]<br />
<div align="center">'''Figure 3:''' Jigsaw puzzle used as a pretext task in unsupervised representation learning. (a) Original image (b) augmented image </div><br />
Although the proposed pretext tasks have achieved promising results, they have the disadvantage of being covariant to the applied transformation. In other words, as deep networks are trained to predict transformation characteristics, they will also learn representations that will vary based on the applied transformation. By intuition, we would like to obtain representations that are common between the original images and the transformed ones. This idea is supported by the fact that humans can recognize these transformed images. For example, a human can identify a permuted image of a tiger (as in figure 3) as a "permuted tiger" as well as the original image as a "tiger". Thus, the "tiger" aspect of the representations human learn is invariant to the transform, which cannot be taken for granted in standard self-supervision. The paper tries to address this problem by introducing '''Pretext Invariant Representation Learning''' (PIRL) that obtains representations which are transformation invariant and therefore more semantically meaningful. The performance of the proposed method is evaluated on several self-supervision learning benchmarks. The results show that the PIRL introduces a new state-of-the-art method in self-supervised Learning by learning transformation invariant representations.<br />
<br />
== Problem Formulation and Methodology ==<br />
<br />
[[File: SSL_3.JPG | 800px | center]]<br />
<div align="center">'''Figure 3:''' Overview of Standard Pretext Learning and Pretext-Invariant Representation Learning (PIRL). </div><br />
<br />
<br />
An overview of the proposed method and a comparison with Pretext Tasks are shown in Figure 3. For a given image , <math>I</math>, in the dataset of unlabeled images, <math> D=\{{I_1,I_2,...,I_{|D|}}\} </math>, a transformation <math> \tau </math> is applied: <br />
<br />
\begin{align} \tag{1} \label{eqn:1}<br />
I^t=\tau(I)<br />
\end{align}<br />
<br />
Where <math>I^t</math> is the transformed image. We would like to train a convolutional neural network, <math>\phi(\theta)</math>, that constructs image representations <math>v_{I}=\phi_{\theta}(I)</math>. Pretext Task based methods learn to predict transformation characteristics, <math>z(t)</math>, by minimizing a transformation covariant loss function in the form of:<br />
<br />
\begin{align} \tag{2} \label{eqn:2}<br />
l_{\text{cov}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,z(t))<br />
\end{align}<br />
<br />
As it can be seen, the loss function covaries with the applied transformation and therefore, the obtained representations may not be semantically meaningful. PIRL tries to solve for this problem as shown in Figure 3. The original and transformed images are passed through two parallel convolutional neural networks to obtain two sets of representations, <math>v(I)</math> and <math>v(I^t)</math>. Then, a contrastive loss function is defined to ensure that the representations of the original and transformed images are similar to each other. The transformation invariant loss function can be defined as:<br />
<br />
\begin{align} \tag{3} \label{eqn:3}<br />
l_{\text{inv}}(\theta,D)=\frac{1}{|D|} \sum_{I \in {D}}^{} L(v_I,v_{I^t})<br />
\end{align}<br />
<br />
Where L is a contrastive loss based on Noise Contrastive Estimators (NCE). The NCE function can be shown as below: <br />
<br />
\begin{align} \tag{4} \label{eqn:4}<br />
h(v_I,v_{I^t})=\frac{\exp \biggl( \frac{s(v_I,v_{I^t})}{\tau} \biggr)}{\exp \biggl(\frac{s(v_I,v_{I^t})}{\tau} \biggr) + \sum_{I^{'} \in D_N}^{} \exp \biggl( \frac{s(v_{I^t},v_{I^{'}})}{\tau} \biggr)}<br />
\end{align}<br />
<br />
where <math>s(.,.)</math> is the cosine similarity function and <math>\tau</math> is the temperature parameter that is usually set to 0.07. Also, a set of N images are chosen randomly from the dataset where <math>I^{'}\neq I</math>. These images are used in the loss in order to ensure their representation dissimilarity with transformed image representations. Also, during model implementation, two heads (few additional deep layers), <math>f</math> and <math>g</math>, are applied on top of <math>v(I)</math> and <math>v(I^t)</math>. Using the NCE formulation, the contrastive loss can be written as:<br />
<br />
\begin{align} \tag{5} \label{eqn:5}<br />
L_{\text{NCE}}(I,I^{t})=-\text{log}[h(f(v_I),g(v_{I^t}))]-\sum_{I^{'}\in D_N}^{} \text{log}[1-h(g(v_{I^t}),f(v_{I^{'}}))]<br />
\end{align}<br />
<br />
[[File: SSL_4.JPG | 800px | center]]<br />
<div align="center">'''Figure 4:''' Proposed PIRL </div><br />
<br />
Although the formulation looks complicated, the take-away here is that by minimizing the NCE based loss function, the similarity between the original and transformed image representations, <math>v(I)</math> and <math>v(I^t)</math>, increases and at the same time the dissimilarity between <math>v(I^t)</math> and negative images representations, <math>v(I^{'})</math>, is increased. According to the previous work, an infeasibly large batch size is needed to obtain a large number of negatives. To tackle this problem, a memory bank [9], <math>M</math>, is used during training which contains feature representation <math>m_I</math> for each image in the dataset including the negative images. The proposed PIRL model is shown in Figure 4. Finally, the contrastive loss in equation \eqref{eqn:5} does not take into account the dissimilarity between the original image representations, <math>v(I)</math>, and the negative image representations, <math>v(I^{'})</math>. By taking this into account and using the memory bank, the final contrastive loss function is obtained as:<br />
<br />
\begin{align} \tag{6} \label{eqn:6}<br />
L(I,I^{t})=\lambda L_{\text{NCE}}(m_I,g(v_{I^t})) + (1-\lambda)L_{\text{NCE}}(m_I,f(v_{I}))<br />
\end{align}<br />
where <math>\lambda</math> is a hyperparameter that determines the weight of each of NCE losses. The default value for this parameter is 0.5. In the next section, experimental results are shown using the proposed PIRL model.<br />
<br />
==Experimental Results ==<br />
<br />
For the experiments in this section, PIRL is implemented using jigsaw transformations. The combination of PIRL with other types of transformations is shown in the last section of the summary. The quality of image representations obtained from PIRL Self-Supervised Learning is evaluated by comparing its performance to other Self-Supervised Learning methods on image recognition and object detection tasks. For the experiments, a ResNet50 model is trained using PIRL and other methods by using 1.28M randomly sampled images from the ImageNet dataset. Also, the number of negative images used for PIRL is N=32000. <br />
<br />
===Object Detection===<br />
<br />
A Faster R-CNN model with a ResNet-50 backbone, pre-trained using PIRL and other Self-Supervised methods, is employed for the object detection task. Then, the pre-trained model weights are used as initial weights for the Faster-RCNN model backbone during training on the VOC07+12 dataset. The result of object detection using PIRL is shown in Figure (5) and it is compared to other methods. It can be seen that PIRL not only outperforms other Self-Supervised-based methods, '''for the first time it outperforms Supervised Pre-training on object detection'''. They emphasize that PIRL achieves this result using the same backbone model, the same number of finetuning epochs, and the exact same pre-training data (but without the labels). This result is a substantial improvement over prior self-supervised approaches that obtain worse performance than fully supervised baselines despite using orders of magnitude more curated training data or much larger backbone models. <br />
<br />
[[File: SSL_5.PNG | 800px | center]]<br />
<div align="center">'''Figure 5:''' Object detection on VOC07+12 using Faster R-CNN and comparing the Average Precision (AP) of detected bounding boxes. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
===Image Classification with linear models===<br />
<br />
In the next experiment, the performance of the PIRL is evaluated on image classification using four different datasets. For this experiment, the pre-trained ResNet-50 model is utilized as an image feature extractor. Then, a linear classifier is trained on fixed image representations. The results are shown in Figure (6). The results demonstrate that while PIRL substantially outperforms other Self-Supervised Learning methods, it still falls behind Supervised Pre-trained Learning. <br />
<br />
[[File: SSL_6.PNG | 800px | center]]<br />
<div align="center">'''Figure 6:''' Image classification with linear models. (The values for the blank spaces are not mentioned in the corresponding paper.) </div><br />
<br />
Overall, from Figure6, we can observe that PIRL has the best performance among different Self-Supervised Learning methods. Moreover, PIRL can even perform better than the Supervised Learning Pretrained model on object detection. This is because PIRL learns representations that are invariant to the applied transformations which results in more semantically meaningful and richer visual features. In the next section, some analysis on PIRL is presented.<br />
<br />
==Analysis==<br />
<br />
===Does PIRL learn invariant representations?===<br />
<br />
In order to show that the image representations obtained using PIRL are invariant, several images are chosen from the ImageNet dataset, and representations of the chosen images and their transformed version are obtained using one-time PIRL and another time the jigsaw pretext task which is the transformation covariant version of PIRL. Then, for each method, the L2 norm between the original and transformed image representations are computed and their distributions are plotted in Figure (7). It can be seen that PIRL results in more similarity between the original and transformed image representations. Therefore, PIRL learns invariant representations. <br />
<br />
[[File: SSL_7.PNG | 800px | center]]<br />
<div align="center">'''Figure 7:''' Invariance of PIRL representations. </div><br />
<br />
===Which layer produces the best representation?===<br />
Figure 12 studies the quality of representations in earlier layers of the convolutional networks. The figure reveals that the quality of Jigsaw representations improves from the conv1 to the res4 layer but that their quality sharply decreases in the res5 layer. By contrast, PIRL representations are invariant to image transformations and the best image representations are extracted from the res5 layer of PIRL-trained networks.<br />
<br />
[[File: Paper29_SSL.PNG | 800px | center]]<br />
<div align="center">'''Figure 12:'''Quality of PIRL representations per layer. </div><br />
<br />
===What is the effect of <math>\lambda</math> in the PIRL loss function?===<br />
<br />
In order to investigate the effect of <math>\lambda</math> on PIRL representations, the authors obtained the accuracy of image recognition on ImageNet dataset using different values for <math>\lambda</math> in PIRL. As shown in Figure 8, the results show that the value of <math>\lambda</math> affects the performance of PIRL and the optimum value for <math>\lambda</math> is 0.5. <br />
<br />
[[File: SSL_8.PNG | 800px | center]]<br />
<div align="center">'''Figure 8:''' Effect of varying the parameter <math>\lambda</math> </div><br />
<br />
===What is the effect of the number of images transforms?===<br />
<br />
As another experiment, the authors investigated the number of image transforms and their effect on PIRL performance. There is a limitation on the number of transformations that can be applied using the jigsaw pretext method as this method has to predict the permutation of the patches and the number of the parameters in the classification layer grows linearly with the number of used transformations. However, PIRL can use all number of image transformations which is equal to <math>9! \approx 3.6\times 10^5</math>. Figure (9) shows the effect of changing the number of patch permutations on PIRL and jigsaw. The results show that increasing the number of permutations increases the mean Average Precision (mAP) of PIRL on image classification using the VOCC07 dataset. <br />
<br />
[[File: SSL_9.PNG | 800px | center]]<br />
<div align="center">'''Figure 9:''' Effect of varying the number of patch permutations </div><br />
<br />
===What is the effect of the number of negative samples?===<br />
<br />
In order to investigate the effect of negative samples number, N, on PIRL's performance, the image classification accuracy is obtained using ImageNet dataset for a variety of values for N. As it is shown in Figure 10, increasing the number of negative sample results in richer image representations and higher classification accuracy. <br />
<br />
[[File: SSL_10.PNG | 800px | center]]<br />
<div align="center">'''Figure 10:''' Effect of varying the number of negative samples </div><br />
<br />
==Generalizing PIRL to Other Pretext Tasks==<br />
<br />
The used PIRL model in this paper used jigsaw permutations as the applied transformation to the original image. However, PIRL is generalizable to other Pretext Tasks. To show this, first, PIRL is used with rotation transformations and the performance of rotation-based PIRL is compared to the covariant rotation Pretext Task. The results in Figure (11) show that using PIRL substantially increases the classification accuracy on four datasets in comparison with the rotation Pretext Task. Next, both jigsaw and rotation transformations are used with PIRL to obtain image representations. The results show that combining multiple transformations with PIRL can further improve the accuracy of the image classification task. <br />
<br />
[[File: SSL_11.PNG | 800px | center]]<br />
<div align="center">'''Figure 11:''' Using PIRL with (combinations of) different pretext tasks </div><br />
<br />
==Conclusion==<br />
<br />
In this paper, a new state-of-the-art Self-Supervised learning method, PIRL, was presented. The proposed model learns to obtain features that are common between the original and transformed images, resulting in a set of transformation invariant and more semantically meaningful features. This is done by defining a contrastive loss function between the original images, transformed images, and a set of negative images. The results show that PIRL image representation is richer than previously proposed methods, resulting in higher accuracy and precision on image classification and object detection tasks.<br />
<br />
==Critiques==<br />
<br />
The paper proposes a very nice method for obtaining transformation invariant image representations. However, the authors can extend their work with a richer set of transformations. Also, it would be a good idea to investigate the combination of PIRL with clustering-based methods [7,8]. One of the clustering-based methods is '''DeepCluster''' [7], where each previous version of its representation is used by bootstrapping to produce a target for the next representation. They built a new representation through clustering data points using the prior representation and then classify the target by using the clustered index of each sample. That may result in better image representations. This will avoid the use of negative pairs but it might also cause collapsing to trivial solutions which create a trade-off. <br />
<br />
It could be better if they could visualize their network weights and compare them to the other supervised methods for the deeper layers that extract high-level information.<br />
<br />
== Source Code ==<br />
<br />
https://paperswithcode.com/paper/self-supervised-learning-of-pretext-invariant<br />
<br />
== References ==<br />
<br />
[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.<br />
<br />
[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015. <br />
<br />
[3] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint, 2017<br />
<br />
[4] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[5] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[6] Jong-Chyi Su, Subhransu Maji, Bharath Hariharan. When does self-supervision improve few-shot learning? European Conference on Computer Vision, 2020.<br />
<br />
[7] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.<br />
<br />
[8] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.<br />
<br />
[9] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.</div>Sfarsanghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Mask_RCNNMask RCNN2020-11-27T18:44:10Z<p>S564li: /* Visual Perception Tasks */</p>
<hr />
<div>== Presented by == <br />
Qing Guo, Xueguang Ma, James Ni, Yuanxin Wang<br />
<br />
== Introduction == <br />
Mask RCNN (Region Based Convolutional Neural Networks)[1] is a deep neural network architecture that aims to solve instance segmentation problems in computer vision which is important when attempting to identify different objects within the same image by identifying object's bounding box and classes.It combines elements from classical computer vision of object detection and semantic segmentation. RCNN base architectures first extract a regional proposal (a region of the image where the object of interest is proposed to lie) and then attempts to classify the object within it. <br />
Mask R-CNN extends Faster R-CNN [2] by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. This is done by using a Fully Convolutional Network as each mask branch in a pixel-by-pixel way. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. Mask R-CNN achieved top results in all three tracks of the COCO suite of challenges [3], including instance segmentation, bounding-box object detection, and person keypoint detection.<br />
<br />
== Visual Perception Tasks == <br />
<br />
Figure 1 shows a visual representation of different types of visual perception tasks:<br />
<br />
- Image Classification: Predict a set of labels to characterize the contents of an input image<br />
<br />
- Object Detection: Localizing objects in an image by building bounding boxes around the object, then proceed to classify them according to a set of classes. The current baseline system for object detection is Fast/Faster R-CNN.<br />
<br />
- Semantic Segmentation: Associate every pixel in an input image with a class label. The common baseline system for semantic segmentation is FCN (Fully Convolutional Network).<br />
<br />
- Instance Segmentation: Associate every pixel in an input image to a specific object. Instance segmentation combines object detection and semantic segmentation as a complex task, to obtain both correct detection and precise segmentation together.<br />
<br />
[[File:instance segmentation.png | center]]<br />
<div align="center">Figure 1: Visual Perception tasks</div><br />
<br />
<br />
Mask RCNN is a deep neural network architecture combining multiple state-of-art techniques for the task of Instance Segmentation.<br />
<br />
== Related Architecture to Mask RCNN == <br />
Region Proposal Network: A Region Proposal Network (RPN) proposes candidate object bounding boxes, which is the first step for effective object detection. It takes an image (of any size) as input and outputs a set of rectangular object boxes, each with an objectness score.<br />
<br />
ROI Pooling: The main use of ROI (Region of Interest) Pooling is to adjust the proposal to a uniform size. It’s better for the subsequent network to process. It maps the proposal to the corresponding position of the feature map, divide the mapped area into sections of the same size, and performs max pooling or average pooling operations on each section. ROI pooling solves the problem of fixed image size requirement for object detection network. The entire image feeds a CNN model to detect RoI on the feature maps.<br />
<br />
[[File:ROI pooling.png | center | 300px]]<br />
<br />
<br />
Faster R-CNN: Faster R-CNN consists of two stages: Region Proposal Network and Fast R-CNN using ROI Pooling. Region Proposal Network proposes candidate object bounding boxes. ROI Pooling, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference.<br />
<br />
[[File:FasterRCNN.png | center]]<br />
<div align="center">Figure 2: Faster RCNN architecture</div><br />
<br />
<br />
ResNet-FPN: FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. FPN is a general architecture that can be used in conjunction with various networks, such as VGG, ResNet, etc. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale. Other than FPN, the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask RCNN gives excellent gains in both accuracy and speed.<br />
<br />
[[File:ResNetFPN.png | center]]<br />
<div align="center">Figure 3: ResNetFPN architecture</div><br />
<br />
== Model Architecture == <br />
The structure of mask R-CNN is quite similar to the structure of faster R-CNN. <br />
Faster R-CNN has two stages, the RPN(Region Proposal Network) first proposes candidate object bounding boxes. Then RoIPool extracts the features from these boxes. After the features are extracted, these features data can be analyzed using classification and bounding-box regression. Mask R-CNN shares the identical first stage. But the second stage is adjusted to tackle the issue of simplifying the stages pipeline. Instead of only performing classification and bounding-box regression, it also outputs a binary mask for each RoI as <math>L=L_{cls}+L_{box}+L_{mask}</math>, where <math>L_{cls}</math>, <math>L_{box}</math>, <math>L_{mask}</math> represent the classification loss, bounding box loss and the average binary cross-entropy loss respectively.<br />
<br />
The important concept here is that, for most recent network systems, there's a certain order to follow when performing classification and regression, because classification depends on mask predictions. Mask R-CNN, on the other hand, applies bounding-box classification and regression in parallel, which effectively simplifies the multi-stage pipeline of the original R-CNN. And just for comparison, complete R-CNN pipeline stages involve 1. Make region proposals; 2. Feature extraction from region proposals; 3. SVM for object classification; 4. Bounding box regression. In conclusion, stages 3 and 4 are adjusted to simplify the network procedures.<br />
<br />
The system follows the multi-task loss, which by formula equals classification loss plus bounding-box loss plus the average binary cross-entropy loss.<br />
One thing worth noticing is that for other network systems, those masks across classes compete with each other. However, in this particular case with a <br />
per-pixel sigmoid and a binary loss the masks across classes no longer compete, it makes this formula the key for good instance segmentation results.<br />
<br />
=== RoIAlign ===<br />
<br />
This concept is useful in stage 2 where the RoIPool extracts features from bounding-boxes. For each RoI as input, there will be a mask and a feature map as output. The mask is obtained using the FCN(Fully Convolutional Network) and the feature map is obtained using the RoIPool. The mask helps with spatial layout, which is crucial to the pixel-to-pixel correspondence. <br />
<br />
The two things we desire along the procedure are: pixel-to-pixel correspondence; no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. Pixel-to-pixel correspondence makes sure that the input and output match in size. If there is a size difference, there will be information loss, and coordinates cannot be matched. <br />
<br />
RoIPool is standard for extracting a small feature map from each RoI. However, it performs quantization before subdividing into spatial bins which are further quantized. Quantization produces misalignments when it comes to predicting pixel accurate masks. Therefore, instead of quantization, the coordinates are computed using bilinear interpolation They use bilinear interpolation to get the exact values of the inputs features at the 4 RoI bins and aggregate the result (using max or average). These results are robust to the sampling location and number of points and to guarantee spatial correspondence.<br />
<br />
The network architectures utilized are called ResNet and ResNeXt. The depth can be either 50 or 101. ResNet-FPN(Feature Pyramid Network) is used for feature extraction. <br />
<br />
Some implementation details should be mentioned: first, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. It is important because the mask loss Lmask is defined only on positive RoIs. Second, image-centric training is used to rescale images so that pixel correspondence is achieved. An example complete structure is, the proposal number is 1000 for FPN, and then run the box prediction branch on these proposals. The mask branch is then applied to the highest scoring 100 detection boxes. The mask branch can predict K masks per RoI, but only the kth mask will be used, where k is the predicted class by the classification branch. The m-by-m floating-number mask output is then resized to the RoI size and binarized at a threshold of 0.5.<br />
<br />
== Results ==<br />
[[File:ExpInstanceSeg.png | center]]<br />
<div align="center">Figure 4: Instance Segmentation Experiments</div><br />
<br />
Instance Segmentation: Based on COCO dataset, Mask R-CNN outperforms all categories comparing to MNC and FCIS which are state of the art model <br />
<br />
[[File:BoundingBoxExp.png | center]]<br />
<div align="center">Figure 5: Bounding Box Detection Experiments</div><br />
<br />
Bounding Box Detection: Mask R-CNN outperforms the base variants of all previous state-of-the-art models, including the winner of the COCO 2016 Detection Challenge.<br />
<br />
== Ablation Experiments ==<br />
[[File:BackboneExp.png | center]]<br />
<div align="center">Figure 6: Backbone Architecture Experiments</div><br />
<br />
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet. <br />
<br />
[[File:MultiVSInde.png | center]]<br />
<div align="center">Figure 7: Multinomial vs. Independent Masks Experiments</div><br />
<br />
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via perclass binary masks (sigmoid) gives large gains over multinomial masks (softmax).<br />
<br />
[[File: RoIAlign.png | center]]<br />
<div align="center">Figure 8: RoIAlign Experiments 1</div><br />
<br />
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers. <br />
<br />
[[File: RoIAlignExp.png | center]]<br />
<div align="center">Figure 9: RoIAlign Experiments w Experiments</div><br />
<br />
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features, resulting in big accuracy gaps.<br />
<br />
[[File:MaskBranchExp.png | center]]<br />
<div align="center">Figure 10: Mask Branch Experiments</div><br />
<br />
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.<br />
<br />
== Human Pose Estimation ==<br />
Mask RCNN can be extended to human pose estimation.<br />
<br />
The simple approach the paper presents is to model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types such as left shoulder, right elbow. The model has minimal knowledge of human pose and this example illustrates the generality of the model. Mask R-CNN does this by viewing each keypoint as a one-hot binary mask and applying minimal modification to detect instance-specific poses.<br />
<br />
[[File:HumanPose.png | center]]<br />
<div align="center">Figure 11: Keypoint Detection Results</div><br />
<br />
== Experiments on Cityscapes ==<br />
The model was also tested on Cityscapes dataset. From this dataset the authors used 2975 annotated images for training, 500 for validation, and 1525 for testing. The instance segmentation task involved eight categories: person, rider, car, truck, bus, train, motorcycle and bicycle. When the Mask R-CNN model was applied to the data it achieved 26.2 AP on the testing data which was an over 30% improvement on the previous best entry. <br />
<br />
<center><br />
[[ File:cityscapeDataset.png ]]<br />
<br />
<br />
Figure 12. Cityscapes Results<br />
</center><br />
<br />
== Conclusion ==<br />
Mask RCNN is a deep neural network aimed to solve the instance segmentation problems in machine learning or computer vision. Mask R-CNN is a conceptually simple, flexible, and general framework for object instance segmentation. It can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It does object detection and instance segmentation, and can also be extended to human pose estimation.<br />
It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.<br />
<br />
== Critiques ==<br />
In Faster RCNN, the ROI boundary is quantized. However, mask RCNN avoids quantization and used the bilinear interpolation to compute exact values of features. By solving the misalignments due to quantization, the number and location of sampling points have no impact on the result.<br />
<br />
It may be better to compare the proposed model with other NN models or even non-NN methods like spectral clustering. Also, the applications can be further discussed like geometric mesh processing and motion analysis.<br />
<br />
The paper lacks the comparisons of different methods and Mask RNN on unlabeled data, as the paper only briefly mentioned that the authors found out that Mask R_CNN can benefit from extra data, even if the data is unlabelled.<br />
<br />
The Mask RCNN has many practical applications as well. A particular example, where Mask RCNNs are applied would be in autonomous vehicles. Namely, it would be able to help with isolating pedestrians, other vehicles, lights, etc.<br />
<br />
The Mask RCNN could be a candidate model to do short-term predictions on the physical behaviors of a person, which could be very useful at crime scenes.<br />
<br />
For the most part, instance segmentation is now quite achievable, and it’s time to start thinking about innovative ways of using this idea of doing computer vision algorithms at a pixel by pixel level such as the DensePose algorithm. <br />
<br />
An interesting application of Mask RCNN would be on face recognition from CCTVs. Flurry pictures of crowded people could be obtained from CCTV, so that mask RCNN can be applied to distinguish each person.<br />
<br />
The main problem for CNN architectures like Mask RCNN is the running time. Due to slow running times, Single Shot Detector algorithms are preferred for applications like video or live stream detections, where a faster running time would mean a better response to changes in frames. It would be beneficial to have a graphical representation of the Mask RCNN running times against single shot detector algorithms such as YOLOv3.<br />
<br />
It is interesting to investigate a solution of embedding instance segmentation with semantic segmentation to improve time performance. Because in many situations, knowing the exact boundary of an object is not necessary.<br />
<br />
<br />
It will be better if we can have more comparisons with other models. It will also be nice if we can have more details about why Mask RCNN can perform better, and how about the efficiency of it?<br />
The authors mentioned that Mask R-CNN is a deep neural network architecture for Instance Segmentation. It's better to include more background information about this task. For example, challenges of this task (e.g. the model will need to take into account the overlapping of objects) and limitations of existing methods.<br />
<br />
It would be interesting to see how a postprocessing step with conditional random fields (CRF) might improve (or not?) segmentation. It would also have been interesting to see the performance of the method with lighter backbones since the backbones used to have a very large inference time which makes them unsuitable for many applications.<br />
<br />
An extension of the application of Mask RCNN in medical AI is to highlight areas of an MRI scan that correlate to certain behavioral/psychological patterns.<br />
<br />
The use of these in medical imaging systems seems rather useful, but it can also be extended to more general CCTV camera systems which can also detect physiological patterns.<br />
<br />
In the Human Pose Estimation section, we assume that Mask RCNN does not have any knowledge of human poses, and all the predictions are based on keypoints on human bodies, for example, left shoulder and right elbow. While in fact we may be able to achieve better performances here because currently this approach is strongly dependent on correct classifications of human body parts. That is, if the model messed up the position of left shoulder, the position estimation will be awful. It is important to remove the dependency on preceding predictions, so that even when previous steps fail, we may still expect a fair performance.<br />
<br />
It will be interesting to see if applying dropout can boost this Mask RCNN architecture's performance.<br />
<br />
It will be interesting if mask RCNN is applied to human faces and how it classify each individual also would be nice to see how the technical calculations such as classification and predictions are done.<br />
<br />
It would be interesting to know how the RCNN model will perform on unbalanced data and how the performance compares with other models in this circumstances.<br />
<br />
The authors omitted the details of the training and the computational cost of training the model. Since RCNN combines stages 3 and 4 (SVM to categorize and bounding box regression), how does this affect the computational cost of the model? Similar architectures to the RCNN have long training times so it is of interest to know the computational runtime of this model in comparison to other models.<br />
<br />
It's amazing what these researchers were able to achieve with adding minimal overhead, and how well it generalizes using two completely different datasets. For the future work it would be nice to see if the model is able to also predict the distance between the objects that overlap in an image, without adding any further significant overhead. <br />
<br />
Additionally, it would be nice to see how well the model is able to predict collision detection between the objects given that it is currently at 5 frames-per-second (which is still really impressive, it just would be interesting to see how much would be possible)<br />
<br />
It would be more helpful if metrics of measuring classification errors are introduced briefly. Otherwise, it can be confusing.<br />
<br />
For the Mask R-CNN model can be used for Human Pose Estimation, can it be further trained to adapt the technology to dangerous workplace training simulations? For example, if a window cleaning worker in training, the model can be used to estimate its posture with its respective data sent to another pipeline for estimating danger levels or correctness of a worker's posture. Secondly, since it can estimate a human's posture, can it be used to estimate on-site approaching live traffic to aid autonomous driving? Also, even though we are seeing it built from ResNet variants and backbones, why not provide an actual visualized comparison between the original ResNet and Mask RCNN to show improvements (tables were given, but what about a graph to show the training process accuracy and loss)?<br />
<br />
== Interesting Directions ==<br />
<br />
There is recent work on ResNeSt: Split-Attention Networks (https://arxiv.org/abs/2004.08955), which uses an explicit soft attention mechanism over channels within a ResNeXt style architecture which shows improvements to classification. It would be interesting to use this backbone with Mask R-CNN and see if the attention helps capture longer range dependencies and thus produce better segmentations.<br />
<br />
== References ==<br />
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask R-CNN. arXiv:1703.06870, 2017.<br />
<br />
[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, arXiv:1506.01497, 2015.<br />
<br />
[3] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312, 2015</div>X93mahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectorsImproving neural networks by preventing co-adaption of feature detectors2020-11-27T02:41:52Z<p>J27ni: /* Critiques */</p>
<hr />
<div>== Presented by ==<br />
Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim<br />
<br />
= Improvement Intro =<br />
'''Drop Out Model'''<br />
<br />
In this paper, Hinton et al. introduce a novel way to improve neural networks’ performance, particularly in the case that a large feedforward neural network is trained on a small training set, which causes poor performance and leads to an “overfitting” problem. This problem can be reduced by randomly omitting half of the feature detectors on each training case. In fact, By omitting neurons in hidden layers with a probability of 0.5, each hidden unit is prevented from relying on other hidden units being present during training. Hence there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to train many separate networks and average their predictions on the test set. <br />
<br />
The intuition for dropout is that if neurons are randomly dropped during training, they can no longer rely on their neighbours, thus allowing each neuron to become more robust. Another interpretation is that dropout is similar to training an ensemble of models since each epoch with randomly dropped neurons can be viewed as its own model. <br />
<br />
They used the standard, stochastic gradient descent algorithm and separated training data into mini-batches. An upper bound was set on the L2 norm of incoming weight vector for each hidden neuron, which was normalized if its size exceeds the bound. They found that using a constraint, instead of a penalty, forced the model to do a more thorough search of the weight-space, when coupled with the very large learning rate that decays during training.<br />
<br />
'''Mean Network'''<br />
<br />
Their dropout models included all of the hidden neurons, and their outgoing weights were halved to account for the chances of omission. This is called a 'Mean Network'. This is similar to taking the geometric mean of the probability distribution predicted by all <math>2^N</math> networks. Due to this cumulative addition, the correct answers will have a higher log probability than an individual dropout network, which also leads to a lower squared error of the network. <br />
<br />
'''Hidden Markov Models'''<br />
<br />
The Hidden Markov models are used to deal with temporal variability and the models require an acoustic model to determine how well acoustic inputs are fitted to each state of the model. With a neural network with 4 fully-connected hidden layers of 4000 units per layer and 185 "softmax" outputs units that subsequently merge into 39 distinct classes that present in the database. When a dropout rate of 50% is applied, the results improved from 22.5% to 19.7% which sets the record for a model that does not use speaker identity to correct acoustic processing.<br />
<br />
The models were shown to result in lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume; CIFAR-10; and ImageNet.<br />
<br />
= MNIST =<br />
The MNIST dataset contains 70,000 digit images of size 28 x 28. To see the impact of dropout, they used 4 different neural networks (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), with the same dropout rates, 50%, for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy objective function as the loss function. Weights were updated after each minibatch, and training was done for 3000 epochs. An exponentially decaying learning rate <math>\epsilon</math> was used, with the initial value set as 10.0, and it was multiplied by the decaying factor <math>f</math> = 0.998 at the end of each epoch. At each hidden layer, the incoming weight vector for each hidden neuron was set an upper bound of its length, <math>l</math>, and they found from cross-validation that the results were the best when <math>l</math> = 15. Initial weights values were pooled from a normal distribution with mean 0 and standard deviation of 0.01. To update weights, an additional variable, ''p'', called momentum, was used to accelerate learning. The initial value of <math>p</math> was 0.5, and it increased linearly to the final value 0.99 during the first 500 epochs, remaining unchanged after. Also, when updating weights, the learning rate was multiplied by <math>1 – p</math>. <math>L</math> denotes the gradient of loss function.<br />
<br />
[[File:weights_mnist2.png|center|400px]]<br />
<br />
The best-published result for a standard feedforward neural network was 160 errors. This was reduced to about 130 errors with 0.5 dropout and different L2 constraints for each hidden unit input weight. By omitting a random 20% of the input pixels in addition to the aforementioned changes, the number of errors was further reduced to 110. The following figure visualizes the result.<br />
[[File:mnist_figure.png|center|500px]]<br />
A publicly available pre-trained deep belief net resulted in 118 errors, and it was reduced to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine, and it resulted in 103, 97, 94, 93 and 88 when the model was fine-tuned using standard backpropagation and was unrolled. They were reduced to 83, 79, 78, 78, and 77 when the model was fine-tuned with dropout – the mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.<br />
<br />
= TIMIT = <br />
<br />
TIMIT dataset includes voice samples of 630 American English speakers varying across 8 different dialects. It is often used to evaluate the performance of automatic speech recognition systems. Using Kaldi, the dataset was pre-processed to extract input features in the form of log filter bank responses.<br />
<br />
=== Pre-training and Training ===<br />
<br />
For pretraining, they pre-trained their neural network with a deep belief network and the first layer was built using Restricted Boltzmann Machine (RBM) which is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. Initializing visible biases with zero, weights were sampled from random numbers that followed normal distribution <math>N(0, 0.01)</math>. Each visible neuron’s variance was set to 1.0 and remained unchanged.<br />
<br />
Minimizing Contrastive Divergence (CD) was used to facilitate learning. Since momentum is used to speed up learning, it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. The average gradient had 0.001 of a learning rate which was then multiplied by <math>(1-momentum)</math> and L2 weight decay was set to 0.001. After setting up the hyperparameters, the model was done training after 100 epochs. Binary RBMs were used for training all subsequent layers with a learning rate of 0.01. Then, <math>p</math> was set as the mean activation of a neuron in the data set and the visible bias of each neuron was initialized to <math>log(p/(1 − p))</math>. Training each layer with 50 epochs, all remaining hyper-parameters were the same as those for the Gaussian RBM.<br />
<br />
=== Dropout tuning ===<br />
<br />
The initial weights were set in a neural network from the pre-trained RBMs. To finetune the network with dropout-backpropagation, momentum was initially set to 0.5 and increased linearly up to 0.9 over 10 epochs. The model had a small constant learning rate of 1.0 and it was used to apply to the average gradient on a minibatch. The model also retained all other hyperparameters the same as the model from MNIST dropout finetuning. The model required approximately 200 epochs to converge. For comparison purposes, they also finetuned the same network with standard backpropagation with a learning rate of 0.1 with the same hyperparameters.<br />
<br />
=== Classification Test and Performance ===<br />
<br />
A Neural network was constructed to output the classification error rate on the test set of TIMIT dataset. They have built the neural network with four fully-connected hidden layers with 4000 neurons per layer. The output layer distinguishes distinct classes from 185 softmax output neurons that are merged into 39 classes. After constructing the neural network, 21 adjacent frames with an advance of 10ms per frame was given as an input.<br />
<br />
Comparing the performance of dropout with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. Results showed that it significantly controls overfitting, making the method robust to choices of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Thus, neural network architectures with dropout are not very sensitive to the choice of learning rate and momentum.<br />
<br />
= Reuters Corpus Volume =<br />
Reuters Corpus Volume I archives 804,414 news documents that belong to 103 topics. Under four major themes - corporate/industrial, economics, government/social, and markets – they belonged to 63 classes. After removing 11 classes with no data and one class with insufficient data, they are left with 50 classes and 402,738 documents. The documents were divided into training and test sets equally and randomly, with each document representing the 2000 most frequent words in the dataset, excluding stopwords.<br />
<br />
They trained two neural networks, with size 2000-2000-1000-50, one using dropout and backpropagation, and the other using standard backpropagation. The training hyperparameters are the same as that in MNIST, but training was done for 500 epochs.<br />
<br />
In the following figure, we see the significant improvements by the model with dropout in the test set error. On the right side, we see that learning with dropout also proceeds smoother. <br />
<br />
[[File:reuters_figure.png|700px|center]]<br />
<br />
= CNN =<br />
<br />
Feed-forward neural networks consist of several layers of neurons where each neuron in a layer applies a linear filter to the input image data and is passed on to the neurons in the next layer. When calculating the neuron’s output, scalar bias a.k.a weights are applied to the filter with nonlinear activation function as parameters of the network that are learned by training data. [[File:cnnbigpicture.jpeg|thumb|upright=2|center|alt=text|Figure: Overview of Convolutional Neural Network]] There are several differences between Convolutional Neural networks and ordinary neural networks. The figure above gives a visual representation of a Convolutional Neural Network. First, CNN’s neurons are organized topographically into a bank and laid out on a 2D grid, so it reflects the organization of dimensions of the input data. Secondly, neurons in CNN apply filters which are local, and which are centered at the neuron’s location in the topographic organization. Meaning that useful metrics or clues to identify the object in an input image can be found by examining local neighborhoods of the image. Next, all neurons in a bank apply the same filter at different locations in the input image. When looking at the image example, green is input to one neuron bank, yellow is a filter bank, and pink is the output of one neuron bank (convolved feature). A bank of neurons in a CNN applies a convolution operation, aka filters, to its input where a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. The whole process reduces the net’s representational capacity but also reduces the capacity to overfit.<br />
[[File:bankofneurons.gif|thumb|upright=3|center|alt=text|Figure: Bank of neurons]]<br />
<br />
=== Pooling ===<br />
<br />
Pooling layer summarizes the activities of local patches of neurons in the convolutional layer by subsampling the output of a convolutional layer. Pooling is useful for extracting dominant features, to decrease the computational power required to process the data through dimensionality reduction. The procedure of pooling goes on like this; output from convolutional layers is divided into sections called pooling units and they are laid out topographically, connected to a local neighborhood of other pooling units from the same convolutional output. Then, each pooling unit is computed with some function which could be maximum and average. Maximum pooling returns the maximum value from the section of the image covered by the pooling unit while average pooling returns the average of all the values inside the pooling unit (see example). In the result, there are fewer total pooling units than convolutional unit outputs from the previous layer, this is due to the larger spacing between pixels on pooling layers. Using the max-pooling function reduces the effect of outliers and improves generalization. Other than that, overlapping pooling makes this spacing between pixels smaller than the size of the neighborhood that the pooling units summarize (This spacing is usually referred to as the stride between pooling units). With this variant, pooling layer can produce a coarse coding of the outputs which helps generalization. <br />
[[File:maxandavgpooling.jpeg|thumb|upright=2|center|alt=text|Figure: Max pooling and Average pooling]]<br />
<br />
=== Local Response Normalization === <br />
<br />
This network includes local response normalization layers which are implemented in lateral form and used on neurons with unbounded activations and permit the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity of a neuron in bank <math>i</math> at position <math>(x,y)</math> by the equation:<br />
[[File:local response norm.png|upright=2|center|]] where the sum runs over <math>N</math> ‘adjacent’ banks of neurons at the same position as in the topographic organization of neuron bank. The constants, <math>N</math>, <math>alpha</math> and <math>betas</math> are hyper-parameters whose values are determined using a validation set. This technique is replaced by better techniques such as the combination of dropout and regularization methods (<math>L1</math> and <math>L2</math>)<br />
<br />
=== Neuron nonlinearities ===<br />
<br />
All of the neurons for this model use the max-with-zero nonlinearity where output within a neuron is computed as <math> a^{i}_{x,y} = max(0, z^i_{x,y})</math> where <math> z^i_{x,y} </math> is the total input to the neuron. The reason they use nonlinearity is that it has several advantages over traditional saturating neuron models, such as a significant reduction in training time required to reach a certain error rate. Another advantage is that nonlinearity reduces the need for contrast-normalization and data pre-processing since neurons do not saturate- meaning activities simply scale up little by little with usually large input values. For this model’s only pre-processing step, they subtract the mean activity from each pixel and the result is a centered data.<br />
<br />
=== Objective function ===<br />
<br />
The objective function of their network maximizes the multinomial logistic regression objective which is the same as minimizing the average cross-entropy across training cases between the true label and the model’s predicted label.<br />
<br />
=== Weight Initialization === <br />
<br />
It’s important to note that if a neuron always receives a negative value during training, it will not learn because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance. High variance in weights will set a certain number of neurons with positive values for learning to happen, and in practice, it’s necessary to try out several candidates for variances until a working initialization is found. In their experiment, setting a positive constant, or 1, as biases of the neurons in the hidden layers was helpful in finding it.<br />
<br />
=== Training ===<br />
<br />
In this model, a batch size of 128 samples and momentum of 0.9, we train our model using stochastic gradient descent. The update rule for weight <math>w</math> is $$ v_{i+1} = 0.9v_i + \epsilon <\frac{dE}{dw_i}> i$$ $$w_{i+1} = w_i + v_{i+1} $$ where <math>i</math> is the iteration index, <math>v</math> is a momentum variable, <math>\epsilon</math> is the learning rate and <math>\frac{dE}{dw}</math> is the average over the <math>i</math>th batch of the derivative of the objective with respect to <math>w_i</math>. The whole training process on CIFAR-10 takes roughly 90 minutes and ImageNet takes 4 days with dropout and two days without.<br />
<br />
=== Learning ===<br />
To determine the learning rate for the network, it is a must to start with an equal learning rate for each layer which produces the largest reduction in the objective function with the power of ten. Usually, it is in the order of <math>10^{-2}</math> or <math>10^{-3}</math>. In this case, they reduce the learning rate twice by a factor of ten before the termination of training.<br />
<br />
= CIFAR-10 =<br />
<br />
=== CIFAR-10 Dataset ===<br />
<br />
Removing incorrect labels, The CIFAR-10 dataset is a subset of the Tiny Images dataset with 10 classes. It contains 5000 training images and 1000 testing images for each class. The dataset has 32 x 32 color images searched from the web and the images are labeled with the noun used to search the image.<br />
<br />
[[File:CIFAR-10.png|thumb|upright=2|center|alt=text|Figure 4: CIFAR-10 Sample Dataset]]<br />
<br />
=== Models for CIFAR-10 ===<br />
<br />
Two models, one with dropout and one without dropout, were built to test the performance of dropout on CIFAR-10. All models have CNN with three convolutional layers each with a pooling layer. All of the pooling payers use a stride=2 and summarize a 3*3 neighborhood. The max-pooling method is performed by the pooling layer which follows the first convolutional layer, and the average-pooling method is performed by the remaining 2 pooling layers. The first and second pooling layers with <math>N = 9, α = 0.001</math>, and <math>β = 0.75</math> are followed by response normalization layers. A ten-unit softmax layer, which is used to output a probability distribution over class labels, is connected with the upper-most pooling layer. Using a filter size of 5×5, all convolutional layers have 64 filter banks.<br />
<br />
Additional changes were made with the model with dropout. The model with dropout enables us to use more parameters because dropout forces a strong regularization on the network. Thus, a fourth weight layer is added to take the input from the previous pooling layer. This fourth weight layer is locally connected, but not convolutional, and contains 16 banks of filters of size 3 × 3 with 50% dropout. Lastly, the softmax layer takes its input from this fourth weight layer.<br />
<br />
Thus, with a neural network with 3 convolutional hidden layers with 3 max-pooling layers, the classification error achieved 16.6% to beat 18.5% from the best-published error rate without using transformed data. The model with one additional locally-connected layer and dropout at the last hidden layer produced an error rate of 15.6%.<br />
<br />
= ImageNet =<br />
<br />
===ImageNet Dataset===<br />
<br />
ImageNet is a dataset of millions of high-resolution images, and they are labeled among 1000 different categories. The data were collected from the web and manually labeled using MTerk tool, which is a crowd-sourcing tool provided by Amazon.<br />
Because this dataset has millions of labeled images in thousands of categories, it is very difficult to have perfect accuracy on this dataset even for humans because the ImageNet images may contain multiple objects and there are a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000). The size of ImageNet is about 1.3 million training images, 50,000 validation images, and 150,000 testing images. They used resized images of 256 x 256 pixels for their experiments.<br />
<br />
'''An ambiguous example to classify:'''<br />
<br />
[[File:imagenet1.png|200px|center]]<br />
<br />
When this paper was written, the best score on this dataset was the error rate of 45.7% by High-dimensional signature compression for large-scale image classification (J. Sanchez, F. Perronnin, CVPR11 (2011)). The authors of this paper could achieve a comparable performance of 48.6% error rate using a single neural network with five convolutional hidden layers with a max-pooling layer in between, followed by two globally connected layers and a final 1000-way softmax layer. When applying 50% dropout to the 6th layer, the error rate was brought down to 42.4%.<br />
<br />
'''ImageNet Dataset:'''<br />
<br />
[[File:imagenet2.png|400px|center]]<br />
<br />
===Models for ImageNet===<br />
<br />
They mostly focused on the model with dropout because the one without dropout had a similar approach, but there was a serious issue with overfitting. They used a convolutional neural network trained by 224×224 patches randomly extracted from the 256 × 256 images. This could reduce the network’s capacity to overfit the training data and helped generalization as a form of data augmentation. The method of averaging the prediction of the net on ten 224 × 224 patches of the 256 × 256 input image was used for testing their model patched at the center, four corners, and their horizontal reflections. To maximize the performance on the validation set, this complicated network architecture was used and it was found that dropout was very effective. Also, it was demonstrated that using non-convolutional higher layers with the number of parameters worked well with dropout, but it had a negative impact on the performance without dropout.<br />
<br />
The network contains seven weight layers. The first five are convolutional, and the last two are globally-connected. Max-pooling layers follow the layer number 1,2, and 5. And then, the output of the last globally-connected layer was fed to a 1000-way softmax output layer. Using this architecture, the authors achieved an error rate of 48.6%. When applying 50% dropout to the 6th layer, the error rate was brought down to 42.4%.<br />
<br />
<br />
[[File:modelh2.png|700px|center]] <br />
<br />
[[File:layer2.png|600px|center]]<br />
<br />
Like the previous datasets, such as the MNIST, TIMIT, Reuters, and CIFAR-10, we also see a significant improvement for the ImageNet dataset. Including complicated architectures like this one, introducing dropout generalizes models better and gives lower test error rates.<br />
<br />
= Conclusion =<br />
<br />
The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.<br />
<br />
The authors comment on a theory that sexual reproduction limits biological function to a small number of coadapted genes. The idea is that a given organism is unlikely to receive many coordinated genes from a parent, so will likely die if it relies on many genes to perform a given task. This limits the number of genes required to perform a function, which is like a built-in evolutionary dropout.<br />
<br />
= Critiques =<br />
It is a very brilliant idea to dropout half of the neurons to reduce co-adaptations. It is mentioned that for fully connected layers, dropout in all hidden layers works better than dropout in only one hidden layer. There is another paper Dropout: A Simple Way to Prevent Neural Networks from<br />
Overfitting[https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf] gives a more detailed explanation.<br />
<br />
It will be interesting to see how this paper could be used to prevent the overfitting of LSTMs.<br />
<br />
This paper focused more on CV tasks, it will be interesting to have some discussion on NLP tasks<br />
<br />
Firstly, it is a very interesting topic of classification by the "dropout" CNN method(omitting neurons in hidden layers). If the author can briefly explain the advantages of this method in processing image data in theory, it will be easier for readers to understand. Also, how to deal with the overfitting issue would be valuable.<br />
<br />
The authors mention that they tried various dropout probabilities and that the majority of them improved the model's generalization performance, but that more extreme probabilities tended to be worse which is why a dropout rate of 50% was used in the paper. The authors further develop this point to mention that the method can be improved by adapting individual dropout probabilities of each hidden or input unit using validation tests. This would be an interesting area to further develop and explore, as using a hardcoded 50% dropout for all layers might not be the optimal choice for all CNN applications. It would have been interesting to see the results of their investigations of differing dropout rates.<br />
<br />
The authors don't explain that during training, at each layer that we apply dropout, the values must be scaled by 1/p where p is dropout rate - this way the expected value of the layers is the same in both train and test time. They may have considered another solution for this discrepancy at the time (it is an old paper) but it doesn't seem like any solution was presented here. <br />
<br />
Despite the advantages of using dropout to prevent overfitting and reducing errors in testing, the authors did not discuss much the effects on the length of training time. In another [https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf paper] published a few years later by the same authors, there was more discussion about this. It appears that dropout increases training time by 2-3 times compared to a standard NN with the same architecture, which is a drawback that might be worth mentioning.<br />
<br />
Dropout layers prevent overfitting by random dropout of a fraction of the neurons specified in each layer. In fact, the neurons to be dropped out in each layer are randomly selected. Therefore, it might be the case that some important features in the dropout layer are discarded, which leads to a sudden drop in performance. Although this barely happens, and CNN with dropout rates roughly 50% in each layer will lead to generally good performance, some future improvements are still possible if we are able to select dropout neurons cleverly.<br />
<br />
The article does a good job of analyzing the benefit of using the standard dropout method, but I think it would be beneficial to take a look at other dropout variants. For example, the paper may have benefited from looking at DropConnect which was introduced y L. Wan et al and is similar to dropout layers but it does not apply a dropout directly o the neurons but on the weights and the bias linking the neurons. Others that they also could have looked at were Standout, Pooling Drop and MaxDrop. Comparing various dropout methods I think would greatly add to the paper.<br />
<br />
The author analyzed the dropout method for addressing overfitting problems. The key idea is to randomly drop units from the neural network during training. This prevents units from co-adapting too much. In addition, it also fastens the speed of training models since there are fewer neurons, which is a good idea. <br />
<br />
Random dropping was indeed quite effective in the MNIST fashion classification challenge, however, it may pose a question if the problem has very few features to begin with.<br />
<br />
The authors mentioned that they used Momentum to speed up the training but didn't show the alternative and the speed of the alternative. This [https://link.springer.com/article/10.1007/s11042-019-08453-9 paper]conducts an empirical study of Dropout vs Batch Normalization as well as compares different optimizers (like SGD which uses momentum) for each technique. It is found that optimizers with momentum outperform adaptive optimizers but at a cost of significantly longer training times.<br />
<br />
Dropout is a very popular technique to improve accuracy by avoiding overfitting. It might be interesting to see how it compares to other techniques and how the combination of techniques works.<br />
<br />
Dropout is really popular, but its usefulness is oftentimes inversely proportional to the amount of data in the training data, making it less useful for many modern applications. This paper could talk a little more about alternate methods, such as data augmentation, or L1/L2 regularization, and present them as viable alternatives.<br />
<br />
Dropout rates often vary across layers, the author does not specify which layers are applied to the dropout layer nor the team explains which layers applied are more important than others. The team gave concrete testing results against other models using many standard tests to show the effectiveness of such a method. However, there is no particular visualization (graphs) against other models in terms of efficiency when a standard training epoch, data split, preprocessing or training/testing loss. <br />
<br />
According to the provided frame classification error rate on the core test set of the TIMIT benchmark plot, dropout of 50% of the hidden units and 20% of the input units significantly improves classification. As the plot shown, the curve for finetuning without dropout took a very early turn at epoch size approximately 20, and then maintained very high classification error rate as epoch size increases. The curve for finetuning with dropout kept low no matter the size of epoch after roughly 50, but it fluctuates a lot harder than the curve without dropout, indicating that it has a higher sensitivity in response to epoch size.<br />
<br />
<br />
== Other Work ==<br />
<br />
In modern training, dropout is not advised for convolutional neural networks because it does not have the effect, interpretation, impact on spatial feature maps as dense features. This is because features in CNNs are spatially correlated. There is an interesting paper on DropBlock [2], a dropout method that drops entire contiguous regions of features, which has been shown to be much more effective for CNNs.<br />
<br />
== Reference ==<br />
[1] N. Srivastave, "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Jan 2014.<br />
<br />
[2] Ghiasi, Golnaz and Lin, Tsung-Yi and Le, Quoc V. "DropBlock: A regularization method for convolutional networks". NeurIPS, 2018.</div>Sh68leehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Improving_neural_networks_by_preventing_co-adaption_of_feature_detectors_2020_FallImproving neural networks by preventing co-adaption of feature detectors 2020 Fall2020-11-27T02:41:11Z<p>Sh68lee: Created page with "adasdasdasdasdas"</p>
<hr />
<div>adasdasdasdasdas</div>Sh68leehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=A_universal_SNP_and_small-indel_variant_caller_using_deep_neural_networksA universal SNP and small-indel variant caller using deep neural networks2020-11-26T22:18:48Z<p>J27ni: </p>
<hr />
<div>== Background ==<br />
<br />
Genes determine Biological functions, and mutants or alleles(one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome) of those genes determine differences within a function. Determining novel alleles is very important in understanding the genetic variation within a species. For example, different alleles of the gene OCA2 determine the colour of pupils. All animals receive one copy of each gene from each of their parents. Mutations of a gene are classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).<br />
<br />
Next-generation sequencing is a prevalent technique for sequencing or reading DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. Next-generation sequencing works by reading short sections of DNA of length k, called k-means, and then piecing them together or aligning them to a reference genome. Next-generation sequencing is relatively fast and inexpensive, although it can randomly misidentify some nucleotides, introducing errors. However, NGS reading is errorful and arises from a complex error process depending on various factors.<br />
<br />
The process of variant calling is determining novel alleles from sequencing data (typically next-generation sequencing data). Some significant alleles only differ from the "standard" version of a gene by only a single base pair, such as the mutation which causes multiple sclerosis. Therefore it is crucial to accurately call single nucleotide swaps/polymorphisms (SNPs), insertions, and deletions (indels). Calling SNPs and small indels are technically challenging since it requires a program to distinguish between genuinely novel mutations and errors in the sequencing data.<br />
<br />
Previous approaches usually involved using various statistical techniques. A widely used one is GATK. GATK uses a combination of logistic regression, hidden Markov models, naive Bayes classification, and Gaussian mixture models to perform the process [2]. However, these methods have their weaknesses as some assumptions do not hold (i.e., independence assumptions). In addition, given that GATK testing has focused primarily on human whole-genome data sequenced using Illumina technology, it is not easily generalizable to different types of data, organisms, and experimental designs/sequencing technologies [[https://gatk.broadinstitute.org/hc/en-us/articles/360035894711-About-the-GATK-Best-Practices 3]]. <br />
<br />
This paper aims to solve the problem of calling SNPs and small indels using a convolutional neural net by casting the reads as images and classifying whether they contain a mutation. It introduces a variant caller called "DeepVariant", which requires no specialized knowledge, but performs better than previous state-of-art methods.<br />
<br />
== Overview ==<br />
<br />
In Figure 1, the DeepVariant workflow overview is illustrated.<br />
<br />
[[File:figure 111.JPG|Figure 1. In all panels, blue boxes represent data and red boxes are processes]]<br />
<br />
<br />
Initially, the NGS reads aligned to a reference genome which are then scanned for candidate variants which are different sites from the reference genome. The read and reference data are encoded as an image for each candidate variant site. Then, the trained CNN can compute the genotype likelihoods, (heterozygous or homozygous) for each of the candidate variants (figure1, left box). <br />
To train the CNN for image classification purposes, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to CNN for training, and a stochastic gradient descent algorithm is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the convergence of the model, the final model is frozen to use for calling mutations for other image classification tests (figure1, middle box).<br />
For example, in figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. The CNN using this encoded image computes the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example, a heterozygous variant call is emitted, as the most probable genotype here is “het”.<br />
<br />
== Preprocessing ==<br />
<br />
Before the sequencing reads can be fed into the classifier, they must be pre-processed. There are many pre-processing steps that are necessary for this algorithm. These steps represent the real novelty in this technique by transforming the data to allow us to use more common neural network architectures for classification. The pre-processing of the data can be broken into three phases: the realignment of reads, finding candidate variants and creating the candidate variants' images.<br />
<br />
The realignment of the pre-processing reads phase is essential to ensure the sequences can be adequately compared to the reference sequences. First, the sequences are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are re-aligned. This process updates the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string to represent a sequence's alignment to a reference for each read.<br />
<br />
Once the reads are correctly aligned, the algorithm then proceeds to find candidate variants, regions in the DNA sequence containing variants. It is these candidate variants that will eventually be passed as input to the neural network. To find these, we need to consider each position in the reference sequence independently. Any unusable reads are filtered at this point. This includes reads that are not appropriately aligned, marked as duplicates, those that fail vendor quality checks, or whose mapping quality is less than ten. For each site in the genome, we collect all the remaining reads that overlap that site. The corresponding allele aligned to that site is then determined by decoding the CIGAR string, which was updated in each read's realignment phase. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if each base in the allele has a quality score of at least 10.<br />
<br />
The last phase of pre-processing is to convert these candidate variants into images representing the data with candidate variants identified. This allows for the use of well established convolutional neural networks for image classification for this technical problem. Each color channel is used to store a different piece of information about a candidate variant. The red channel encodes which base we have (A, G, C, or T) by mapping each base to a particular value. The quality of the read is mapped to the green color channel.<br />
<br />
Moreover, the blue channel encodes whether or not the reference is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base in that read. The reference strand is repeated for the first five rows of the encoded image to maintain its information after a 5x5 convolution is applied. With the data pre-processing complete, the images can then be passed into the neural network for classification.<br />
<br />
== Neural Network ==<br />
<br />
The neural network used is a convolutional neural network. Although the full network architecture is not revealed in the paper, there are several details which we can discuss. The architecture of the network is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. The inception v2 model in particular uses a series of CNNs. One interesting aspect about the Inception model is that rather than optimizing a series of hyperparameters in order to determine the most optimal parameter configuration, Inception instead concatenates a series of different sizes of filters on the same layer, which acts to learn the best architecture out of these concatenated filters. The input layer takes as input the images representing the candidate variants and rescales them to 299x299 pixels. The output layer is a three-class Softmax layer initialized with Gaussian random weights with a standard deviation of 0.001. This final layer is fully connected to the previous layer. The three classes are the homozygous reference (meaning it is not a variant), heterozygous variant, and homozygous variant. The candidate variant is classified into the class with the highest probability. The model is trained using stochastic gradient descent with a weight decay of 0.00004. The training was done in mini-batches, each with 32 images, using a root mean squared (RMS) decay of 0.9. For the multiple sequencing technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, and the one with the highest accuracy on the training set was chosen as the final model. The multiple models stem from using each combination of the possible parameter values for the learning rate (0.00095, 0.001, 0.0015) and momentum (0.8, 0.85, 0.9). These models were trained for 80 hours, or until the training accuracy converged.<br />
<br />
== Results ==<br />
<br />
DeepVariant was trained using data available from the CEPH (Centre d’Etude du Polymorphism Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with other most commonly used bioinformatics methods, such as the GATK, FreeBayes22, SAMtools23, 16GT24 and Strelka25 (Table 1). For better comparison, the overall accuracy (F1), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are illustrated over the whole genome.<br />
<br />
[[File:table 11.JPG]]<br />
<br />
DeepVariant showed the highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm. <br />
<br />
They also evaluated the same set of algorithms using the synthetic diploid sample CHM1-CHM1326 (Table 2).<br />
<br />
[[File:Table 333.JPG]]<br />
<br />
Results illustrated that the DeepVariant method outperformed all other algorithms for variant calling (SNP and indel) and showed the highest accuracy in terms of F1, Recall, precision and TP.<br />
<br />
== Conclusion ==<br />
<br />
This endeavor to further advance a data-centric approach to understanding the gene sequence illustrates the advantages of deep learning over humans. With billions of DNA base pairs, no humans can digest that amount of gene expressions. In the past, computational techniques are unfeasible due to the lack of computing power, but in the 21st century, it seems that machine learning is the way to go for molecular biology.<br />
<br />
DeepVariant’s strong performance on human data proves that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other states of the art variant callers, DeepVariant does not know the sequencing technologies that create the reads or even the biological processes that introduce mutations. It simplifies the problem of variant calling to preprocessing the reads and training a generic deep learning model. It also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and developing a dedicated CNN architecture for the reads, rather than casting them as images.<br />
<br />
== Online Methods ==<br />
<br />
Besides the main topic, the paper also introduced some valuable online methods. These methods have cross intersection with the main topic, although not implicitly, they really showed how some of the other tasks can be tackled using efficient methods. <br />
<br />
=== Haplotype-aware realignment of reads ===<br />
<br />
Mapped reads are preprocessed using an error-tolerant, local De-Bruijn-graph-based read assembly procedure. Candidate windows across the genome are selected for reassembly by looking for any evidence of possible genetic variation. The selection criteria for a candidate window are very permissive so that true variation is unlikely to be missed.<br />
<br />
=== Finding candidate variants ===<br />
<br />
=== Creating images around candidate variants ===<br />
<br />
=== Deep learning ===<br />
<br />
=== Deep Variant inference client and allele merging ===<br />
<br />
== Critique and Discussion==<br />
<br />
The paper presents an attractive method for solving a significant problem. Building "images" of reads and running them through a generic image classification CNN seems like a strange approach, and, interestingly, it works well. The most significant issues with the paper are the lack of specific information about how the methods. Some extra information is included in the supplementary material, but there are still some significant gaps. In particular:<br />
<br />
1. What is the structure of the neural net? How many layers, and what sizes? The paper for ConvNet does not have this information. We suspect that this might be a trade secret that Google is protecting.<br />
<br />
2. How is the realignment step implemented? The paper mentions that it uses a "De-Bruijn-graph-based read assembly procedure" to realign reads to a new haplotype. It is a non-standard step in most genomics workflows, yet the paper does not describe how they do the realignment or build the haplotypes.<br />
<br />
3. How did they settle on the image construction algorithm? The authors provide pseudocode for the construction of pileup images, but they do not describe how to make decisions. For instance, the color values for different base pairs are not evenly spaced. Also, the image begins with five rows of the reference genome.<br />
<br />
One thing we appreciated about the paper was their commentary on future developments. The authors clarify that this approach can be improved on and provide specific ideas for the next steps.<br />
<br />
Overall, the paper presents an interesting idea with strong results but lacks detail in some vital implementation pieces.<br />
<br />
4. The topic of this project is good, but we need more details on the algorithm. In the neural network part, the details are not enough; authors should provide a figure to explain better how the model works and the model's structure. Otherwise, we cannot understand how the model works. When we preprocess the data if different data have different lengths, shall we add more information or drop some information to match?<br />
<br />
5. Particularly, which package did the researchers use to perform this analysis? Different packages of deep learning can have different accuracy and efficiency while making predictions on this data set. <br />
<br />
Further studies on DeepVariant [https://www.nature.com/articles/s41598-018-36177-7 have shown] that it is a framework with great potential and sets the medical standard genetics field.<br />
<br />
Another good follow up works can be seen here [https://www.semanticscholar.org/paper/A-review-of-somatic-single-nucleotide-variant-for-Xu/81b4246c90c3c8036ff719b9d4c3d5e83fbb32dc]<br />
<br />
6. It was mentioned that part of the network used is an "adapted Inception v2 ImageNet model" - does this mean that it used an Inception v2 model that was trained on ImageNet? This is not clear, but if this is the case, then why is this useful? Why would the features that are extracted for an image be useful for genomics? Did they try using a model that was not trained? Also, they describe the preprocessing method used but were there any alternatives that they considered?<br />
<br />
7. A more extensive discussion on the "Neural Network" section can be given. For example, the paragraph's last sentence says that "Models are trained for 80 hours, or until the training accuracy converged." This sentence implies that if the training accuracy does converge, it usually takes less than 80 hours. More exciting data can thus be presented about the training accuracy converging time. How often do the models converge? How long does it take for the models to converge on average? Moreover, why is the number 80 chosen here? Is it a random upper bound set by people to bound the runtime of the model training, or is the number 80 carefully chosen so that it is twice/three times/ten times the average training accuracy converging time?<br />
<br />
8. It would be more convincing if the author could provide more detail on the structure of the neural network.<br />
<br />
9. It is clear that this is a very thoroughly written paper with substantial comparison results and computation numbers to back up the testing. However, simply because the implementation link is given in the paper, there still lacks information regarding the structure of the model. It is also structurally missing a conclusion section to complete a summary on the overall conclusion and results comparison to give a final conclusion to the efficacy of the proposed model.<br />
<br />
10. It would be interesting to see how the model would behave if we incorporate transformers to the model, this has been on reason Alpha Fold 2 was so successful.<br />
<br />
11. The result illustration is hard to interpret, it would be nicer if explanations can be added. Lacking details on neural networks used. How was mathematical calculations done on prediction?<br />
<br />
12. It would be better if the author could provide a list of other potential networks that could be used to address the problem and a comparison between them.<br />
<br />
13. It is very interesting to see that a part of the data preprocessing is to create images from the data as one would normally expect you to simply feed the data itself into the network. By converting to image data does this help the network classify it better or would simply feeding forward the unconverted data be better? This is interesting as at the end of the day an image is still just numerical data so would one representation hold more value over the other?<br />
<br />
14. The author opened the door to machine learning solutions in molecular biology by solving genetic problems through data methods. But the details of the neural network training parameters still needs to be clarified. In addition, it is worthy of attention and further explanation about the results of prediction accuracy, under what environment, and how to compare with human predictions.<br />
<br />
== References ==<br />
[1] Hartwell, L.H. ''et. al.'' ''Genetics: From Genes to Genomes''. (McGraw-Hill Ryerson, 2014).<br />
<br />
[2] Poplin, R. ''et. al''. A universal SNP and small-indel variant caller using deep neural networks. ''Nature Biotechnology'' '''36''', 983-987 (2018).</div>Iaoellmehttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Loss_Function_Search_for_Face_RecognitionLoss Function Search for Face Recognition2020-11-26T18:59:59Z<p>J27ni: /* Critiques */</p>
<hr />
<div>== Presented by ==<br />
Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang<br />
<br />
== Introduction ==<br />
Face recognition is a technology that can label a face to a specific identity. The field of study involves two tasks: 1. Identifying and classifying a face to a certain identity and 2. Verifying if this face image and another face image map to the same identity. Loss functions play an important role in evaluating how well the prediction models the given data. In the application of face recognition, they are used for training convolutional neural networks (CNNs) with discriminative features. A discriminative feature is one that is able to successfully discriminate the labeled data, and is typically a result of feature engineering/selection. However, traditional softmax loss lacks the power of feature discrimination. To solve this problem, a center loss was developed to learn centers for each identity to enhance the intra-class compactness. Hence, the paper introduced a new loss function using a scale parameter to produce higher gradients to well-separated samples which can reduce the softmax probability. <br />
<br />
Margin-based (angular, additive, additive angular margins) soft-max loss functions are important in learning discriminative features in face recognition. There have been hand-crafted methods previously developed that require much efforts such as A-softmax, V-softmax, AM-Softmax, and Arc-softmax. Li et al. proposed an AutoML for loss function search method also known as AM-LFS from a hyper-parameter optimization perspective [2]. It automatically determines the search space by leveraging reinforcement learning to the search loss functions during the training process, though the drawback is the complex and unstable search space.<br />
<br />
'''Soft Max'''<br />
<br />
Softmax probability is the probability for each class. It contains a vector of values that add up to 1 while ranging between 0 and 1. Cross-entropy loss is the negative value of target values times the log of the probabilities. When softmax probability is combined with cross-entropy loss in the last fully connected layer of the CNN, it yields the softmax loss function:<br />
<br />
<center><math>L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k≠y}^K{e^{w^T_yx}}}</math> [1] </center><br />
<br />
<br />
Specifically for face recognition, <math>L_1</math> is modified such that <math>w^T_yx</math> is normalized and <math>s</math> represents the magnitude of <math>w^T_yx</math>:<br />
<br />
<center><math>L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}}</math> [1] </center><br />
<br />
Where <math> \cos{(\theta_{{w_k},x})} = w^T_y </math> is cosine similarity and <math>\theta_{{w_k},x}</math> is angle between <math> w_k</math> and x. The learnt features with this soft max loss are prone to be separable (as desired).<br />
<br />
'''Margin-based Softmax'''<br />
<br />
This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, the functions are built upon the same structure as the equation above.<br />
<br />
The margin-based softmax function is:<br />
<br />
<center><math>L_3=-\log\frac{e^{s f{(m,\theta_{{w_y},x})}}}{e^{s f{(m,\theta_{{w_y},x})}} + \sum_{k≠y}^K{e^{s \cos{(\theta_{{w_y},x})}}}} </math> </center><br />
<br />
Here, <math>f{(m,\theta_{{w_y},x})} \leq \cos (\theta_{w_y,x})</math> is a carefully chosen margin function.<br />
<br />
Some other variations of chosen functions:<br />
<br />
'''A-Softmax Loss: ''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x})</math> , where m1 >= 1 and a integer.<br />
<br />
'''Arc-Softmax Loss: ''' <math>f{(m_1,\theta_{{w_y},x})} = \cos (\theta_{w_y,x} + m_2)</math>, where m2 > 0<br />
<br />
'''AM-Softmax Loss: ''' <math>f{(m,\theta_{{w_y},x})} = \cos (m_1\theta_{w_y,x} + m_2) - m_3</math>, where m1 >= 1 and a integer; m2,m3 > 0<br />
<br />
<br />
<br />
In this paper, the authors first identified that reducing the softmax probability is a key contribution to feature discrimination and designed two search spaces (random and reward-guided method). They then evaluated their Random-Softmax and Search-Softmax approaches by comparing the results against other face recognition algorithms using nine popular face recognition benchmarks.<br />
<br />
== Motivation ==<br />
Previous algorithms for facial recognition frequently rely on CNNs that may include metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost for these functions is high. This drawback prompts the redesign of classical softmax loss that cannot discriminate features. Multiple softmax loss functions have since been developed including margin-based formulations which often require fine-tuning of parameters and are susceptible to instability. Therefore, researchers need to put in a lot of effort in creating their method in the large design space. AM-LFS takes an optimization approach for selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.<br />
<br />
To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. The development of margin-based softmax loss with only one required parameter and an improved search space using a reward-based method was determined by the authors to be the best option for their loss function.<br />
<br />
== Problem Formulation ==<br />
=== Analysis of Margin-based Softmax Loss ===<br />
Based on the softmax probability and the margin-based softmax probability, the following function can be developed [1]:<br />
<br />
<center><math>p_m=\frac{1}{ap+(1-a)}*p</math></center><br />
<center> where <math>a=1-e^{s\,{cos{(\theta_{w_y},x)}-f{(m,\theta_{w_y},x)}}}</math> and <math>a≤0</math></center><br />
<br />
<math>a</math> is considered as a modulating factor and <math>h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]</math> is a modulating function [1]. Therefore, regardless of the margin function (<math>f</math>), the minimization of the softmax probability will ensure success.<br />
<br />
Compared to AM-LFS, this method involves only one parameter (<math>a</math>) that is also constrained, versus AM-LFS which has 2M parameters without constraints that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS (<math>p_m={a_i}p+b_i</math>) may not be discriminative because it could be larger than the softmax probability, while <math>p_m=h(a, p)*p < p </math> always holds.<br />
<br />
=== Random Search ===<br />
Unified formulation <math>L_5</math> is generated by inserting a simple modulating function <math>h{(a,p)}=\frac{1}{ap+(1-a)}</math> into the original softmax loss. It can be written as below [1]:<br />
<br />
<center><math>L_5=-log{(h{(a,p)}*p)}</math> where <math>h \in (0,1]</math> and <math>a≤0</math></center><br />
<br />
This encourages the feature margin between different classes and has the capability of feature discrimination. This leads to defining the search space as the choice of <math>h{(a,p)}</math> whose impacts on the training procedure are decided by the modulating factor <math>a</math>. In order to validate the unified formulation, a modulating factor is randomly set at each training epoch. This is noted as Random-Softmax in this paper.<br />
<br />
=== Reward-Guided Search ===<br />
Random search has no guidance for training. To solve this, the authors use reinforcement learning. Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need to have input/output labelled and it does not need a sub-optimal action to be explicitly corrected. The algorithm receives feedback from the data to achieve the best outcome. The system has an agent that guides the process by taking an action that maximizes the notion of cumulative reward [3]. The process of RL is shown in figure 1. The equation of the cumulative reward function is: <br />
<br />
<center><math>G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T</math></center><br />
<br />
where <math>G_t</math> = cumulative reward, <math>R_t</math> = immediate reward, and <math>R_T</math> = end of episode.<br />
<br />
<math>G_t</math> is the sum of immediate rewards from arbitrary time <math>t</math>. It is a random variable because it depends on the immediate reward which depends on the agent action and the environment's reaction to this action.<br />
<br />
<center>[[Image:G25_Figure1.png|300px |link=https://en.wikipedia.org/wiki/Reinforcement_learning#/media/File:Reinforcement_learning_diagram.svg |alt=Alt text|Title text]]</center><br />
<center>Figure 1: Reinforcement Learning scenario [4]</center><br />
<br />
The reward function is what guides the agent to move in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome. This is caused by the reward being edited based on the feedback it receives when a task is completed [5]. <br />
<br />
In this paper, RL is being used to generate a distribution of the hyperparameter <math>\mu</math> for the SoftMax equation using the reward function. At each epoch, <math>B</math> hyper-parameters <math>{a_1, a_2, ..., a_B }</math> are sampled as <math>a \sim \mathcal{N}(\mu, \sigma)</math>. In each epoch, <math>B</math> models are generated with rewards <math>R(a_i), i \in [1, B]</math>. <math>\mu</math> updates after each epoch from the reward function. <br />
<br />
<center><math>\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}log{(g(a_i;\mu,\sigma))}</math></center><br />
<br />
Where <math>{g(a_i; \mu, \sigma})</math> is the PDF of a Gaussian distribution. The distributions of <math>{a}</math> are updated and the best model if found from the <math>{B}</math> candidates for the next epoch.<br />
<br />
=== Optimization ===<br />
Calculating the reward involves a standard bi-level optimization problem. A standard bi-level optimization problem is a hierarchy of two optimization tasks, an upper-level or leader and lower-level or follower problems, which involves a hyperparameter ({<math>a_1,a_2,…,a_B</math>}) that can be used for minimizing one objective function while maximizing another objective function simultaneously:<br />
<br />
<center><math>max_a R(a)=r(M_{w^*(a)},S_v)</math></center><br />
<center><math>w^*(a)=_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)</math></center><br />
<br />
In this case, the loss function takes the training set <math>S_t</math> and the reward function takes the validation set <math>S_v</math>. The weights <math>w</math> are trained such that the loss function is minimized while the reward function is maximized. The calculated reward for each model ({<math>M_{we1},M_{we2},…,M_{weB}</math>}) yields the corresponding score, then the algorithm chooses the one with the highest score for model index selection. With the model containing the highest score being used in the next epoch, this process is repeated until the training reaches convergence. In the end, the algorithm takes the model with the highest score without retraining.<br />
<br />
== Results and Discussion ==<br />
=== Data Preprocessing ===<br />
The training datasets consisted of cleaned versions of CASIA-WebFace and MS-Celeb-1M-v1c to remove the impact of noisy labels in the original sets.<br />
Furthermore, it is important to perform open-set evaluation for face recognition problems. That is, there shall be no overlapping identities between training and testing sets. As a result, there were a total of 15,414 identities removed from the testing sets. For fairness during comparison, all summarized results will be based on refined datasets.<br />
<br />
=== Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, DFP ===<br />
For LFW, there is not a noticeable difference between the algorithms proposed in this paper and the other algorithms, however, AM-Softmax achieved higher results than Search-Softmax. Random-Softmax achieved the highest results by 0.03%.<br />
<br />
Random-Softmax outperforms baseline Soft-max and is comparable to most of the margin-based softmax. Search-Softmax boosts the performance and better most methods specifically when training CASIA-WebFace-R data set, it achieves 0.72% average improvement over AM-Softmax. The reason the model proposed by the paper gives better results is because of their optimization strategy which helps boost the discrimination power. Also, the sampled candidate from the paper’s proposed search space can well approximate the margin-based loss functions. More tests need to happen to more complicated protocols to test the performance further. Not a lot of improvement has been shown on those test sets, since they are relatively simple and the performance of all the methods on these test sets are near saturation. The following table gives a summary of the performance of each model.<br />
<br />
<center>Table 1.Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<br />
<center>[[Image:G25_Table1.png|900px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on RFW ===<br />
The RFW dataset measures racial bias which consists of Caucasian, Indian, Asian, and African. Using this as the test set, Random-softmax and Search-softmax performed better than the other methods. Random-softmax outperforms the baseline softmax by a large margin which means reducing the softmax probability will enhance the feature discrimination for face recognition. It is also observed that the reward guided search-softmax method is more likely to enhance the discriminative feature learning resulting in higher performance as shown in Table 2 and Table 3. <br />
<br />
<center>Table 2. Verification performance (%) of different methods on the test set RFW. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table2.png|500px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 3. Verification performance (%) of different methods on the test set RFW. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table3.png|500px |alt=Alt text|Title text]]</center><br />
<br />
=== Results on MegaFace and Trillion-Pairs ===<br />
The different loss functions are tested again with more complicated protocols. The identification (Id.) Rank-1 and the verification (Veri.) with the true positive rate (TPR) at low false acceptance rate (FAR) at <math>1e^{-3}</math> on MegaFace, the identification TPR@FAR = <math>1e^{-6}</math> and the verification TPR@FAR = <math>1e^{-9}</math> on Trillion-Pairs are reported on Table 4 and 5.<br />
<br />
On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance over all other alternative methods. On MegaFace, Search-Softmax beat the best competitor AM-softmax by a large margin. It also outperformed AM-LFS due to newly designed search space. <br />
<br />
<center>Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''CASIA-WebFace-R''' [1].</center><br />
<center>[[Image:G25_Table4.png|450px |alt=Alt text|Title text]]</center><br />
<br />
<br />
<center>Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is '''MS-Celeb-1M-v1c-R''' [1].</center><br />
<center>[[Image:G25_Table5.png|450px |alt=Alt text|Title text]]</center><br />
<br />
From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. There is a similar trend with Trillion-Pairs where Search-Softmax loss is found to be superior with 4% improvements with CASIA-WebFace-R and 1% improvements with MS-Celeb-1M-v1c-R at both the identification and verification. Based on these experiments, Search-Softmax loss can perform well, especially with a low false positive rate and it shows a strong generalization ability for face recognition.<br />
<br />
<center>[[Image:G25_Figure2_left.png|800px |alt=Alt text|Title text]] [[Image:G25_Figure2_right.png|800px |alt=Alt text|Title text]]</center><br />
<center>Figure 2. From Left to Right: CMC curves and ROC curves on MegaFace Set with training set CASIA-WebFace-R, CMC curves and ROC curves on MegaFace Set with training set MS-Celeb-1M-v1c-R [1].</center><br />
<br />
== Conclusion ==<br />
The paper discussed that in order to enhance feature discrimination for face recognition, it is crucial to reduce the softmax probability. To achieve this goal, unified formulation for the margin-based softmax losses is designed. In addition, we define a new search space to ensure feature recognition. Two search methods have been developed using a random and a reward-guided loss function and they were validated to be effective over six other methods using nine different test data sets. While these developed methods were generally more effective in increasing accuracy versus previous methods, there is very little difference between the two. It can be seen that Search-Softmax performs slightly better than Random-Softmax most of the time.<br />
<br />
== Critiques ==<br />
* Thorough experimentation and comparison of results to state-of-the-art provided a convincing argument.<br />
* Datasets used did require some preprocessing, which may have improved the results beyond what the method otherwise would.<br />
* AM-LFS was created by the authors for experimentation (the code was not made public) so the comparison may not be accurate.<br />
* The test data set they used to test Search-Softmax and Random-Softmax are simple and they saturate in other methods. So the results of their methods didn’t show many advantages since they produce very similar results. A more complicated data set needs to be tested to prove the method's reliability.<br />
* There is another paper Large-Margin Softmax Loss for Convolutional Neural Networks[https://arxiv.org/pdf/1612.02295.pdf] that provides a more detailed explanation about how to reduce margin-based softmax loss.<br />
* It is questionable when it comes to the accuracy of testing sets, as they only used the clean version of CASIA-WebFace and MS-Celeb-1M-vlc for training instead of these two training sets with noisy labels.<br />
* In a similar [https://arxiv.org/pdf/1905.09773.pdf?utm_source=thenewstack&utm_medium=website&utm_campaign=platform paper], written by Tae-Hyun Oh et al., they also discuss an optimal loss function for face recognition. However, since in the other paper, they were doing face recognition from voice audio, the loss function used was slightly different than the ones discussed in this paper.<br />
* This model has many applications such as identifying disguised prisoners for police. But we need to do a good data preprocessing otherwise we might not get a good predicted result. But authors did not mention about the data preprocessing which is a key part of this model. It might be done in either feature extraction or feature selection, but it's crucial to indicate how they did it.<br />
* It will be better if we can know what kind of noises was removed in the clean version. Also, simply removing the overlapping data is wasteful. It would be better to just put them into one of the train and test samples.<br />
* This paper indicate that the new searching method and loss function have induced more effective face recognition result than other six methods. But there is no mention of the increase or decrease in computational efficiency since only very little difference exist between those methods and the real time evaluation is often required at the face recognition application level.<br />
* There are some loss functions that receives more than 2 inputs. For example, the ''triplet loss'' function, developed by Google, takes 3 inputs: positive input, negative input and anchor input. This makes sense because for face recognition, we want to model to learn not only what it is supposed to predict but also what it is not supposed to predict. Typically, triplet loss handles false positives much better. This paper can extend its scope to such loss function that takes more than 2 inputs.<br />
* It would be good to also know what the training time is like for the method, specifically the "Reward-Guided Search" which uses RL. Also the authors mention some data preprocessing that was performed, was this same preprocessing also performed for the methods they compared against?<br />
* Sections on Data Processing and Results can be improved. About the datasets, I have some questions about why they are divided in the current fashion. It is mentioned that "CASIA-WebFace and MS-Celeb-1M-v1c" are used as training datasets. But the comparison of algorithms are divided into three groups: Megaface and TrillionPairs, RFW, and a group of other datasets. In general, when we are comparing algorithms, we want to have a holistic view of how each algorithm compare. So I have some concerns about dividing the results into three section. More explanation can be provided. It also seems like Random-Softmax and Search Softmax outperform all other algorithms across all datasets. So it would make even more sense to have a big table including all the results. About data preprocessing, I believe that giving more information about which noisy data are removed would be nice.<br />
* Despite thorough comparison between each method against the proposed method, it does not give a reason to why it was the case that it was either better or worse, and it does not necessarily need to be a mathematical explanation but an intuitive one to demonstrate how it can be replicated and whether the results require a certain condition to achieve. <br />
* Though we have a graph demonstrating the training loss with Random-Softmax and Search-Softmax with regards to the number of Epochs as an independent variable which we may deduce the number of epochs used in later graphs but since one of the main features is that "Meanwhile, our optimization strategy enables that the dynamic loss can guide<br />
* Did the paper address why the average model performs worse on African faces, would it be a lack of data points?<br />
the model training of different epochs, which helps further boost the discrimination power." it is imperative that the results are comparable along the same scale (for example, for 20 epochs, then take the average of the losses).<br />
* The result summary is overwhelming with numbers and representation of result is lacking. It would be great if the result can be explained. Introduction of model and its component is lacking and could be explained more.<br />
* It would be better if the paper contains some Face Recognition visualization, i.e. show actually face recognition example to show the improvement.<br />
* The introduction of data and the analysis of data processing are important because there might be some limitations. Also, it would be better to give theoretical analysis of the effects of reducing softmax probability and the number of sampled models, which explains the update of the parameters for better performance.<br />
* It would be better to include time performance in the evaluation section.<br />
* The paper is missing details on datasets. It would be better to know if the datasets were balanced or unbalanced and how this would affect the accuracy. Also, computational comparisons between the new loss function versus traditional method would be interesting to know.<br />
* The paper included a dataset that measures racial bias, however it is a widely known fact that majority of face recognition models are trained on biased and imbalanced datasets themselves. For example, AI that has bias towards classifying a black person as a prisoner since the training set of prisoners is predominantly black. A question that remains unanswered is how training a model using the proposed loss function helps to combat racial bias in machine learning, and how these results in particular improved (or worsened) with its use.<br />
* There are too much data in the conclusion part. A brief conclusion based on several sentences should be enough to present the ideas.<br />
* The author could add the time efficiency of fave recognition in the result to compare the models with other current models for facial recognition since nowadays many application that uses face recognition rely on fast recognition(e.g. unlock phone with face id)<br />
*It is interesting to see how loss function impacts face recognition. It would be better to see different evaluations based on different datasets. Not only accuracy is important, but also efficiency is significant.<br />
* It is important for facial recognition models to feature some level of noise, since for most uses, things like lighting and camera quality is drastically different. For a more practical use of this, datasets with a certain level of noise should be introduced and tested.<br />
* The paper omits the full details of the necessity of normalization within the model; as described in the original [https://arxiv.org/abs/1801.05599 paper] describing A-softmax for face verification they stressed the importance of normalization. Does the lack of detail mean these new softmax loss functions don't require layer normalization?<br />
* The paper introduced the Search-Softmax algorithm, and written clearly in pseudocode. It will be more helpful if the piece of pseudocode is attached in the summary, because it is better than description in words. Also, it is important to reveal the experiment setting before talking about results. In the original paper, it is clearly stated the results were obtained under what condition. The condition includes what method was utilized for data processing(in this case it is a 6-layer CNN), how was the method implemented, etc. There is also another important concept mentioned called the ablation study. Basically it studied the effect of reducing softmax probability, the number of sampled models and the method convergence. They are important for later analysis because they are crucial to model optimization and usability. If the proposed Random-Softmax and Search-Softmax losses do not converge to a value(0 to be specific), then it is useless.<br />
<br />
== References ==<br />
[1] X. Wang, S. Wang, C. Chi, S. Zhang and T. Mei, "Loss Function Search for Face Recognition", in International Conference on Machine Learning, 2020, pp. 1-10.<br />
<br />
[2] Li, C., Yuan, X., Lin, C., Guo, M., Wu, W., Yan, J., and Ouyang, W. Am-lfs: Automl for loss function search. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8410–8419, 2019.<br />
2020].<br />
<br />
[3] S. L. AI, “Reinforcement Learning algorithms - an intuitive overview,” Medium, 18-Feb-2019. [Online]. Available: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc. [Accessed: 25-Nov-2020]. <br />
<br />
[4] “Reinforcement learning,” Wikipedia, 17-Nov-2020. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 24-Nov-2020].<br />
<br />
[5] B. Osiński, “What is reinforcement learning? The complete guide,” deepsense.ai, 23-Jul-2020. [Online]. Available: https://deepsense.ai/what-is-reinforcement-learning-the-complete-guide/. [Accessed: 25-Nov-2020].</div>Jcllauhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Music_Recommender_System_Based_using_CRNNMusic Recommender System Based using CRNN2020-11-25T22:45:17Z<p>J27ni: /* Critiques/ Insights: */</p>
<hr />
<div>==Introduction and Objective:==<br />
<br />
In the digital era of music streaming, companies, such as Spotify and Pandora, are faced with the following challenge: can they provide users with relevant and personalized music recommendations amidst the ever-growing abundance of music and user data?<br />
<br />
The objective of this paper is to implement a personalized music recommender system that takes user listening history as input and continually finds new music that captures individual user preferences.<br />
<br />
This paper argues that a music recommendation system should vary from the general recommendation system used in practice since it should combine music feature recognition and audio processing technologies to extract music features, and combine them with data on user preferences.<br />
<br />
The authors of this paper took a content-based music approach to build the recommendation system - specifically, comparing the similarity of features based on the audio signal.<br />
<br />
The following two-method approach for building the recommendation system was followed:<br />
#Make recommendations including genre information extracted from classification algorithms.<br />
#Make recommendations without genre information.<br />
<br />
The authors used convolutional recurrent neural networks (CRNN), which is a combination of 2D convolutional neural networks (CNN) and recurrent neural network(RNN), as their main classification model.<br />
<br />
==Methods and Techniques:==<br />
Generally, a music recommender can be divided into three main parts: (i) users, (ii) items, and (iii) user-item matching algorithms. Firstly, a model for a user's music taste is generated based on their profiles. Secondly, item profiling based on editorial, cultural, and acoustic metadata is exploited to increase listener satisfaction. Thirdly, a matching algorithm is employed to recommend personalized music to the listener. Two main approaches are currently available;<br />
<br />
1. Collaborative filtering<br />
<br />
It is based on users' historical listening data and depends on user ratings. Nearest neighbour is the standard method used for collaborative filtering and can be broken into two classes of methods: (i) user-based neighbourhood methods and (ii) item-based neighbourhood methods. <br />
<br />
User-based neighbourhood methods calculate the similarity between the target user and other users, and selects the k most similar. A weighted average of the most similar users' song ratings is then computed to predict how the target user would rate those songs. Songs that have a high predicted rating are then recommended to the user. In contrast, methods that use item-based neighbourhoods calculate similarities between songs that the target user has rated well and songs they have not listened to in order to recommend songs.<br />
<br />
That being said, collaborative filtering faces many challenges. For example, given that each user sees only a small portion of all music libraries, sparsity and scalability become an issue. However, this can be dealt with using matrix factorization. A more difficult challenge to overcome is the fact that users often don't rate songs when they are listening to music. <br />
<br />
2. Content-based filtering<br />
<br />
Content based recommendation systems base their recommendations on the similarity of an items features and features that the user has enjoyed. It has two-steps; (i) Extract audio content features and (ii) predict user preferences.<br />
<br />
However content-based filtering has to overcome the challenge of only being able to predict based on users' existing interests. The model is unable to effectively scale to a user's ever changing music taste.<br />
<br />
In this work, the authors take a content-based approach, as they compare the similarity of audio signal features to make recommendations. To classify music, the original music’s audio signal is converted into a spectrogram image. Using the image and the Short Time Fourier Transform (STFT), we convert the data into the Mel scale which is used in the CNN and CRNN models. <br />
=== Mel Scale: === <br />
The scale of pitches that are heard by listeners, which translates to equal pitch increments.<br />
<br />
[[File:Mel.png|frame|none|Mel Scale on Spectrogram]]<br />
<br />
=== Short Time Fourier Transform (STFT): ===<br />
The transformation that determines the sinusoidal frequency of the audio, with a Hanning smoothing function. In the continuous case this is written as: <math>\mathbf{STFT}\{x(t)\}(\tau,\omega) \equiv X(\tau, \omega) = \int_{-\infty}^{\infty} x(t) w(t-\tau) e^{-i \omega t} \, d t </math><br />
<br />
where: <math>w(\tau)</math> is the Hanning smoothing function. The STFT is applied over a specified window length at a certain time allowing the frequency to represented for that given window rather than the entire signal as a typical Fourier Transform would.<br />
<br />
=== Convolutional Neural Network (CNN): ===<br />
A Convolutional Neural Network is a Neural Network that uses convolution in place of matrix multiplication for some layer calculations. By training the data, weights for inputs are updated to find the most significant data relevant to classification. These convolutional layers gather small groups of data with kernels and try to find patterns that can help find features in the overall data. The features are then used for classification. Padding is another technique used to extend the pixels on the edge of the original image to allow the kernel to more accurately capture the borderline pixels. Padding is also used if one wishes the convolved output image to have a certain size. The image on the left represents the mathematical expression of a convolution operation, while the right image demonstrates an application of a kernel on the data.<br />
<br />
[[File:Convolution.png|thumb|400px|left|Convolution Operation]]<br />
[[File:PaddingKernels.png|thumb|400px|center|Example of Padding (white 0s) and Kernels (blue square)]]<br />
<br />
=== Convolutional Recurrent Neural Network (CRNN): === <br />
The CRNN is similar to the architecture of a CNN, but with the addition of a Gated Recurrent Unit (GRU), which is a Recurrent Neural Network (RNN). An RNN is used to treat sequential data, by reusing the activation function of previous nodes to update the output. The GRU is used to store more long-term memory and will help train the early hidden layers. GRUs can be thought of as LSTMs but with a forget gate, and has fewer parameters than an LSTM. These gates are used to determine how much information from the past should be passed along onto the future. They are originally aimed to prevent the vanishing gradient problem, since deeper networks will result in smaller and smaller gradients at each layer. The GRU can choose to copy over all the information in the past, thus eliminating the risk of vanishing gradients.<br />
<br />
[[File:GRU441.png|thumb|400px|left|Gated Recurrent Unit (GRU)]]<br />
[[File:Recurrent441.png|thumb|400px|center|Diagram of General Recurrent Neural Network]]<br />
<br />
==Data Screening:==<br />
<br />
The authors of this paper used a publicly available music dataset made up of 25,000 30-second songs from the Free Music Archives which contains 16 different genres. The data is cleaned up by removing low audio quality songs, wrongly labelled genres and those that have multiple genres. To ensure a balanced dataset, only 1000 songs each from the genres of classical, electronic, folk, hip-hop, instrumental, jazz and rock were used in the final model. <br />
<br />
[[File:Data441.png|thumb|200px|none|Data sorted by music genre]]<br />
<br />
==Implementation:==<br />
<br />
=== Modeling Neural Networks ===<br />
<br />
As noted previously, both CNNs and CRNNs were used to model the data. The advantage of CRNNs is that they are able to model time sequence patterns in addition to frequency features from the spectrogram, allowing for greater identification of important features. Furthermore, feature vectors produced before the classification stage could be used to improve accuracy. <br />
<br />
In implementing the neural networks, the Mel-spectrogram data was split up into training, validation, and test sets at a ratio of 8:1:1 respectively and labelled via one-hot encoding. This made it possible for the categorical data to be labelled correctly for binary classification. As opposed to classical stochastic gradient descent, the authors opted to use binary classier and ADAM optimization to update weights in the training phase, and parameters of <math>\alpha = 0.001, \beta_1 = 0.9, \beta_2 = 0.999</math>. Binary cross-entropy was used as the loss function. <br />
Input spectrogram image are 96x1366. In both the CNN and CRNN models, the data was trained over 100 epochs with a batch size of 50 (limited computing power) and using binary cross-entropy as the loss function. Notable model specific details are below:<br />
<br />
'''CNN'''<br />
* Five convolutional layers with 3x3 kernel, stride 1, padding, batch normalization, and ReLU activation<br />
* Max pooling layers <br />
* The sigmoid function was used as the output layer<br />
<br />
'''CRNN'''<br />
* Four convolutional layers with 3x3 kernel (which construct a 2D temporal pattern - two layers of RNNs with Gated Recurrent Units), stride 1, padding, batch normalization, ReLU activation, and dropout rate 0.1<br />
* Feature maps are N x1x15 (N = number of features maps, 68 feature maps in this case) is used for RNNs.<br />
* 4 Max pooling layers for four convolutional layers with kernel ((2x2)-(3x3)-(4x4)-(4x4)) and same stride<br />
* The sigmoid function was used as the output layer<br />
<br />
The CNN and CRNN architecture is also given in the charts below.<br />
<br />
[[File:CNN441.png|thumb|800px|none|Implementation of CNN Model]]<br />
[[File:CRNN441.png|thumb|800px|none|Implementation of CRNN Model]]<br />
<br />
=== Music Recommendation System ===<br />
<br />
The recommendation system utilizes cosine similarity of the extracted features from the neural network to compute similarity. Each genre will have a song act as a centre point for each class. The final inputs of the trained neural networks will be the feature variables. The feature variables will be used in the cosine similarity to find the best recommendations. <br />
<br />
The values are between [-1,1], where larger values are songs that have similar features. When the user inputs five songs, those songs become the new inputs in the neural networks and the features are used by the cosine similarity with other music. The largest five cosine similarities are used as recommendations.<br />
[[File:Cosine441.png|frame|100px|none|Cosine Similarity]]<br />
<br />
== Evaluation Metrics ==<br />
=== Precision: ===<br />
* The proportion of True Positives with respect to the '''predicted''' positive cases (true positives and false positives)<br />
* For example, out of all the songs that the classifier '''predicted''' as Classical, how many are actually Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among those predicted to be of that certain genre<br />
<br />
=== Recall: ===<br />
* The proportion of True Positives with respect to the '''actual''' positive cases (true positives and false negatives)<br />
* For example, out of all the songs that are '''actually''' Classical, how many are correctly predicted to be Classical?<br />
* Describes the rate at which the classifier predicts the true genre of songs among the correct instances of that genre<br />
<br />
=== F1-Score: ===<br />
An accuracy metric that combines the classifier’s precision and recall scores by taking the harmonic mean between the two metrics:<br />
<br />
[[File:F1441.png|frame|100px|none|F1-Score]]<br />
<br />
=== Receiver operating characteristics (ROC): ===<br />
* A graphical metric that is used to assess a classification model at different classification thresholds <br />
* In the case of a classification threshold of 0.5, this means that if <math>P(Y = k | X = x) > 0.5</math> then we classify this instance as class k<br />
* Plots the true positive rate versus false positive rate as the classification threshold is varied<br />
<br />
[[File:ROCGraph.jpg|thumb|400px|none|ROC Graph. Comparison of True Positive Rate and False Positive Rate]]<br />
<br />
=== Area Under the Curve (AUC) ===<br />
AUC is the area under the ROC in doing so, the ROC provides an aggregate measure across all possible classification thresholds.<br />
<br />
In the context of the paper: When scoring all songs as <math>Prob(Classical | X=x)</math>, it is the probability that the model ranks a random Classical song at a higher probability than a random non-Classical song.<br />
<br />
[[File:AUCGraph.jpg|thumb|400px|none|Area under the ROC curve.]]<br />
<br />
== Results ==<br />
=== Accuracy Metrics ===<br />
The table below is the accuracy metrics with the classification threshold of 0.5.<br />
<br />
[[File:TruePositiveChart.jpg|thumb|none|True Positive / False Positive Chart]]<br />
On average, CRNN outperforms CNN in true positive and false positive cases. In addition, it is very apparent that false positive rate is much more higher for songs in the Instrumental genre, perhaps indicating that more pre-processing needs to be done for songs in this genre or that it should be excluded from the analysis completely since most music incorporates instrumental components.<br />
<br />
<br />
[[File:F1Chart441.jpg|thumb|400px|none|F1 Chart]]<br />
On average, CRNN outperforms CNN in F1-score. <br />
<br />
<br />
[[File:AUCChart.jpg|thumb|400px|none|AUC Chart]]<br />
On average, CRNN also outperforms CNN in AUC metric.<br />
<br />
<br />
CRNN models that consider the frequency features and time sequence patterns of songs have a better classification performance through metrics such as F1 score and AUC compared to the CNN classifier.<br />
<br />
=== Evaluation of Music Recommendation System: ===<br />
<br />
* A listening experiment was performed with 30 participants to assess user responses to given music recommendations.<br />
* Participants choose 5 pieces of music they enjoy and the recommender system generates 5 new recommendations. The participants then evaluate the recommendation by recording whether they liked or disliked the music recommendation<br />
* The recommendation system takes two approaches to the recommendation:<br />
** Method one uses only the value of cosine similarity.<br />
** Method two uses the value of cosine similarity and information on music genre.<br />
*Perform test of significance of differences in average user likes between the two methods using a t-statistic:<br />
[[File:H0441.png|frame|100px|none|Hypothesis test between method 1 and method 2]]<br />
<br />
Comparing the two methods, <math> H_0: u_1 - u_2 = 0</math>, we have <math> t_{stat} = -4.743 < -2.037 </math>, which demonstrates that the increase in average user likes with the addition of music genre information is statistically significant.<br />
<br />
== Conclusion: ==<br />
<br />
The two two main conclusions obtained from this paper:<br />
<br />
* The music genre should be a key feature to increase the predictive capabilities of the music recommendation system.<br />
<br />
* To extract the song genre from a song’s audio signals and get overall better performance, CRNN’s are superior to CNN’s as they consider frequency in features and time sequence patterns of audio signals. <br />
<br />
According to the paper, the authors suggested adding other music features like tempo gram for capturing local tempo as a way to improve the accuracy of the recommender system.<br />
<br />
== Critiques/ Insights: ==<br />
# It would be helpful if authors bench-mark their novel approach with other recommendation algorithms such as collaborative filtering to see if there is a lift in predictive capabilities.<br />
# The listening experiment used to evaluate the recommendation system only includes songs that are outputted by the model. Users may be biased if they believe all songs have come from a recommendation system. To remove bias, we suggest having 15 songs where 5 songs are recommended and 10 songs are set. With this in the user’s mind, it may remove some bias in response and give more accurate predictive capabilities. <br />
# It would be better if they go into more details about how CRNN makes it perform better than CNN, in terms of attributes of each network. Also, it might be helpful to include the motivations why CNN and CRNN are useful in this specific problem. <br />
# The methodology introduced in this paper is probably also suitable for movie recommendations. As music is presented as spectrograms (images) in a time sequence, and it is very similar to a movie. <br />
# The way of evaluation is a very interesting approach. Since it's usually not easy to evaluate the testing result when it's subjective. By listing all these evaluations' performance, the result would be more comprehensive. A practice that might reduce bias is by coming back to the participants after a couple of days and asking whether they liked the music that was recommended. Often times music "grows" on people and their opinion of a new song may change after some time has passed. <br />
# The paper lacks the comparison between the proposed algorithm and the music recommendation algorithms being used now. It will be clearer to show the superiority of this algorithm.<br />
# The GAN neural network has been proposed to enhance the performance of the neural network, so an improved result may appear after considering using GAN.<br />
# The limitation of CNN and CRNN could be that they are only able to process the spectrograms with single labels rather than multiple labels. This is far from enough for the music recommender systems in today's music industry since the edges between various genres are blurred.<br />
# Is it possible for CNN and CRNN to identify different songs? The model would be harder to train, based on my experience, the efficiency of CNN in R is not very high, which can be improved for future work.<br />
# According to the author, the recommender system is done by calculating the cosine similarity of extraction features from one music to another music. Is possible to represent it by Euclidean distance or p-norm distances?<br />
# In real-life application, most of the music software will have the ability to recommend music to the listener and ask do they like the music that was recommended. It would be a nice application by involving some new information from the listener.<br />
# Actual music listeners do not listen to one genre of music, and in fact listening to the same track or the same genre would be somewhat unusual. Could this method be used to make recommendations not on genre, but based on other categories? (Such as the theme of the lyrics, the pitch of the singer, or the date published). Would this model be able to differentiate between tracks of varying "lyric vocabulation difficulty"? Or would NLP algorithms be needed to consider lyrics?<br />
# This model can be applied to many other fields such as recommending the news in the news app, recommending things to buy in the amazon, recommending videos to watch in YOUTUBE and so on based on the user information.<br />
# Looks like for the most genres, CRNN outperforms CNN, but CNN did do better on a few genres (like Jazz), so it might be better to mix them together or might use CNN for some genres and CRNN for the rest.<br />
# Cosine similarity is used to find songs with similar patterns as the input ones from users. That is, feature variables are extracted from the trained neural network model before the classification layer, and used as the basis to find similar songs. One potential problem of this approach is that if the neural network classifies an input song incorrectly, the extracted feature vector will not be a good representation of the input song. Thus, a song that is in fact really similar to the input song may have a small cosine similarity value, i.e. not be recommended. In conclusion, if the first classification is wrong, future inferences based on that is going to make it deviate further from the true answer. A possible future improvement will be how to offset this inference error.<br />
# In the tables when comparing performance and accuracies of the CNN and CRNN models on different genres of music, the researchers claimed that CRNN had superior performance to CNN models. This seemed intuitive, especially in the cases when the differences in accuracies were large. However, maybe the researchers should consider including some hypothesis testing statistics in such tables, which would support such claims in a more rigorous manner.<br />
# A music recommender system that doesn't use the song's meta data such as artist and genre and rather tries to classify genre itself seems unproductive. I also believe that the specific artist matters much more than the genre since within a genre you have many different styles. It just seems like the authors hamstring their recommender system by excluding other relevant data.<br />
# The genres that are posed in the paper are very broad and may not be specific enough to distinguish a listeners actual tastes (ie, I like rock and roll, but not punk rock, which could both be in the "rock" category). It would be interesting to run similar experiments with more concrete and specific genres to study the possibility of improving accuracy in the model.<br />
# This summary is well organized with detailed explanation to the music recommendation algorithm. However, since the data used in this paper is cleaned to buffer the efficiency of the recommendation, there should be a section evaluating the impact of noise on the performance this algorithm and how to minimize the impact.<br />
# This method will be better if the user choose some certain music genres that they like while doing the sign-up process. This is similar to recommending articles on twitter.<br />
# I have some feedback for the "Evaluation of Music Recommendation System" section. Firstly, there can be a brief mention of the participants' background information. Secondly, the summary mentions that "participants choose 5 pieces of music they enjoyed". Are they free to choose any music they like, or are they choosing from a pool of selections? What are the lengths of these music pieces? Lastly, method one and method two are compared against each other. It's intuitive that method two will outperform method one, since method two makes use of both cosine similarity and information on music genre, whereas method one only makes use of cosine similarity. Thus, saying method two outperforms method one is not necessarily surprising. I would like to see more explanation on why these methods are chosen, and why comparing them directly is considered to be fair.<br />
# It would be better to have more comparison with other existing music recommender system.<br />
# In the Collecting Music Data section, the author has indicated that for maintaining the balance of data for each genre that they are choosing to omit some genres and a portion of the dataset. However, how this was done was not explained explicitly which can be a concern for results replication. It would be better to describe the steps and measures taken to ensure the actions taken by the teams are reproducible. <br />
# For cleaning data, for training purposes, the team is choosing to omit the ones with lower music quality. While this is a sound option, it can be adjusted that the ratings for the music are deducted to adjust the balance. This could be important since a poor music quality could mean either equipment failure or corrupt server storage or it was a recording of a live performance that often does not have a perfect studio quality yet it would be loved by many real-life users. This omission is not entirely justified and feels like a deliberate adjustment for later results.<br />
# It would be more convincing if the author could provide more comparison between CRNN and CNN.<br />
# How is the result used to recommend songs within genres? It looks like it only predicts what genre the user likes to listen and recommends one of the songs from that genre. How can this recommender system be used to recommend songs within the same genre?<br />
# This [https://arxiv.org/pdf/2006.15795.pdf paper] implements CRNN differently; the CNN and RNN are separate and their resulting matrices and combined later. Would using this version of the CRNN potentially improve the accuracy?<br />
# This kind of approach can be used in implementing other recommender systems for, like movies, articles, news, websites etc. It would be helpful if the author could explain and generalize the implementation on other forms of recommender systems.<br />
# The accuracy of the genre classifier seemed really low, considering how distinct the genres sound to humans. The authors recommend adding features to the data but these could likely be extracted from the audio signal. Extra preprocessing would likely go a long way to improve the accuracy.<br />
# Since it was mentioned that different genres were used, it would be interesting to know if the model can classify different languages and how it performs with songs in different languages.<br />
# It is possible to extend this application to classifying baroque, classical, and romantic genre music. This can be beneficial for students (and frankly, people of all ages) who are learning about music. What's even more interesting to see is if this algorithm can distinguish music pieces written by classical musicians such as Beethoven, Haydn, and Mozart. Of course, it would take more effort in distinguishing features across the music pieces of these three artists, but it's an area worth exploring.<br />
# In contrast to the mel spectrogram method, the popular streaming app Spotify allows you to view data they collect that includes features about the nature of the of song such as acousticness, danceability, loudness, tempo, etc.<br />
# The authors introduced a good recommendation system. It might be helpful to evaluate just based on how similar users are. Meanwhile, there might be some bias need to be considered in the dataset. PCA might be a good way to analyze the existing pattern.<br />
# This models needs a way to introduce variance into its recommendations. The kind of music you wnat to listen to is largely dependent on non quantifiable factors like mood, but there is evidence to support that it depends on current events/activities. A way to introduce those factors into the recommendation system need to be looked at.<br />
# The music recommender is based on big data of what the users listen to. It would be better if the summary can explain more about how this problem is very similar to image detection problem, and how they are related to each other. This way it is better for the audiences to make connection.<br />
<br />
== References: ==<br />
Nilashi, M., et.al. ''Collaborative Filtering Recommender Systems''. Research Journal of Applied Sciences, Engineering and Technology 5(16):4168-4182, 2013.<br />
Adiyansjah, Alexander A S Gunawan, Derwin Suhartono, Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks, Procedia Computer Science, https://doi.org/10.1016/j.procs.2019.08.146.</div>J2battishttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISIONCRITICAL ANALYSIS OF SELF-SUPERVISION2020-11-25T06:03:50Z<p>P2torabi: </p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of the state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. They were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find good generalized image representations. <br />
In self-supervised learning, unlabeled data is used to generate ground truth labels, such as the Jigsaw puzzle task[6], and the rotation estimation[3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way to find the rotation axis, as can be seen in the figure below. The intuition is that if a deep network can tell if a bird is upside down or not, perhaps it has learned a semantically relevant representation without the need for hand-labelling.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods. <br />
<br />
* Generative models: Generative Adversarial Networks (GANs), learn to generate images in an adversarial manner. They consist of a generator network which maps noise samples to image samples and a discriminator network whose task is to distinguish the fake images from the real ones. These two are trained together until the point where the fake images are indistinguishable. BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. <br />
* In RotNet method [3], images are rotated and the CNN learns to figure out the direction. Therefore, this task is a 4-way classification task. Most images are taken upright which could be considered as labeled images with label 90 degrees. The authors of RotNet argue that the concept of 'upright' is hard to understand and requires high-level knowledge about the image, so this task encourages the network to discover more complex information about the images. <br />
* DeepCluster [4] alternates between k-means clustering step, in which pseudo-labels are assigned to the data by k-means on the PCA-reduced features, and the learning step in which the model tries to learn to fit the representation to these labels(cluster IDs) under several image transformations. These transformations include random resized crops with <math> \beta = 0.08 </math> and <math> \gamma = \frac{3}{4}</math> and horizontal flips.<br />
<br />
* In Jigsaw task [6], the unlabelled images are divided into nine patches and then, the patches are permuted randomly to create a new image. Then, a deep neural network is trained to predict the permutation of patches in the perturbed image.<br />
<br />
Following is the work done in the domain of learning from a single image:<br />
<br />
* Rodriguez et al. [7] used max-margin correlation filters to learn robust tracking templates from a single sample of the patch.<br />
* Malisiewicz et al. [8] used a semi-parametric exemplar SVM model where the model uses one positive sample and separates it from thousands of negative samples mined from the background.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner. The author uses the ResNet-50 to compute the image and the transpose of this image. The method is evaluated by multiple datasets, and the tasks majorly focus on object detection and image classification. Jigsaw ResNet-50, introduced by Priya Goyal, was utilized as a baseline of the experiment. <br />
<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned. The discrimination power at each layer under self-supervision is then compared to that of a fully supervised model classically trained.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
=== Choice of augmentations ===<br />
<br />
Here we describe how <math>N</math> surce images get expanded to an additional <math>d-N</math>images, where <math>d</math> is much larger and independent to <math>N</math>. <br />
<br />
Given a source image of size <math>H \times W</math>, extract random patches of size <math>(w,h)</math>. Set <math>\beta , \gamma </math> such that <math>\beta \leq \frac{wh}{WH}</math> and <math>\gamma \leq \frac{h}{w} \leq \gamma^{-1}</math>. The smalles size of crops is at least <math>\beta WH</math>. Changes in aspect ratio are limited by <math>\gamma</math>. In practice <math>\beta = 0.0001, \gamma = 0.75</math> are good choices.<br />
<br />
Second, images are rotated by <math>\alpha</math> degrees, where <math>-35 \leq \alpha \leq 35</math>. Images are flipped with 50% probability.<br />
<br />
Finally, colour and intensity of single pixels are linearly transformed to provide changes of illumination, as is common in natural images.<br />
<br />
=== Quantitative Analysis ===<br />
They compared the learned filters of all first-layer convolutions of an AlexNet trained with the different methods and a single image. Showed how the results of retraining a network with the first two convolutional filters, or the scattering transform from (Oyallon et al., 2017), left frozen. They also observed that their single image trained DeepCluster and BiGAN models achieve performances closes to the supervised benchmark. Lastly, they show how their features trained on only a single image can be used for other applications.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable using a single image, as compared to fully supervised performance using the entire dataset. Table 1 indicates the classification accuracy of the linear classifier trained on the top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
[[File:Capture123.PNG|500px|center]]<br />
<div align="center">'''Table1 :''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
<br />
[[File:critical_analysis.png|500px|center]]<br />
<br />
The above table (Table 3) corresponds to the Accuracy of linear classifiers on different network layers on CIFAR-10 and CIFAR-100 datasets.<br />
<br />
[[File:pretrain.png|500px|center]]<br />
<br />
In table 4, the authors fine-tuned a convolution neural network with the first two filters left frozen. They achieved almost benchmark results with just a single image. This tells us that a single image is sufficient for training the first two convolutional filter banks.<br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conducted interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they only trained using only a single image with much data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Thus, current unsupervised learning benefits from data augmentation more than a larger dataset. The results seem to indicate that we probably do not use the full semantic capacity of a million images yet.<br />
<br />
== Critique == <br />
This is a well-written paper. However, as the main contribution of the paper is experimental, I expected a more in-depth analysis. For example, it is interesting to see how these results change if we change AlexNet with a more powerful CNN like EfficientNet? Also, the authors could try other types of Self-Supervised tasks such as jigsaw task and state-of-the-art PIRL [8].<br />
<br />
It would be interesting to consider and compare the effects of each augmentation strategy in terms of performance. Additionally, it may be worthwhile to try other augmentation techniques like Gaussian smoothing and see the impact on the learning performance. <br />
<br />
It would be really beneficial to apply a more challenging dataset, with objects in clutter, occlusion, and wider pose variation, inter-image invariance can be more effective, as it is used in this paper [10]. It will help us to understand the author's methodology if it encourages intra image invariance, unlike the objective of contrastive learning like the proposed in [10] or not.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[7] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In<br />
Proc. ICCV, 2011.<br />
<br />
[8] A. Rodriguez, V. Naresh Boddeti, BVK V. Kumar, and A. Mahalanobis. Maximum margin correlation filter: A new approach for localization and classification. IEEE Transactions on Image Processing, 22(2):631–643, 2013<br />
<br />
[9] I. Misra and L. van der Maaten, "Self-Supervised Learning of Pretext-Invariant Representations," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.<br />
<br />
[10] Cheng, Z., Su, J.-C., and Maji, S., “Unsupervised Discovery of Object Landmarks via Contrastive Learning”, <i>arXiv e-prints</i>, 2020.</div>Mrasoolihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Research_Papers_Classification_SystemResearch Papers Classification System2020-11-24T22:02:43Z<p>J27ni: /* Critique */</p>
<hr />
<div>= Presented by =<br />
Jill Wang, Junyi (Jay) Yang, Yu Min (Chris) Wu, Chun Kit (Calvin) Li<br />
<br />
= Introduction =<br />
With the increasing advance of computer science and information technology, an increasingly overwhelming number of papers have been published. Due to the great volume, it has become incredibly hard to find and categorize papers. Typically this is a very time consuming activity. This paper introduces a paper classification system that utilizes the Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), and K-means clustering. The most important technology the system used to process big data is the Hadoop Distributed File System (HDFS). The system is highly fault-tolerant and can be deployed on low-cost hardware. It can handle quantitatively complex research paper classification problems efficiently and accurately.<br />
<br />
===General Framework===<br />
<br />
The paper classification system classifies research papers based on the abstracts given that the core of most papers is presented in the abstracts. <br />
<br />
[[ File:Systemflow.png |right|image on right| 400px]]<br />
<ol><li>Paper Crawling <br />
<p>Collects abstracts and keywords from research papers published during a given period</p></li><br />
<li>Preprocessing<br />
<p> <ol style="list-style-type:lower-alpha"><li>Removes stop words in the papers crawled, in which only nouns are extracted from the papers</li><br />
<li>generates a keyword dictionary, keeping only the top-N keywords with the highest frequencies</li> </ol><br />
</p></li> <br />
<li>Topic Modelling<br />
<p> Use the LDA to group the keywords into topics</p><br />
</li><br />
<li>Paper Length Calculation<br />
<p> Calculates the total number of occurrences of words to prevent an unbalanced TF values caused by the various length of abstracts using the map-reduce algorithm</p><br />
</li><br />
<li>Word Frequency Calculation<br />
<p> Calculates the Term Frequency (TF) values which represent the frequency of keywords in a research paper</p><br />
</li><br />
<li>Document Frequency Calculation<br />
<p> Calculates the Document Frequency (DF) values which represents the frequency of keywords in a collection of research papers. The higher the DF value, the lower the importance of a keyword.</p><br />
</li><br />
<li>Term frequency-inverse document frequency (TF-IDF) calculation<br />
<p> Calculates the inverse of the DF which represents the importance of a keyword.</p><br />
</li><br />
<li>Paper Classification<br />
<p> Classify papers by topics and similar subjects using the K-means clustering algorithm.</p><br />
</li><br />
</ol><br />
<br />
===Technologies===<br />
<br />
The HDFS with a Hadoop cluster composed of one master node, one sub-node, and four data nodes is what is used to process the massive paper data. Hadoop-2.6.5 version in Java is what is used to perform the TF-IDF calculation. Spark MLlib is what is used to perform the LDA. The Scikit-learn library is what is used to perform the K-means clustering.<br />
<br />
===HDFS===<br />
<br />
Hadoop Distributed File System (HDFS) was used to process big data in this system. HFDS has been shown to process big data rapidly and stably with high scalability which makes it a perfect choice for this problem. What Hadoop does is to break a big collection of data into different partitions and pass each partition to one individual processor. Each processor will only have information about the partition of data it has received.<br />
<br />
'''In this summary, we are going to focus on introducing the main algorithms of what this system uses, namely LDA, TF-IDF, and K-Means.'''<br />
<br />
=Data Preprocessing=<br />
===Crawling of Abstract Data===<br />
<br />
Under the assumption that audiences tend to first read the abstract of a paper to gain an overall understanding of the material, it is reasonable to assume the abstract section includes “core words” that can be used to effectively classify a paper's subject.<br />
<br />
An abstract is crawled to have its stop words removed. Stop words are words that are usually ignored by search engines, such as “the”, “a”, and etc. Afterward, nouns are extracted, as a more condensed representation for efficient analysis.<br />
<br />
This is managed on HDFS. The TF-IDF value of each paper is calculated through map-reduce, an easy-to-use programming model and implementation for processing and generating large data sets. The user must specify (i) a map procedure, that filters and sorts the input data to produce a set of intermediate key/value pairs and (ii) a reduce function, which performs a summary operation on the intermediate values with the same key and returns a smaller set of output key/value pairs. The MapReduce interface enables this process by grouping the intermediate values with the same key and passing them as input to the reduce function. For example, one could count the number of times various words appear in a large number of documents by setting your map procedure to count the number of occurrences of each word in a single document, and your reduce function to sum all counts of a given word [[https://dl.acm.org/doi/pdf/10.1145/1327452.1327492?casa_token=_Zg_DWxQzKEAAAAA:EHII0CaP36_ojGMT8huqTGLNMSEc-CKzZAoXBxSXe6pr2WB0DCQvEKa30CFQW0NSbB2-CVo8GcBcJAg 1]].<br />
<br />
===Managing Paper Data===<br />
<br />
To construct an effective keyword dictionary using abstract data and keywords data in all of the crawled papers, the authors categorized keywords with similar meanings using a single representative keyword. The approach is called stemming, which is common in cleaning data, where words are reduced to their word stem. An example is "running" and "ran" would be reduced to "run". 1394 keyword categories are extracted, which is still too much to compute. Hence, only the top 30 keyword categories are used.<br />
<br />
<div align="center">[[File:table_1_kswf.JPG|700px]]</div><br />
<br />
=Topic Modeling Using LDA=<br />
<br />
Latent Dirichlet allocation (LDA) is a generative probabilistic model that views documents as random mixtures over latent topics. Each topic is a distribution over words, and the goal is to extract these topics from documents.<br />
<br />
LDA estimates the topic-word distribution <math>P\left(t | z\right)</math> (probability of word "z" having topic "t") and the document-topic distribution <math>P\left(z | d\right)</math> (probability of finding word "z" within a given document "d") using Dirichlet priors for the distributions with a fixed number of topics. For each document, obtain a feature vector:<br />
<br />
\[F = \left( P\left(z_1 | d\right), P\left(z_2 | d\right), \cdots, P\left(z_k | d\right) \right)\]<br />
<br />
In the paper, authors extract topics from preprocessed paper to generate three kinds of topic sets, each with 10, 20, and 30 topics respectively. The following is a table of the 10 topic sets of highest frequency keywords.<br />
<br />
<div align="center">[[File:table_2_tswtebls.JPG|700px]]</div><br />
<br />
<br />
===LDA Intuition===<br />
<br />
LDA uses the Dirichlet priors of the Dirichlet distribution, which allows the algorithm to model a probability distribution ''over prior probability distributions of words and topics''. The following picture illustrates 2-simplex Dirichlet distributions with different alpha values, one for each corner of the triangles. <br />
<br />
<div align="center">[[File:dirichlet_dist.png|700px]]</div><br />
<br />
Simplex is a generalization of the notion of a triangle in k-1 dimension where k is the number of classes. For example, if you wish to classify essays into three groups, English, History and Math then the simplex would be a 2 dimension triangle, if you add philosophy as one of your potential class, then we would need a tetrahedron in 3 deminsion. In Dirichlet distribution, each parameter will be represented by a corner in simplex, so adding additional parameters implies increasing the dimensions of simplex. As illustrated, when alphas are smaller than 1 the distribution is dense at the corners. When the alphas are greater than 1 the distribution is dense at the centers.<br />
<br />
The following illustration shows an example LDA with 3 topics, 4 words and 7 documents.<br />
<br />
<div align="center">[[File:LDA_example.png|800px]]</div><br />
<br />
In the left diagram, there are three topics, hence it is a 2-simplex. In the right diagram there are four words, hence it is a 3-simplex. LDA essentially adjusts parameters in Dirichlet distributions and multinomial distributions (represented by the points), such that, in the left diagram, all the yellow points representing documents and, in the right diagram, all the points representing topics, are as close to a corner as possible. In other words, LDA finds topics for documents and also finds words for topics. At the end topic-word distribution <math>P\left(t | z\right)</math> and the document-topic distribution <math>P\left(z | d\right)</math> are produced.<br />
<br />
=Term Frequency Inverse Document Frequency (TF-IDF) Calculation=<br />
<br />
TF-IDF is widely used to evaluate the importance of a set of words in the fields of information retrieval and text mining. It is a combination of term frequency (TF) and inverse document frequency (IDF). The idea behind this combination is<br />
* It evaluates the importance of a word within a document<br />
* It evaluates the importance of the word among the collection of all documents<br />
<br />
The inverse of the document frequency accounts for the fact that term frequency will naturally increase as document frequency increases. Thus IDF is needed to counteract a word's TF to give an accurate representation of a word's importance.<br />
<br />
The TF-IDF formula has the following form:<br />
<br />
\[TF-IDF_{i,j} = TF_{i,j} \times IDF_{i}\]<br />
<br />
where i stands for the <math>i^{th}</math> word and j stands for the <math>j^{th}</math> document.<br />
<br />
===Term Frequency (TF)===<br />
<br />
TF evaluates the percentage of a given word in a document. Thus, TF value indicates the importance of a word. The TF has a positive relation with the importance.<br />
<br />
In this paper, we only calculate TF for words in the keyword dictionary obtained. For a given keyword i, <math>TF_{i,j}</math> is the number of times word i appears in document j divided by the total number of words in document j.<br />
<br />
The formula for TF has the following form:<br />
<br />
\[TF_{i,j} = \frac{n_{i,j} }{\sum_k n_{k,j} }\]<br />
<br />
where i stands for the <math>i^{th}</math> word, j stands for the <math>j^{th}</math> document, <math>n_{i,j}</math> stands for the number of times words <math>t_i</math> appear in document <math>d_j</math> and <math>\sum_k n_{k,j} </math> stands for total number of occurence of words in document <math>d_j</math>.<br />
<br />
Note that the denominator is the total number of words remaining in document j after crawling.<br />
<br />
===Document Frequency (DF)===<br />
<br />
DF evaluates the percentage of documents that contain a given word over the entire collection of documents. Thus, the higher DF value is, the less important the word is.<br />
<br />
<math>DF_{i}</math> is the number of documents in the collection with word i divided by the total number of documents in the collection. The formula for DF has the following form:<br />
<br />
\[DF_{i} = \frac{|d_k \in D: n_{i,k} > 0|}{|D|}\]<br />
<br />
where <math>n_{i,k}</math> is the number of times word i appears in document k, |D| is the total number of documents in the collection.<br />
<br />
Since DF and the importance of the word have an inverse relation, we use inverse document frequency (IDF) instead of DF.<br />
<br />
===Inverse Document Frequency (IDF)===<br />
<br />
In this paper, IDF is calculated in a log scale. Since we will receive a large number of documents, i.e, we will have a large |D|<br />
<br />
The formula for IDF has the following form:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|}{|\{d_k \in D: n_{i,k} > 0\}|}\right)\]<br />
<br />
As mentioned before, we will use HDFS. The actual formula applied is:<br />
<br />
\[IDF_{i} = log\left(\frac{|D|+1}{|\{d_k \in D: n_{i,k} > 0\}|+1}\right)\]<br />
<br />
The inverse document frequency gives a measure of how rare a certain term is in a given document corpus.<br />
<br />
=Paper Classification Using K-means Clustering=<br />
<br />
K-means clustering is an unsupervised classification algorithm that groups similar data into the same class. It is an efficient and simple method that can be applied to different types of data attributes. It is also flexible enough to handle various kinds of noise and outliers.<br />
<br><br />
<br />
Given a set of <math>d</math> by <math>n</math> dataset <math>\mathbf{X} = \left[ \mathbf{x}_1 \cdots \mathbf{x}_n \right]</math>, the algorithm will assign each <math>\mathbf{x}_j</math> into <math>k</math> different clusters based on the characteristics of <math>\mathbf{x}_j</math> itself.<br />
<br><br />
<br />
Moreover, when assigning data into a cluster, the algorithm will also try to minimise the distances between the data and the centre of the cluster which the data belongs to. That is, k-means clustering will minimize the sum of square error:<br />
<br />
\begin{align*}<br />
min \sum_{i=1}^{k} \sum_{j \in C_i} ||x_j - \mu_i||^2<br />
\end{align*}<br />
<br />
where<br />
<ul><br />
<li><math>k</math>: the number of clusters</li><br />
<li><math>C_i</math>: the <math>i^th</math> cluster</li><br />
<li><math>x_j</math>: the <math>j^th</math> data in the <math>C_i</math></li><br />
<li><math>mu_i</math>: the centroid of <math>C_i</math></li><br />
<li><math>||x_j - \mu_i||^2</math>: the Euclidean distance between <math>x_j</math> and <math>\mu_i</math></li><br />
</ul><br />
<br><br />
<br />
K-means Clustering algorithm, an unsupervised algorithm, is chosen because of its advantages to deal with different types of attributes, to run with minimal requirement of domain knowledge, to deal with noise and outliers, to realize clusters with similarities. <br />
<br />
<br />
Since the goal for this paper is to classify research papers and group papers with similar topics based on keywords, the paper uses the K-means clustering algorithm. The algorithm first computes the cluster centroid for each group of papers with a specific topic. Then, it will assign a paper into a cluster based on the Euclidean distance between the cluster centroid and the paper’s TF-IDF value.<br />
<br><br />
<br />
However, different values of <math>k</math> (the number of clusters) will return different clustering results. Therefore, it is important to define the number of clusters before clustering. For example, in this paper, the authors choose to use the Elbow scheme to determine the value of <math>k</math>. The Elbow scheme is a somewhat subjective way of choosing an optimal <math>k</math> that involves plotting the average of the squared distances from the cluster centers of the respective clusters (distortion) as a function of <math>k</math> and choosing a <math>k</math> at which point the decrease in distortion is outweighed by the increase in complexity. Also, to measure the performance of clustering, the authors decide to use the Silhouette scheme. The Silhouette scheme is a measure of how well the objects lie within each cluster. Silhouette scores lie from -1 to 1. A positive score indicates that the object is well-matched with its own cluster, while a negative score indicates the opposite (Kaufman & Rousseeuw, 2005). The results of clustering are validated if the Silhouette scheme returns a value greater than <math>0.5</math>.<br />
<br />
=System Testing Results=<br />
<br />
In this paper, the dataset has 3264 research papers from the Future Generation Computer System (FGCS) journal between 1984 and 2017. For constructing keyword dictionaries for each paper, the authors have introduced three methods as shown below:<br />
<br />
<div align="center">[[File:table_3_tmtckd.JPG|700px]]</div><br />
<br />
<br />
Then, the authors use the Elbow scheme to define the number of clusters for each method with different numbers of keywords before running the K-means clustering algorithm. The results are shown below:<br />
<br />
<div align="center">[[File:table_4_nocobes.JPG|700px]]</div><br />
<br />
According to Table 4, there is a positive correlation between the number of keywords and the number of clusters. In addition, method 3 combines the advantages for both method 1 and method 2; thus, method 3 requires the least clusters in total. On the other hand, the wrong keywords might be presented in papers; hence, it might not be possible to group papers with similar subjects correctly by using method 1 and so method 1 needs the most number of clusters in total.<br />
<br />
<br />
Next, the Silhouette scheme had been used for measuring the performance for clustering. The average of the Silhouette values for each method with different numbers of keywords are shown below:<br />
<br />
<div align="center">[[File:table_5_asv.JPG|700px]]</div><br />
<br />
Since the clustering is validated if the Silhouette’s value is greater than 0.5, for methods with 10 and 30 keywords, the K-means clustering algorithm produces good results.<br />
<br />
<br />
To evaluate the accuracy of the classification system in this paper, the authors use the F-Score. The authors execute 5 times of experiment and use 500 randomly selected research papers for each trial. The following histogram shows the average value of F-Score for the three methods and different numbers of keywords:<br />
<br />
<div align="center">[[File:fig_16_fsvotm.JPG|700px]]</div><br />
<br />
Note that “TFIDF” means method 1, “LDA” means method 2, and “TFIDF-LDA” means method 3. The number 10, 20, and 30 after each method is the number of keywords the method has used.<br />
According to the histogram above, method 3 has the highest F-Score values than the other two methods with different numbers of keywords. Therefore, the classification system is most accurate when using method 3 as it combines the advantages for both method 1 and method 2.<br />
<br />
=Conclusion=<br />
<br />
This paper introduces a classification system that classifies research papers into different topics by using TF-IDF and LDA scheme with K-means clustering algorithm. The experimental results showed that the proposed system could classify the papers with similar subjects according to the keywords extracted from the abstracts of papers. In particular, when a keyword dictionary with both of the keywords extracted from the abstracts and the topics extracted by the LDA scheme is used, the classification system has a better clustering performance and higher F-Score values. The authors emphasized that the system can be implemented efficiently on high-performance computing infrastructure, using industry-standard technologies. This system allows users to search for the papers they want quickly and with the most productivity.<br />
<br />
Furthermore, this classification system might also be used in different types of texts (e.g., documents, tweets, etc.) instead of only classifying research papers.<br />
<br />
=Critique=<br />
<br />
In this paper, DF values are calculated within each partition. This results that for each partition, DF value for a given word will vary and may have an inconsistent result for different partition methods. As mentioned above, there might be a divide by zero problem since some partitions do not have documents containing a given word, but this can be solved by introducing a dummy document as the authors did. Another method that might be better at solving inconsistent results and the divide by zero problems is to have all partitions to communicate with their DF value. Then pass the merged DF value to all partitions to do the final IDF and TF-IDF value. Having all partitions to communicate with the DF value will guarantee a consistent DF value across all partitions and helps avoid a divide by zero problem as words in the keyword dictionary must appear in some documents in the whole collection.<br />
<br />
This paper treated the words in the different parts of a document equivalently, it might perform better if it gives different weights to the same word in different parts. For example, if a word appears in the title of the document, it usually shows it's a main topic of this document so we can put more weight on it to categorize.<br />
<br />
When discussing the potential processing advantages of this classification system for other types of text samples, has the effect of processing mixed samples (text and image or text and video) taken into consideration? IF not, in terms of text classification only, does it have an overwhelming advantage over traditional classification models?<br />
<br />
The preprocessing should also include <math>n</math>-gram tokenization for topic modelling because some topics are inherently two words, such as machine learning where if it is seen separately, it implies different topics.<br />
<br />
This system is very compute-intensive due to the large volumes of dictionaries that can be generated by processing large volumes of data. It would be nice to see how much data HDFS had to process and similarly how much time was saved by using Hadoop for data processing as opposed to centralized approach.<br />
<br />
This system can be improved further in terms of computation times by utilizing other big data framework MapReduce, that can also use HDFS, by parallelizing their computation across multiple nodes for K-means clustering as discussed in (Jin, et al) [5].<br />
<br />
It's not exactly clear what method 3 (TFIDF-LDA) is doing, how is it performing TF-IDF on the topics? Also it seems like the preprocessing step only keeps 10/20/30 top words? This seems like an extremely low number especially in comparison with the LDA which has 10/20/30 topics - what is the reason for so strongly limiting the number of words? It would also be interesting to see if both key words and topics are necessary - an ablation study showing the significance of both would be interesting.<br />
<br />
It is better if the paper has an example with some topics on some research papers. Also it is better if we can visualize the distance between each research paper and the topic names<br />
<br />
I am interested in the first step of the general framework, which is the Paper Crawling step. Many conferences actually require the authors to indicate several key words that best describe a paper. For example, a database paper may have keywords such as "large-scale database management", "information retrieval", and "relational table mining". So in addition to crawling text from abstract, it may be more effective to crawl these keywords directly. Not only does this require less time, these keywords may also lead to better performance than the nouns extracted from the abstract section. I am also slightly concerned about the claim made in the paper that "Our methodologies can be applied to text outside of research papers". Research papers are usually carefully revised and well-structured. Extending the algorithm described in the paper to any kind of free-text could be difficult in practice.<br />
<br />
The paper has very meaningful motivation, since the association of research topics and finding all the relevant previous work is indeed a challenging task at the initials stage of the research. It is often easy to miss a relevant paper published years ago which might be crucial to your own work. However, the classification task that the author tested in this work is almost useless, as the classification is too high-level. The overall scheme of classifying paper between categories like "cloud bigdata" or "IoT privacy" is too general to be meaningful. It is simply classifying the primary field computer science into its direct subfield, while most researchers only work on a niche much narrower than the subfield. Most online paper database, including arxiv, takes care of the subfield and even subsubfield classifcation during the stage of submission, which leaves the author's system with limited applicabilty. What we truly need is an algorithm able to classify and cluster papers based on detailed research topics and methodology. <br />
<br />
It would be better if the author could provide some application or example of the research algorithm in the real world. It would be helpful for the readers to understand the algorithm.<br />
<br />
The summary clearly goes through the model framework well, starting from data-preprocessing, prediction, and testing. It can be enhanced by applying this model to other similar use-cases and how well the prediction goes.<br />
<br />
It will be better if their is a comparison on the BM25 algorithm v.s. TF-IDF, which is usually get compared in IR papers<br />
<br />
The paper misses the details on subjects of research papers used to perform classifications. If the majority of research papers were about one subject, it could potentially produce biased results.<br />
<br />
The paper omits the details of the reason why Method 3 for constructing the Keyword dictionaries requires the least number of k-clusters as method 3 is a combination of methods 1 and 2. It would be of interest to investigate why Method 3 uses so little clusters (in comparison) as it seemed to be the most accurate of the 3 methods. (Also the graph comparing the results could be improved by using a variety of different hues of colours as it is difficult to distinguish some scores such as TFIDF_30 and TFIDF-LDA_30)<br />
<br />
The TF-IDF is interesting as it provides a normalized method to extract the most frequent term contained in the paper, while this method still has spaces of improvements. For example, in some machine learning papers, where special operations have to be done on the datasets, the name of dataset may appear multiple times within the paper. In fact, the main theme of the paper is on the novel machine learning algorithm, which may only be mentioned once. In that case, mis-predictions may occur, and a possible improvement here is to add weights to keywords appearing in each section. i.e the most frequent word in Abstract will have more weights than the most frequent word in Introduction.<br />
<br />
In my opinion, the paper glosses over a few technicalities. First, how does the proposed algorithm deal with subgroups and nested groups. The paper is assuming only one level of sorting, which may work for a sufficiently unique set of paper, but since the problem is meant to be generalized, many papers will have to have multi-level sorts. For example, the category 'machine learning' can be further divided into 'supervised' and 'unsupervised'. Is the algorithm able to handle this or would it create 2 groups (i.e. ML-supervised and ML-unsupervised)? Second, a popular LDA model is available through the gensim package which utilizes relevancy and saliency metrics - how does that factor into the quality of the topics? Third, what is the motivation in using TF-IDF scores for clustering? In my experience, using Word2Vec and BERT has been the industry standard for obtaining vectors to perform clustering on text.<br />
<br />
When working with a larger data set, Spark might be more efficient than Hadoop. Working with natural language, the preprocessing is very important which might significantly determine the accuracy of the results. Therefore, different preprocessing techniques should be used to have a comparison. Lastly, PCA T-SNE might be a good way to visualize the result data.<br />
<br />
The paper illustrated and interpreted a lot of plots, but this summary is not plot heavy. There is a plot showing different elbow methods when method 3 are used. With different number of keywords and topics, the relationship between elbow and number of clusters is changing, the plot looks a lot steeper to be specific. Also, another thing I would like to mention is that, there may be computer science audience reading the summary, it is a good idea to put the graph that talks about map-reduce algorithm in the summary. This way the algorithm is more clearly explained, and people can get an idea of how the parameters function in the algorithm.<br />
<br />
=References=<br />
<br />
[1] Blei DM, el. (2003). Latent Dirichlet allocation. J Mach Learn Res 3:993–1022<br />
<br />
[2] Gil, JM, Kim, SW. (2019). Research paper classification systems based on TF-IDF and LDA schemes. ''Human-centric Computing and Information Sciences'', 9, 30. https://doi.org/10.1186/s13673-019-0192-7<br />
<br />
[3] Liu, S. (2019, January 11). Dirichlet distribution Motivating LDA. Retrieved November 2020, from https://towardsdatascience.com/dirichlet-distribution-a82ab942a879<br />
<br />
[4] Serrano, L. (Director). (2020, March 18). Latent Dirichlet Allocation (Part 1 of 2) [Video file]. Retrieved 2020, from https://www.youtube.com/watch?v=T05t-SqKArY<br />
<br />
[5] Jin, Cui, Yu. (2016). A New Parallelization Method for K-means. https://arxiv.org/ftp/arxiv/papers/1608/1608.06347.pdf<br />
<br />
[6] Kaufman, L., & Rousseeuw, P. J. (2005). Graphical Output Concerning Each Clustering. In Finding groups in data : An introduction to cluster analysis (pp. 84-85). Hoboken, New Jersey: John Wiley & Sons. doi:10.1002/9780470316801</div>Ck6lihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ModelFramework.jpgModelFramework.jpg2020-11-24T17:14:08Z<p>D287zhan: Created page with "File:ModelFramework.jpg"</p>
<hr />
<div>[[File:ModelFramework.jpg]]</div>D287zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Surround_Vehicle_Motion_PredictionSurround Vehicle Motion Prediction2020-11-23T03:27:12Z<p>J27ni: /* Critiques */</p>
<hr />
<div>DROCC: '''Surround Vehicle Motion Prediction Using LSTM-RNN for Motion Planning of Autonomous Vehicles at Multi-Lane Turn Intersections'''<br />
== Presented by == <br />
Mushi Wang, Siyuan Qiu, Yan Yu<br />
<br />
== Introduction ==<br />
<br />
This paper presents a surrounding vehicle motion prediction algorithm for multi-lane turn intersections using a Long Short-Term Memory-based (LSTM) Recurrent Neural Network (RNN). More specifically, it focused on the improvement of in-lane target recognition and achieving human-like acceleration decisions at multi-lane turn intersections by introducing the learning-based target motion predictor and prediction-based motion predictor. A data-driven approach to predict the trajectory and velocity of surrounding vehicles on urban roads at multi-lane turn intersections was described. LSTM architecture, a specific kind of RNN capable of learning long-term dependencies, is designed to manage complex vehicle motions in multi-lane turn intersections. The results show that the forecaster improves the recognition time of the leading vehicle and contributes to the improvement of prediction ability.<br />
<br />
== Previous Work ==<br />
The autonomous vehicle trajectory approaches previously used motion models like Constant Velocity and Constant Acceleration. These models are linear and are only able to handle straight motions. There are curvilinear models such as Constant Turn Rate and Velocity and Constant Turn Rate and Acceleration which handle rotations and more complex motions. Together with these models, Kalman Filter is used to predicting the vehicle trajectory. Kalman filtering is a common technique used in sensor fusion for state estimation that allows the vehicle's state to be predicted while taking into account the uncertainty associated with inputs and measurements. However, the performance of the Kalman Filter in predicting multi-step problems is not that good. Recurrent Neural Network performs significantly better than it. <br />
<br />
There are 3 main challenges to achieving fully autonomous driving on urban roads, which are scene awareness, inferring other drivers’ intentions, and predicting their future motions. Researchers are developing prediction algorithms that can simulate a driver’s intuition to improve safety when autonomous vehicles and human drivers drive together. To predict driver behavior on an urban road, there are 3 categories for the motion prediction model: (1) physics-based; (2) maneuver-based; and (3) interaction-aware. Physics-based models are simple and direct, which only consider the states of prediction vehicles kinematically. The advantage is that it has minimal computational burden among the three types. However, it is impossible to consider the interactions between vehicles. Maneuver-based models consider the driver’s intention and classified them. By predicting the driver maneuver, the future trajectory can be predicted. Identifying similar behaviors in driving is able to infer different drivers' intentions which are stated to improve the prediction accuracy. However, it still an assistant to improve physics-based models. <br />
<br />
Recurrent Neural Network (RNN) is a type of approach proposed to infer driver intention in this paper. Interaction-aware models can reflect interactions between surrounding vehicles, and predict future motions of detected vehicles simultaneously as a scene. While the prediction algorithm is more complex in computation which is often used in offline simulations. As Schulz et al. indicate, interaction models are very difficult to create as "predicting complete trajectories at once is challenging, as one needs to account for multiple hypotheses and long-term interactions between multiple agents" [6].<br />
<br />
== Motivation == <br />
Research results indicate that little research has been dedicated on predicting the trajectory of intersections. Research regarding intersections have previously concentrated only on maneuver-level predictions of activities such as crossing, stopping or making turns. Trajectory level prediction has also been mainly focused on structured environments that enforce classifiable behavior but has not been extensively researched in multi-lane turn intersections. This is important since different drivers have varied driving behaviors as they travel through intersections. <br />
Moreover, public data sets for analyzing driver behavior at intersections are not enough, and these data sets are not easy to collect. A model is needed to predict the various movements of the target around a multi-lane turning intersection. It is very necessary to design a motion predictor that can be used for real-time traffic.<br />
<br />
<center><br />
[[ File:intersection.png |300px]]<br />
</center><br />
<br />
== Framework == <br />
The LSTM-RNN-based motion predictor comprises of three parts: (1) a data encoder; (2) an LSTM-based RNN; and (3) a data decoder depicts the architecture of the surrounding target trajectory predictor. The proposed architecture uses a perception algorithm to estimate the state of surrounding vehicles, which relies on six scanners. The output predicts the state of the surrounding vehicles and is used to determine the expected longitudinal acceleration in the actual traffic at the intersection. The following image gives a visual representation of the model.<br />
<br />
<center>[[Image:Figure1_Yan.png|800px|]]</center><br />
<br />
== LSTM-RNN based motion predictor == <br />
<br />
=== Sensor Outputs ===<br />
<br />
The input of the target perceptions is from the output of the sensors. The data collected in this article uses 6 different sensors with feature fusion to detect traffic in the range up to 100m: 1) LiDAR system outputs: Relative position, heading, velocity, and box size in local coordinates; 2) Around-View Monitoring (AVM) and 3)GPS outputs: acquire lanes, road marker, global position; 4) Gateway engine outputs: precise global position in urban road environment; 5) Micro-Autobox II and 6) a MDPS are used to control and actuate the subject. All data are stored in an industrial PC.<br />
<br />
=== Data ===<br />
Multi-lane turn intersections are the target roads in this paper. The dataset was collected using a human driven Autonomous Vehicle(AV) that was equipped with sensors to track motion the vehicle's surroundings. In addition, the motion sensors they used a front camera, Around-View-Monitor and GPS to acquire the lanes, road markers and global position. The data was collected in the urban roads of Gwanak-gu, Seoul, South Korea. The training model is generated from 484 tracks collected when driving through intersections in real traffic. The previous and subsequent states of a vehicle at a particular time can be extracted. After post-processing, the collected data, a total of 16,660 data samples were generated, including 11,662 training data samples, and 4,998 evaluation data samples.<br />
<br />
=== Motion predictor ===<br />
This article proposes a data-driven method to predict the future movement of surrounding vehicles based on their previous movement, which is the sequential previous motion. The motion predictor based on the LSTM-RNN architecture in this work only uses information collected from sensors on autonomous vehicles, as shown in the figure below. The contribution of the network architecture of this study is that the future state of the target vehicle is used as the input feature for predicting the field of view. <br />
<br />
<br />
<center>[[Image:Figure7b_Yan.png|500px|]]</center><br />
<br />
<br />
==== Network architecture ==== <br />
A RNN is an artificial neural network that is suitable for use with sequential data because it has recurrent connections on its hidden nodes and thus, can retain its state or memory while processing the next input or sequence of inputs. For this reason, RNNs can be used to analyze time-series data where the pattern of the data depends on the time flow. This is an impossible task for traditional artificial neural networks, which assume the inputs are independent of one another. RNNs can also contain feedback loops that allow activations to flow alternately in the loop. <br />
<br />
In line with traditional neural networks, RNNs still suffer from the problem of vanishing gradients. An LSTM avoids this by making errors flow backward without a limit on the number of virtual layers through the use of forget gates. This property prevents errors from increasing or declining over time, which can make the network train improperly. The figure below shows the various layers of the LSTM-RNN and the number of units in each layer. This structure is determined by comparing the accuracy of 72 RNNs, which consist of a combination of four input sets and 18 network configurations.<br />
<br />
<center>[[Image:Figure8_Yan.png|800px|]]</center><br />
<br />
==== Input and output features ==== <br />
In order to apply the motion predictor to the AV in motion, the speed of the data collection vehicle is added to the input sequence. The input sequence consists of relative X/Y position, relative heading angle, speed of surrounding target vehicles, and speed of data collection vehicles. The output sequence is the same as the input sequence, such as relative position, heading, and speed.<br />
<br />
==== Encoder and decoder ==== <br />
In this study, the authors introduced an encoder and decoder that process the input from the sensor and the output from the RNN, respectively. The encoder normalizes each component of the input data to rescale the data to mean 0 and standard deviation 1, while the decoder denormalizes the output data to use the same parameters as in the encoder to scale it back to the actual unit. <br />
==== Sequence length ==== <br />
The sequence length of RNN input and output is another important factor to improve prediction performance. In this study, 5, 10, 15, 20, 25, and 30 steps of 100 millisecond sampling times were compared, and 15 steps showed relatively accurate results, even among candidates The observation time is very short.<br />
<br />
== Motion planning based on surrounding vehicle motion prediction == <br />
In daily driving, experienced drivers will predict possible risks based on observations of surrounding vehicles, and ensure safety by changing behaviors before the risks occur. In order to achieve a human-like motion plan, based on the model predictive control (MPC) method, a prediction-based motion planner for autonomous vehicles is designed, which takes into account the driver’s future behavior. The cost function of the motion planner is determined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
J = & \sum_{k=1}^{N_p} (x(k|t) - x_{ref}(k|t)^T) Q(x(k|t) - x_{ref}(k|t)) +\\<br />
& R \sum_{k=0}^{N_p-1} u(k|t)^2 + R_{\Delta \mu}\sum_{k=0}^{N_p-2} (u(k+1|t) - u(k|t))^2 <br />
\end{split}<br />
\end{equation*}<br />
where <math>k</math> and <math>t</math> are the prediction step index and time index, respectively; <math>x(k|t)</math> and <math>x_{ref} (k|t)</math> are the states and reference of the MPC problem, respectively; <math>x(k|t)</math> is composed of travel distance px and longitudinal velocity vx; <math>x_{ref} (k|t)</math> consists of reference travel distance <math>p_{x,ref}</math> and reference longitudinal velocity <math>v_{x,ref}</math> ; <math>u(k|t)</math> is the control input, which is the longitudinal acceleration command; <math>N_p</math> is the prediction horizon; and Q, R, and <math>R_{\Delta \mu}</math> are the weight matrices for states, input, and input derivative, respectively, and these weight matrices were tuned to obtain control inputs from the proposed controller that were as similar as possible to those of human-driven vehicles. The weight matrices were used for the controller, for the proposed and conventional motion predictors. The dynamic constraints for the controller were defined as follors <br />
$$\begin{bmatrix} p_{x}(k+1|t)\\ v_{x}(k+1|t)\end{bmatrix} = \begin{bmatrix}1 & dt \\0 &1-dt/\tau \end{bmatrix}\begin{bmatrix} p_{x}(k|t)\\ v_{x}(k|t)\end{bmatrix} + \begin{bmatrix} 0 & dt/\tau\end{bmatrix}u(k|t) $$<br />
The constraints of the control input are defined as follows:<br />
\begin{equation*}<br />
\begin{split}<br />
&\mu_{min} \leq \mu(k|t) \leq \mu_{max} \\<br />
&||\mu(k+1|t) - \mu(k|t)|| \leq S<br />
\end{split}<br />
\end{equation*}<br />
Where <math>u_{min}</math>, <math>u_{max}</math>and S are the minimum/maximum control input and maximum slew rate of input respectively.<br />
<br />
Determine the position and speed boundary based on the predicted state:<br />
\begin{equation*}<br />
\begin{split}<br />
& p_{x,max}(k|t) = p_{x,tar}(k|t) - c_{des}(k|t) \quad p_{x,min}(k|t) = 0 \\<br />
& v_{x,max}(k|t) = min(v_{x,ret}(k|t), v_{x,limit}) \quad v_{x,min}(k|t) = 0<br />
\end{split}<br />
\end{equation*}<br />
Where <math>v_{x, limit}</math> are the speed limits of the target vehicle.<br />
<br />
== Prediction performance analysis and application to motion planning ==<br />
=== Accuracy analysis ===<br />
The proposed algorithm was compared with the results from three base algorithms, a path-following model with <br />
constant velocity, a path-following model with traffic flow and a CTRV model.<br />
<br />
We compare those algorithms according to four sorts of errors, The <math>x</math> position error <math>e_{x,T_p}</math>, <br />
<math>y</math> position error <math>e_{y,T_p}</math>, heading error <math>e_{\theta,T_p}</math>, and velocity error <math>e_{v,T_p}</math> where <math>T_p</math> denotes time <math>p</math>. These four errors are defined as follows:<br />
<br />
\begin{equation*}<br />
\begin{split}<br />
e_{x,Tp}=& p_{x,Tp} -\hat {p}_{x,Tp}\\ <br />
e_{y,Tp}=& p_{y,Tp} -\hat {p}_{y,Tp}\\ <br />
e_{\theta,Tp}=& \theta _{Tp} -\hat {\theta }_{Tp}\\ <br />
e_{v,Tp}=& v_{Tp} -\hat {v}_{Tp}<br />
\end{split}<br />
\end{equation*}<br />
<center>[[Image:Figure10.1_YanYu.png|500px|]]</center><br />
<br />
The proposed model shows significantly fewer prediction errors compare to the based algorithms in terms of mean, <br />
standard deviation(STD), and root mean square error(RMSE). Meanwhile, the proposed model exhibits a bell-shaped <br />
curve with a close to zero mean, which indicates that the proposed algorithm's prediction of human divers' <br />
intensions are relatively precise. On the other hand, <math>e_{x,T_p}</math>, <math>e_{y,T_p}</math>, <math>e_{v,T_p}</math> are bounded within <br />
reasonable levels. For instant, the three-sigma range of <math>e_{y,T_p}</math> is within the width of a lane. Therefore, <br />
the proposed algorithm can be precise and maintain safety simultaneously.<br />
<br />
=== Motion planning application ===<br />
==== Case study of a multi-lane left turn scenario ====<br />
The proposed method mimics a human driver better, by simulating a human driver's decision-making process. <br />
In a multi-lane left turn scenario, the proposed algorithm correctly predicted the trajectory of a target vehicle, even when the target vehicle was not following the intersection guideline.<br />
<br />
==== Statistical analysis of motion planning application results ====<br />
The data is analyzed from two perspectives, the time to recognize the in-lane target and the similarity to <br />
human driver commands. In most of cases, the proposed algorithm detects the in-line target no late than based <br />
algorithm. In addition, the proposed algorithm only recognized cases later than the base algorithm did when <br />
the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries. This means <br />
that these cases took place sufficiently beyond the safety distance, and had little influence on determining <br />
the behaviour of the subject vehicle.<br />
<br />
<center>[[Image:Figure11_YanYu.png|500px|]]</center><br />
<br />
In order to compare the similarities between the results from the proposed algorithm and human driving decisions, <br />
this article introduced another type of error, acceleration error <math>a_{x, error} = a_{x, human} - a_{x, cmd}</math>. where <math>a_{x, human}</math><br />
and <math>a_{x, cmd}</math> are the human drivers' acceleration history and the command from the proposed algorithm, <br />
respectively. The proposed algorithm showed more similar results to human drivers’ decisions than the base <br />
algorithm. <math>91.97\%</math> of the acceleration error lies in the region <math>\pm 1 m/s^2</math>. Moreover, the base algorithm <br />
possesses a limited ability to respond to different in-lane target behaviours in traffic flow. Hence, the proposed <br />
model is efficient and safe.<br />
<br />
== Conclusion ==<br />
A surrounding vehicle motion predictor based on an LSTM-RNN at multi-lane turn intersections was developed and its application in an autonomous vehicle was evaluated. The model was trained by using the data captured on the urban road in Seoul in MPC. The evaluation results showed precise prediction accuracy and so the algorithm is safe to be applied on an autonomous vehicle. Also, the comparison with the other three base algorithms (CV/Path, V_flow/Path, and CTRV) revealed the superiority of the proposed algorithm. The evaluation results showed precise prediction accuracy. In addition, the time-to-recognize in-lane targets within the intersection improved significantly over the performance of the base algorithms. The proposed algorithm was compared with human driving data, and it showed similar longitudinal acceleration. The motion predictor can be applied to path planners when AVs travel in unconstructed environments, such as multi-lane turn intersections.<br />
<br />
== Future works ==<br />
This paper has identified several venues for future research, which include:<br />
<br />
1.Developing trajectory prediction algorithms using other machine learning algorithms, such as attention-aware neural networks.<br />
<br />
2.Applying the machine learning-based approach to infer lane change intention at motorways and main roads of urban environments.<br />
<br />
3.Extending the target road of the trajectory predictor, such as roundabouts or uncontrolled intersections, to infer yield intention.<br />
<br />
4.Learning the behavior of surrounding vehicles in real time while automated vehicles drive with real traffic.<br />
<br />
== Critiques ==<br />
The literature review is not sufficient. It should focus more on LSTM, RNN, and the study in different types of roads. Why the LSTM-RNN is used, and the background of the method is not stated clearly. There is a lack of concept so that it is difficult to distinguish between LSTM-RNN based motion predictor and motion planning.<br />
<br />
This is an interesting topic to discuss. This is a major topic for some famous vehicle companies such as Tesla, which now already has a good service called Autopilot to give self-driving and Motion Prediction. This summary can include more diagrams in architecture in the model to give readers a whole view of how the model looks like. Since it is using LSTM-RNN, include some pictures of the LSTM-RNN will be great. I think it will be interesting to discuss more applications by using this method, such as Airplane, boats.<br />
<br />
Autonomous driving is a very hot topic, and training the model with LSTM-RNN is also a meaningful topic to discuss. By the way, it would be an interesting approach to compare the performance of different algorithms or some other traditional motion planning algorithms like KF.<br />
<br />
There are some papers that discussed the accuracy of different models in vehicle predictions, such as Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions[https://arxiv.org/pdf/1908.00219.pdf.] The LSTM didn't show good performance. They increased the accuracy by combing LSTM with an unconstrained model(UM) by adding an additional LSTM layer of size 128 that is used to recursively output positions instead of simultaneously outputting positions for all horizons.<br />
<br />
It may be better to provide the results of experiments to support the efficiency of LSTM-RNN, talk about the prediction of training and test sets, and compared it with other autonomous driving systems that exist in the world.<br />
<br />
The topic of surround vehicle motion prediction is analogous to the topic of autonomous vehicles. An example of an application of these frameworks would be the transportation services industry. Many companies, such as Lyft and Uber, have started testing their own commercial autonomous vehicles.<br />
<br />
It would be really helpful if some visualization or data summary can be provided to understand the content, such as the track of the car movement.<br />
<br />
The model should have been tested in other regions besides just Seoul, as driving behaviors can vary drastically from region to region.<br />
<br />
Understandably, a supervised learning problem should be evaluated on some test dataset. However, supervised learning techniques are inherently ill-suited for general planning problems. The test dataset was obtained from human driving data which is known to be extremely noisy as well as unpredictable when it comes to motion planning. It would be crucial to determine the successes of this paper based on the state-of-the-art reinforcement learning techniques.<br />
<br />
It would be better if the authors compared their method against other SOTA methods. Also one of the reasons motion planning is done using interpretable methods rather than black boxes (such as this model) is because it is hard to see where things go wrong and fix problems with the black box when they occur - this is something the authors should have also discussed.<br />
<br />
A future area of study is to combine other source of information such as signals from Lidar or car side cameras to make a better prediction model.<br />
<br />
It might be interesting and helpful to conduct some training and testing under different weather/environmental conditions, as it could provide more generalization to real-life driving scenarios. For example, foggy weather and evening (low light) conditions might affect the performance of sensors, and rainy weather might require a longer braking distance.<br />
<br />
This paper proposes an interesting, novel model prediction algorithm, using LSTM_RNN. However, since motion prediction in autonomous driving has great real-life impacts, I do believe that the evaluations of the algorithm should be more thorough. For example, more traditional motion planning algorithms such as multi-modal estimation and Kalman filters should be used as benchmarks. Moreover, the experiment results are based on Korean driving conditions only. Eastern and Western drivers can have very different driving patterns, so that should be addressed in the discussion section of the paper as well.<br />
<br />
The paper mentions that in the future, this research plans to learn the real life behaviour of automated vehicles. Seeing a possible improvement in road safety due to this research will be very interesting.<br />
<br />
This predictor is also possible to be applied in the traffic control system.<br />
<br />
This prediction model should consider various conditions that could happen in an intersection. However, normal prediction may not work when there is a traffic jam or in some crowded time periods like rush hours.<br />
<br />
It would be better that the author could provide more comparison between the LSTN-RNN algorithm and other traditional algorithm such as RNN or just LSTM.<br />
<br />
The paper has really good results for what they aimed to achieve. However for the future work it would also be nice to have various climates/weathers to be included in the Seoul dataset. I think it's also important to consider it as different climates/weather (such as snowy roads, or rain) would introduce more noisier data (camera's image processing) and the human drivers behaviour would change as well to adapt to the new environment.<br />
<br />
It would be good to have a future work section to discusses shortage of current algorithms and the possible improvement.<br />
<br />
The summary explains the whole process well, but is missing the small details among the steps. It would be better to explain concepts such as RNN, modelling procedure for first time users.<br />
<br />
This paper presents a nice method, but does not seem particularly well developed. I would have liked to see some more ablations on this particular choice of RNN, as there are more efficient variants such as GRU which show similar performance in other tasks while being more amenable to real-time inference. Furthermore, the multi-model aspect seems slightly ad-hoc, it would have been nice to see a more rigorous formulation similar to seen in some recent work by Zeng et al. from Uber ATG: https://arxiv.org/pdf/2008.06041.pdf.<br />
<br />
The data used for this paper contains driver information exclusively to the urban roads of Gwanak-gu Seoul, hence the data may contain an inherited bias as drivers around the rest of the country, let alone the rest of the world, will have different habits based on different environments. It would be interesting to see if this model can be applied to other cities around the world and exhibit similar results or would there be a need to tune it based off geographic location.<br />
<br />
Since the data is based on urban roads, It would be better to include the details on performance of the model on high traffic area vs low traffic urban area. It would also be interesting to see the performance of the model with many pedestrians.<br />
<br />
While it would be nice to read more on why the authors chose LSTM-RNN, the paper exhibits a potential way to improve autonomous vehicle performance. It would be interesting to see how an army of robots would behave when this paper's method is applied in robotics, since robots' motions also follow a trajectory.<br />
<br />
An interesting topic, but the paper and accompanying summary are missing some details that would improve understandability. With respect to the background of the topic, a more detailed explanation of trajectories in the case of driving would help to better motivate the research. The addition of benchmarks and comparisons to current industry standards (if they are published publicly) would help to contextualize the results of the LSTM-RNN. An area of further study is applying these techniques to different weather situations and driving patterns in different countries. How does this model perform in regions where driver's very loosely follow the laws of the road? Further, could the research be generalized for controlled and uncontrolled turn lanes, especially on roads with higher speed limits?<br />
<br />
In the past, LSTM is a very popular model in natural language. It is interesting to see how it is used for motion prediction. Although the result is not good enough, it is still a good start. In the future, more language models can be used to see how well they can perform to do a comparison such as BiLSTM, Transformer, and Bert.<br />
<br />
LSTM-RNNs are really incredibly useful for sequential learning. The RNN should be able to predict, not only when a vehicle in front of the motion sensors, but it's likelihood of staying in motion vs coming to a sudden stop. There is no details in this review on the running time of the algorithm. In a field where split seconds could be the difference between an accident and a brake, running time cannot be stressed enough. Whether or not it is truly capable of dealing with real time traffic cannot be judged without relevant data and results being shown.<br />
<br />
The prediction could be more flexible and be trained and tested with more complex cases. In real-world, unpredicted situations could always happen, and usual predictions might not be aware of those cases.<br />
<br />
We must notice that the model is not presenting the time data of training and prediction and adaptation to road conditions as it learns the history of conditions. This is an inherently flowed experiment as time is extremely important to how well a model is to predict and assess the movement of a moving vehicle. Even though a desired time gap is given, it is sufficient for a vehicle at moving speed. It does not give what does the model do as it realizes the prediction was wrong and the steps following after as sensors detect new information.<br />
<br />
This is rather late, but I find there is an important concept missing in section 8 of the summary. In the original article, although not explicitly stated, the author compared the difference in recognition timing between two different predictors. It was revealed later that the proposed algorithm only recognized cases later than the base algorithm did when the surrounding target vehicles first appeared beyond the sensors’ region of interest boundaries, which was illustrated using a plot. Although it is not as important as other contents contained in the same section, the plot is still a good evidence to show how the proposed algorithm's performance can be affected by the target vehicles' motion.<br />
<br />
== Reference ==<br />
[1] E. Choi, Crash Factors in Intersection-Related Crashes: An On-Scene Perspective (No. Dot HS 811 366), U.S. DOT Nat. Highway Traffic Safety Admin., Washington, DC, USA, 2010.<br />
<br />
[2] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer, “Generalizable intention prediction of human drivers at intersections,” in Proc. IEEE Intell. Veh. Symp. (IV), Los Angeles, CA, USA, 2017, pp. 1665–1670.<br />
<br />
[3] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi, “Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network,” in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Yokohama, Japan, 2017, pp. 399–404.<br />
<br />
[4] E. Strigel, D. Meissner, F. Seeliger, B. Wilking, and K. Dietmayer, “The Ko-PER intersection laserscanner and video dataset,” in Proc. 17th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Qingdao, China, 2014, pp. 1900–1901.<br />
<br />
[5] Henggang Cui, Thi Nguyen, Fang-Chieh Chou, Tsung-Han Lin, Jeff Schneider, David Bradley, Nemanja Djuric: “Deep Kinematic Models for Kinematically Feasible Vehicle Trajectory Predictions”, 2019; [http://arxiv.org/abs/1908.00219 arXiv:1908.00219].<br />
<br />
[6]Schulz, Jens & Hubmann, Constantin & Morin, Nikolai & Löchner, Julian & Burschka, Darius. (2019). Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios. 10.1109/IVS.2019.8814080.</div>Y87yuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Speech2Face:_Learning_the_Face_Behind_a_VoiceSpeech2Face: Learning the Face Behind a Voice2020-11-23T00:33:07Z<p>J27ni: /* Discussion and Critiques */</p>
<hr />
<div>== Presented by == <br />
Ian Cheung, Russell Parco, Scholar Sun, Jacky Yao, Daniel Zhang<br />
<br />
== Introduction ==<br />
This paper presents a deep neural network architecture called Speech2Face which utilizes millions of Internet/Youtube videos of people speaking to learn the correlation between a voice and the respective face. The model produces facial reconstruction images that capture specific physical attributes learning the correlations between faces and voices, such as a person's age, gender, or ethnicity, through a self-supervised procedure. Namely, the model utilizes the simultaneous occurrence of faces and speech in videos and does not need to model the attributes explicitly. This model explores what types of facial information could be extracted from speech without the constraints of predefined facial characterizations. Without any prior information or accurate classifiers, the reconstructions revealed correlations between craniofacial features and voice in addition to the correlation between dominant features (gender, age, ethnicity, etc.) and voice. The model is evaluated and numerically quantifies how closely the reconstruction, done by the Speech2Face model, resembles the true face images of the respective speakers.<br />
<br />
== Ethical Considerations ==<br />
<br />
The authors note that due to the potential sensitivity of facial information, they have chosen to explicitly state some ethical considerations. The first of which is privacy. The paper states that the method cannot recover the true identity of the face or produce faces of specific individuals, but rather will show average-looking faces. The paper also addresses that there are potential dataset biases that exist for the voice-face correlations, thus the faces may not accurately represent the intended population. The paper recommends that any further investigation or practical use of this technology will be tested to represent the intended population and also if the data does not reflect this, more representative data should be broadly collected. Finally, it acknowledges that the model uses demographic categories such as "White" and "Asian" that are defined by a commercial face attribute classifier.<br />
<br />
== Previous Work ==<br />
With visual and audio signals being so dominant and accessible in our daily life, there has been huge interest in how visual and audio perceptions interact with each other. Arandjelovic and Zisserman [1] leveraged the existing database of mp4 files to learn a generic audio representation to classify whether a video frame and an audio clip correspond to each other. These learned audio-visual representations have been used in a variety of setting, including cross-modal retrieval, sound source localization and sound source separation. This also paved the path for specifically studying the association between faces and voices of agents in the field of computer vision. In particular, cross-modal signals extracted from faces and voices have been proposed as a binary or multi-task classification task and there have been some promising results. Studies have been able to identify active speakers of a video, separate speech from multiple concurrent sources, predict lip motion from speech, and even learn the emotion of the agents based on their voices. Aytar et al. [6] proposed a student-teacher training procedure in which a well established visual recognition model was used to transfer the knowledge obtained in the visual modality to the sound modality, using unlabeled videos.<br />
<br />
Recently, various methods have been suggested to use various audio signals to reconstruct visual information, where the reconstructed subject is subjected to a priori. Notably, Duarte et al. [2] were able to synthesize the exact face images and expression of an agent from speech using a GAN model. A generative adversarial network (GAN) model is one that uses a generator to produce seemingly possible data for training and a discriminator that identifies if the training data is fabricated by the generator or if it is real [7]. This paper instead hopes to recover the dominant and generic facial structure from a speech.<br />
<br />
== Motivation ==<br />
It seems to be a common trait among humans to imagine what some people look like when we hear their voices before we see what they look like. There is a strong connection between speech and appearance, which is a direct result of the factors that affect speech, including age, gender, and facial bone structure. In addition, other voice-appearance correlations stem from the way we talk: language, accent, speed, pronunciations, etc. These properties of speech are often common among many different nationalities and cultures, which can, in turn, translate to common physical features among different voices. Namely, from an input audio segment of a person speaking, the method would reconstruct an image of the person’s face in a canonical form (frontal-facing, neutral expression). The goal was to study to what extent people can infer how someone else looks from the way they talk. Rather than predicting a recognizable image of the exact face, the authors are more interested in capturing the dominant facial features.<br />
<br />
== Model Architecture == <br />
<br />
'''Speech2Face model and training pipeline'''<br />
<br />
[[File:ModelFramework.jpg|center]]<br />
<br />
<div style="text-align:center;"> Figure 1. '''Speech2Face model and training pipeline''' </div><br />
<br />
<br />
<br />
The Speech2Face Model consists of two parts - a voice encoder which takes in a spectrogram of speech as input and outputs low dimensional face features, and a face decoder which takes in face features as input and outputs a normalized image of a face (neutral expression, looking forward). Figure 1 gives a visual representation of the pipeline of the entire model, from video input to a recognizable face. The combination of the voice encoder and face decoder results are combined to form an image. The variability in facial expressions, head positions and lighting conditions of the face images creates a challenge to both the design and training of the Speech2Face model. It needs a model to figure out many irrelevant variations in the data, and to implicitly extract important internal representations of faces. To avoid this problem the model is trained to first regress to a low dimensional intermediate representation of the face. <br />
<br />
'''Face Decoder''' <br />
The face decoder itself was taken from previous work The VGG-Face model by Cole et al [3] (a face recognition model that is pretrained on a largescale face database [5] is used to extract a 4069-D face feature from the penultimate layer of the network.) and will not be explored in great detail here, but in essence the facenet model is combined with a single multilayer perceptron layer, the result of which is passed through a convolutional neural network to determine the texture of the image, and a multilayer perception to determine the landmark locations. The face decoder kept the VGG-Face model's dimension and weights. The weights were also trained separately and remained fixed during the voice encoder training. <br />
<br />
'''Voice Encoder Architecture''' <br />
<br />
[[File:VoiceEncoderArch.JPG|center]]<br />
<br />
<div style="text-align:center;"> Table 1: '''Voice encoder architecture''' </div><br />
<br />
<br />
<br />
The voice encoder itself is a convolutional neural network, which transforms the input spectrogram into pseudo face features. The exact architecture is given in Table 1. The model alternates between convolution, ReLU, batch normalization layers, and layers of max-pooling. In each max-pooling layer, pooling is only done along the temporal dimension of the data. This is to ensure that the frequency, an important factor in determining vocal characteristics such as tone, is preserved. In the final pooling layer, an average pooling is applied along the temporal dimension. This allows the model to aggregate information over time and allows the model to be used for input speeches of varying lengths. Two fully connected layers at the end are used to return a 4096-dimensional facial feature output.<br />
<br />
'''Training'''<br />
<br />
The AVSpeech dataset, a large-scale audio-visual dataset is used for the training. AVSpeech dataset is comprised of millions of video segments from Youtube with over 100,000 different people. The training data is composed of educational videos and does not provide an accurate representation of the global population, which will clearly affect the model. Also note that facial features that are irrelevant to speech, like hair color, may be predicted by the model. From each video, a 224x224 pixels image of the face was passed through the face decoder to compute a facial feature vector. Combined with a spectrogram of the audio, a training and test set of 1.7 and 0.15 million entries respectively were constructed.<br />
<br />
The voice encoder is trained in a self-supervised manner. A frame that contains the face is extracted from each video and then inputted to the VGG-Face model to extract the feature vector <math>v_f</math>, the 4096-dimensional facial feature vector given by the face decoder on a single frame from the input video. This provides the supervision signal for the voice-encoder. The feature <math>v_s</math>, the 4096 dimensional facial feature vector from the voice encoder, is trained to predict <math>v_f</math>.<br />
<br />
In order to train this model, a proper loss function must be defined. The L1 norm of the difference between <math>v_s</math> and <math>v_f</math>, given by <math>||v_f - v_s||_1</math>, may seem like a suitable loss function, but in actuality results in unstable results and long training times. Figure 2, below, shows the difference in predicted facial features given by <math>||v_f - v_s||_1</math> and the following loss. Based on the work of Castrejon et al. [4], a loss function is used which penalizes the differences in the last layer of the VGG-Face model <math>f_{VGG}</math>: <math> \mathbb{R}^{4096} \to \mathbb{R}^{2622}</math> and the first layer of face decoder <math>f_{dec}</math> : <math> \mathbb{R}^{4096} \to \mathbb{R}^{1000}</math>. The final loss function is given by: $$L_{total} = ||f_{dec}(v_f) - f_{dec}(v_s)|| + \lambda_1||\frac{v_f}{||v_f||} - \frac{v_s}{||v_s||}||^2_2 + \lambda_2 L_{distill}(f_{VGG}(v_f), f_{VGG}(v_s))$$<br />
This loss penalizes on both the normalized Euclidean distance between the 2 facial feature vectors and the knowledge distillation loss, which is given by: $$L_{distill}(a,b) = -\sum_ip_{(i)}(a)\text{log}p_{(i)}(b)$$ $$p_{(i)}(a) = \frac{\text{exp}(a_i/T)}{\sum_j \text{exp}(a_j/T)}$$ Knowledge distillation is used as an alternative to Cross-Entropy. By recommendation of Cole et al [3], <math> T = 2 </math> was used to ensure a smooth activation. <math>\lambda_1 = 0.025</math> and <math>\lambda_2 = 200</math> were chosen so that magnitude of the gradient of each term with respect to <math>v_s</math> are of similar scale at the <math>1000^{th}</math> iteration.<br />
<br />
<center><br />
[[File:L1vsTotalLoss.png | 700px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 2: '''Qualitative results on the AVSpeech test set''' </div><br />
<br />
'''Implementation Details'''<br />
<br />
6 seconds of audio was used to compute the spectogram by taking a Short-time Fourier transform with Hann window of 25mm, hop length of 10ms, and 512 FFT frequenct bands. A CNN-based face detector from Dlib was used to crop the face regions from the frames. The VGG-face features are computed from the resized faces and together with the spectrogram was used for training. There were a total of 1.7 and 0.15 million spectra-feature pairs.<br />
<br />
== Results ==<br />
<br />
'''Confusion Matrix and Dataset statistics'''<br />
<br />
<center><br />
[[File:Confusionmatrix.png| 600px]]<br />
</center><br />
<br />
<div style="text-align:center;"> Figure 3. '''Facial attribute evaluation''' </div><br />
<br />
<br />
<br />
In order to determine the similarity between the generated images and the ground truth, a commercial service known as Face++ which classifies faces for distinct attributes (such as gender, ethnicity, etc) was used. Figure 3 gives a confusion matrix based on gender, ethnicity, and age. By examining these matrices, it is seen that the Speech2Face model performs very well on gender, only misclassifying 6% of the time. Similarly, the model performs fairly well on ethnicities, especially with white or Asian faces. Although the model performs worse on black and Indian faces, that can be attributed to the vastly unbalanced data, where 50% of the data represented a white face, and 80% represented a white or Asian face. <br />
<br />
'''Feature Similarity'''<br />
<br />
<center><br />
[[File:FeatSim.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 2. '''Feature similarity''' </div><br />
<br />
<br />
<br />
Another examination of the result is the similarity of features predicted by the Speech2Face model. The cosine, L1, and L2 distance between the facial feature vector produced by the model and the true facial feature vector from the face decoder were computed, and presented, above, in Table 2. A comparison of facial similarity was also done based on the length of audio input. From the table, it is evident that the 6-second audio produced a lower cosine, L1, and L2 distance, resulting in a facial feature vector that is closer to the ground truth. <br />
<br />
'''S2F -> Face retrieval performance'''<br />
<br />
<center><br />
[[File: Retrieval.JPG]]<br />
</center><br />
<br />
<div style="text-align:center;"> Table 3. '''S2F -> Face retrieval performance''' </div><br />
<br />
<br />
<br />
The performance of the model was also examined on how well it could produce the original image. The R@K metric, also known as retrieval performance by recall at K, measures the probability that the K closest images to the model output includes the correct image of the speaker's face. A higher R@K score indicates better performance. From Table 3, above, we see that both the 3-second and 6-second audio showed significant improvement over random chance, with the 6-second audio performing slightly better.<br />
<br />
'''Additional Observations''' <br />
<br />
Ablation studies were carried out to test the effect of audio duration and batch normalization. It was found that the duration of input audio during the training stage had little effect on convergence speed (comparing 3 and 6-second speech segments), while in the test stage longer input speech yields improvement in reconstruction quality. With respect to batch normalization (BN), it was found that without BN reconstructed faces would converge to an average face, while the inclusion of BN led to results which contained much richer facial features.<br />
<br />
== Conclusion ==<br />
The report presented a novel study of face reconstruction from audio recordings of a person speaking. The model was demonstrated to be able to predict plausible face reconstructions with similar facial features to real images of the person speaking. We have proved that our method can predict the attributes that the real face of the face is consistent with the real image. By directly reconstructing the face from this cross-modal feature space, we visually verify the existence of cross-mode biometrics. The problem was addressed by learning to align the feature space of speech to that of a pretrained face decoder. The model was trained on millions of videos of people speaking from YouTube. The model was then evaluated by comparing the reconstructed faces with a commercial facial detection service. The authors believe that facial reconstruction allows a more comprehensive view of voice-face correlation compared to predicting individual features, which may lead to new research opportunities and applications.<br />
<br />
== Discussion and Critiques ==<br />
<br />
There is evidence that the results of the model may be heavily influenced by external factors:<br />
<br />
1. Their method of sampling random YouTube videos resulted in an unbalanced sample in terms of ethnicity. Over half of the samples were white. We also saw a large bias in the model's prediction of ethnicity towards white. The bias in the results shows that the model may be overfitting the training data and puts into question what the performance of the model would be when trained and tested on a balanced dataset. Figure (11) highlights this shortcoming: The same man heard speaking in either English or Chinese was predicted to have a "white" appearance or an "asian" appearance respectively.<br />
<br />
2. The model was shown to infer different face features based on language. This puts into question how heavily the model depends on the spoken language. The paper mentioned the quality of face reconstruction may be affected by uncommon languages, where English is the most popular language on Youtube(training set). Testing a more controlled sample where all speech recording was of the same language may help address this concern to determine the model's reliance on spoken language.<br />
<br />
3. The evaluation of the result is also highly dependent on the Face++ classifiers. Since they compare the age, gender, and ethnicity by running the Face++ classifiers on the original images and the reconstructions to evaluate their model, the model that they create can only be as good as the one they are using to evaluate it. Therefore, any limitations of the Face++ classifier may become a limitation of Speech2Face and may result in a compounding effect on the miss-classification rate.<br />
<br />
4. Figure 4.b shows the AVSpeech dataset statistics. However, it doesn't show the statistics about speakers' ethnicity and the language of the video. If we train the model with a more comprehensive dataset that includes enough Asian/Indian English speakers and native language speakers will this increase the accuracy?<br />
<br />
5. One concern about the source of the training data, i.e. the Youtube videos, is that resolution varies a lot since the videos are randomly selected. That may be the reason why the proposed model performs badly on some certain features. For example, it is hard to tell the age when the resolution is bad because the wrinkles on the face are neglected.<br />
<br />
6. The topic of this project is very interesting, but I highly doubt this model will be practical in real-world problems. Because there are many factors to affect a person's sound in a real-world environment. Sounds such as phone clock, TV, car horn and so on. These sounds will decrease the accuracy of the predicted result of the model.<br />
<br />
7. A lot of information can be obtained from someone's voice, this can potentially be useful for detective work and crime scene investigation. In our world of increasing surveillance, public voice recording is quite common and we can reconstruct images of potential suspects based on their voice. In order for this to be achieved, the model has to be thoroughly trained and tested to avoid false positives as it could have a highly destructive outcome for a falsely convicted suspect.<br />
<br />
8. This is a very interesting topic, and this summary has a good structure for readers. Since this model uses Youtube to train model, but I think one problem is that most of the YouTubers are adult, and many additional reasons make this dataset highly unbalanced. What is more, some people may have a baby voice, this also could affect the performance of the model. But overall, this is a meaningful topic, it might help police to locate the suspects. So it might be interesting to apply this to the police.<br />
<br />
9. In addition, it seems very unlikely that any results coming from this model would ever be held in regard even remotely close to being admissible in court to identify a person of interest until the results are improved and the model can be shown to work in real-world applications. Otherwise, there seems to be very little use for such technology and it could have negative impacts on people if they were to be depicted in an unflattering way by the model based on their voice.<br />
<br />
10. Using voice as a factor of constructing the face is a good idea, but it seems like the data they have will have lots of noise and bias. The voice of a video might not come from the person in the video. There are so many YouTubers adjusting their voices before uploading their video and it's really hard to know whether they adjust their voice. Also, most YouTubers are adults so the model cannot have enough training samples about teenagers and kids.<br />
<br />
11. It would be interesting to see how the performance changes with different face encoding sizes (instead of just 4096-D) and also difference face models (encoder/decoders) to see if better performance can be achieved. Also given that the dataset used was unbalanced, was the dataset used to train the face model the same dataset? or was a different dataset used (the model was pretrained). This could affect the performance of the model as well.<br />
<br />
12. The audio input is transformed into a spectrogram before being used for training. They use STFT with a Hann window of 25 mm, a hop length of 10 ms, and 512 FFT frequency bands. They cite this method from a paper that focuses on speech separation, not speech classification. So, it would be interesting to see if there is a better way to do STFT, possibly with different hyperparameters (eg. different windowing, different number of bands), or if another type of transform (eg. wavelet transform) would have better results.<br />
<br />
13. A easy way to get somewhat balanced data is to duplicate the data that are fewer.<br />
<br />
14. This problem is interesting but is hard to generalize. This algorithm didn't account for other genders and mixed-race. In addition, the face recognition software Face++ introduces bias which can carry forward to Speech2Face algorithm. Face recognition algorithms are known to have higher error rates classifying darker-skinned individuals. Thus, it'll be tough to apply it to real-life scenarios like identifying suspects.<br />
<br />
15. This experiment raises a lot of ethical complications when it comes to possible applications in the real world. Even if this model was highly accurate, the implications of being able to discern a person's racial ethnicity, skin tone, etc. based solely on there voice could play in to inherent biases in the application user and this may end up being an issue that needs to be combatted in future research in this area. Another possible issue is that many people will change their intonation or vocal features based on the context (I'll likely have a different voice pattern in a job interview in terms of projection, intonation, etc. than if I was casually chatting/mumbling with a friend while playing video games for example).<br />
<br />
16. Overall a very interesting topic. I want to talk about the technical challenged raised by using the AVSSpeech dataset for training. The paper acknowledges that the AVSSpeech is unbalanced, and 80% of the data are white and Asians. It also says in the results section that "Our model does not perform on other races due to the imbalance in data". There does not seem to be any effort made in balancing the data. I think that there are definitely some data processing techniques that can be used (filtering, data augmentation, etc) to address the class imbalance problem. Not seeing any of these in the paper is a bit disappointing. Another issue I have noticed is that the model aims to predict an average-looking face from certain gender/racial group from voice input, due to ethical considerations. If we cannot reveal the identify of a person, why don't we predict the gender and race directly? Giving an average-looking face does not seem to be the most helpful.<br />
<br />
17. Very interesting research paper to be studied and the main objective was also interesting. This research leads to open question which can be applied to another application such as predicting person's face using voice and can be used in more advanced way. The only risk is how the data is obtained from YouTube where data is not consistent.<br />
<br />
18. The essay uses millions of natural videos of people speaking to find the correlation between face and voice. Since face and voice are commonly used as the identity of a person, there are many possible research opportunities and applications about improving voice and face unlock.<br />
<br />
19. It would be better to have a future work section to discuss the current shortage and explore the possible improvement and applications in the future.<br />
<br />
20. While the idea behind Speech2Face is interesting, ethnic profiling is a huge concern and it can further lead to racial discrimination, racism etc. Developers must put more care and thought into applying Speech2Face in tech before deploying the products.<br />
<br />
21. It would be helpful if the author could explore the different applications of this project in real life. Speech2face can be helpful during criminal investigation and essentially in scenarios when someone's picture is missing and only voice is available. It would also be helpful if the author could state the importance and need of such kind project in the society.<br />
<br />
22. The authors mention that they use the AVSpeech dataset for both training and testing but do not talk about how they split the data. It is possible that the same speakers were used in the training and testing data and so the model is able to recreate a face simply by matching the observed face to the observed audio. This would explain the striking example images shown in the paper.<br />
<br />
23. Another interesting application of this research is automated speech or facial animation at scale or in multiple languages. The cutting-edge automated facial animation solution provided by JALI Research Inc is applied in Cyberpunk 2077.<br />
<br />
24. It would be interesting to know the model can predict a similar face when one is speaking different languages. A person who is speaking multiple languages can have different tones and accents depending on a language that they speak.<br />
<br />
25. The results are actually amazing for the introduction of Speech2Face. As others have mentioned, the researchers might have used a biased dataset of YouTube videos favoring certain ethnicities and their accents and dialects. Thus, it would be nice to also see the data distribution. Additionally it would be nice to see how their model reacts to people who are able to speak multiple languages and see how well Speech2Face generalizes different language pronunciations of one person.<br />
<br />
26. The paper introduces Speech2Face and it definitely is one of the major areas of researches in the future. In the paper, the confusion matrix indicates that the model tends to misclassify based on the age of the speaking person. Specifically, the model tends to misclassify between 40-70. It would be interesting to see if the model could improve on its bottleneck by training on more speeches by the age group 40-70.<br />
<br />
27. An interesting topic, and as others have mentioned, has many ethical considerations and implications. Particularly in regions where call-recording is permitted, there is dangerous potential to for the technology to be misused to identify and target individuals. It would also be interesting to get a more in depth exploration into how the language spoken and accents have a bias. For example, if a person speaks with a strong British accent, are they classified as white? Particularly for Spanish-speakers, they vary greatly with respect to their skin colour and features, how well does the algorithm work on these individuals. A last nit-pick is the labelling used (i.e. Asian, White, Indian, Black) as this is not accurate since Indians, and moreover South Asians, fall under Asian as well.<br />
<br />
28. This topic is quite interesting and it could have great contribution in terms of criminal fight. But as the result, the accuracy is essential. There is still the space for much improvements since to tell a person's face by his/her voice is pretty hard since there are many factors such as oral structure, the language environment and even personality. Great bias could be resulted from these unpredictable factors.<br />
<br />
29. This is an interesting topic and could have great use in terms of finding criminals or people when having their voice recorded. However, the voice recording might be noisy and some might include voices of multiple people. It could consider ways to eliminate those factors that might effect the accuracy of the face generation.<br />
<br />
30. Most contents described in the paper are very useful. However, YouTube might not be a good enough data source since there are fewer labels to classify. Perhaps, after generating the model, the transfer learning could be done based on Facebook's videos in order to solve the imbalanced problem.<br />
<br />
31. This topic is really incredibly interesting and the writers should commend themselves on a job well done. However, Youtube, not only is it an ethnically skewed dataset, but has a non-negligible number of creators who use voice modifiers, auto tune, or a number of other things to change the pitch of their voices, which may lead to the significantly more errors in practical applications. A better dataset to be used could be Skype video calls, or a class room study. Also, judging from the way the model does it's prediction, it seems very prone to overfitting on the dataset, and will not generalize well, since pitch and sound are both incredibly variable across humans.<br />
<br />
32. One thing to notice is that the training data used to train the model is downloaded from Youtube, which may be a good site to retrieve a large amount of data. While it allows the possibility that the voices retrieved does not match with the people who made those sounds, claimed by the video. If that is the case, those records will become dirty data, and needs to be cleaned before training the model. Otherwise, there will be some huge misclassifications because some of the training data is not making sense. One way I can think of to improve this problem is that we may train multiple models on different subsets of the original dataset, and combine the results of all the models by taking weighted average.<br />
<br />
33. Predicting appearance with sound is a very imaginative research direction. But the author did not explain how to exclude environmental factors in data preprocessing, such as light intensity, facial dress, facial wounds, etc. In the training data set, different sound and image resolutions also affect the effectiveness of the model. The author needs more robustness tests to exclude these factors.<br />
<br />
34. The origin of data is primarily Western sourced. Even if taken into account the other ethnicities being selected, the resolution is decompression/compression quality is not maintained for an uploaded platform such as YouTube. Furthermore, since ethnicities are a factor in the model, people not speaking in their mother tongue might provide a very difficult sample for them to match since people tend to sound a certain way (volume, indentation, confidence, tone etc) which can all affect the quality of a recording. <br />
<br />
35. With this model's predictions, can it be used by law enforcement to make preemptive identification (with facial recognition) to capture/identify criminals (on surveillance mission)?<br />
<br />
36. Interesting article. I wonder how it will result if a participant speaks with a higher/lower pitch, or tries to imitate someone else intentionally.<br />
<br />
37. The result section can be expanded a lot more, the original paper studied something really interesting. For example, Craniofacial features were utilized to capture ratios and distances in the face, and how to regenerate a cartoon version of a particular face base on existing face images etc.<br />
<br />
== References ==<br />
[1] R. Arandjelovic and A. Zisserman. Look, listen and learn. In<br />
IEEE International Conference on Computer Vision (ICCV),<br />
2017.<br />
<br />
[2] A. Duarte, F. Roldan, M. Tubau, J. Escur, S. Pascual, A. Salvador, E. Mohedano, K. McGuinness, J. Torres, and X. Giroi-Nieto. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In IEEE International<br />
Conference on Acoustics, Speech and Signal Processing<br />
(ICASSP), 2019.<br />
<br />
[3] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.<br />
<br />
[4] L. Castrejon, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba. Learning aligned cross-modal representations from weakly aligned data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.<br />
<br />
[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference (BMVC), 2015.<br />
<br />
[6] “Overview of GAN Structure | Generative Adversarial Networks,” ''Google Developers'', 2019. [Online]. Available: https://developers.google.com/machine-learning/gan/gan_structure. [Accessed: 02-Dec-2020].</div>D287zhanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Point-of-Interest_Recommendation:_Exploiting_Self-Attentive_Autoencoders_with_Neighbor-Aware_InfluencePoint-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence2020-11-23T00:25:07Z<p>J27ni: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Guanting(Tony) Pan, Zaiwei(Zara) Zhang, Haocheng(Beren) Chang<br />
<br />
== Introduction == <br />
With the development of mobile devices and location-acquisition technologies, accessing real-time location information is becoming easier and more efficient. Precisely because of this development, Location-based Social Networks (LBSNs) like Yelp and Foursquare have become an important part of human life. People can share their experiences in locations, such as restaurants and parks, on the Internet. These locations can be seen as a Point-of-Interest (POI) in software such as Maps on our phone. These large amounts of user-POI interaction data can be used to provide a service, which is called personalized POI recommendation, to give recommendations to users of a location they might be interested in. These large amounts of data can be used to train a model through Machine Learning methods(i.e., Classification, Clustering, etc.) to predict a POI that users might be interested in. The POI recommendation system still faces some challenging issues: (1) the difficulty of modeling complex user-POI interactions from sparse implicit feedback; (2) the difficulty of incorporating geographic background information. In order to meet these challenges, this paper will introduce a novel autoencoder-based model to learn non-linear user-POI relations, which is called SAE-NAD. SAE stands for the self-attentive encoder, while NAD stands for the neighbor-aware decoder. An autoencoder is an unsupervised learning technique that we implement in the neural network model for representation learning, meaning that our neural network will contain a "bottleneck" layer that produces a compressed knowledge representation of the original input. This method will utilize machine learning knowledge that we learned in this course.<br />
<br />
== Previous Work == <br />
<br />
In the previous works, the method is just equally treating users checked in POIs. The drawback of equally treating users checked in POIs is that valuable information about the similarity between users is not utilized, thus reducing such recommenders' power. However, the SAE adaptively differentiates user preference degrees in multiple aspects.<br />
<br />
Previous methods mainly used a process called collaborative filtering, which can be divided into memory-based methods and model-based methods. Collaborative filtering makes recommendations from historical user-system interactions like user’s feedback or browsing history. Content-based and hybrid recommendation systems are also commonly used. Content-based recommendation system compares users’ information like texts, videos, and images. Hybrid model combines two or more recommendation systems. Memory-based methods predict a user preference based on a weighted average of similar users or POIs. Model-based methods use user-POI data to build a model for generating recommendations. Both methods typically model user preferences linearly, which may be an oversimplification.<br />
<br />
There are some other personalized POI recommendation methods that can be used. Some famous software (e.g., Netflix) uses model-based methods that are built on matrix factorization (MF). For example, ranked based Geographical Factorization Method in [1] adopted weighted regularized MF to serve people on POI. Machine learning is popular in this area. POI recommendation is an important topic in the domain of recommender systems [4]. This paper also described related work in Personalized location recommendation and attention mechanism in the recommendation. The recent studies on location recommendation methods using historical data (check-ins, comments, etc.)<br />
<br />
== Motivation == <br />
This paper reviews encoder and decoder. A single hidden-layer autoencoder is an unsupervised neural network, which consists of two parts: an encoder and a decoder. The encoder has one activation function that maps the input data to the latent space. The decoder also has one activation function mapping the representations in the latent space to the reconstruction space. And here is the formula:<br />
<br />
[[File: formula.png|center]](Note: a is the activation function)<br />
<br />
The proposed method uses a two-layer neural network to compute the score matrix in the architecture of the SAE. The NAD adopts the RBF kernel to make checked-in POIs exert more influence on nearby unvisited POIs. To train this model, Network training is required.<br />
<br />
This paper will use the datasets in the real world, which are from Gowalla[2], Foursquare [3], and Yelp[3]. These datasets would be used to train by using the method introduced in this paper and compare the performance of SAE-NAD with other POI recommendation methods. Three groups of methods are used to compare with the proposed method, which are traditional MF methods for implicit feedback, Classical POI recommendation methods, and Deep learning-based methods. Specifically, the Deep learning-based methods contain a DeepAE which is a three-hidden-layer autoencoder with a weighted loss function, we can connect this to the material in this course.<br />
<br />
Autoencoders (AE), due to their ability to represent complex data, have become very useful in recommendation systems. The primary reason it is used is because with deep neural network and with non-linear activation function it effectively captures the non-linear and non-trivial relationships between users and POI's.<br />
<br />
== Methodology == <br />
<br />
=== Notations ===<br />
<br />
Here are the notations used in this paper. It will be helpful when trying to understand the structure and equations in the algorithm.<br />
[[File:notations.JPG|500px|x300px|center]]<br />
<br />
=== Structure ===<br />
<br />
The structure of the network in this paper includes a self-attentive encoder as the input layer(yellow), and a neighbor-aware decoder as the output layer(green).<br />
<br />
[[File:1.JPG|1200px|x600px]]<br />
<br />
=== Self-Attentive Encoder ===<br />
<br />
The self-attentive encoder is the input layer. It transfers the preference vector <math>x_u</math> to hidden representation <math>A_u</math> using weight matrix <math>W^1</math> and the activation function <math>softmax</math> and <math>tanh</math>. The 0's and 1's in <math>x_u</math> indicates whether the user has been to a certain POI. The weight matrix <math>W_a</math> assigns different weights on various features of POIs. <math>A_u</math> is the importance score matrix, where each column is a POI, and each row is the importance level of the <math>n</math> checked in POIs in one aspect.<br />
<br />
[[File:encoder.JPG|center]]<br />
<br />
<math>\mathbf{Z}_u^{(1)} = \mathbf{A}_u \cdot (\mathbf{W}^{(1)}[L_u])^\top</math> is the multiplication of the importance score matrix and the POI embeddings, and represents the user from <math>d_a</math> aspects. <math>\mathbf{Z}_u^{(1)}</math> is a <math>d_a \times H_1</math> matrix. The aggregation layer then combines multiple aspects that represent the user into one, so that it fits into the Neighbor-Aware decoder: <math>\mathbf{z}_u^{(1)} = a_t(\mathbf{Z}_u^{(1)\top}\mathbf{w}_t + \mathbf{b}_t)</math>.<br />
<br />
=== Neighbor-Aware Decoder ===<br />
<br />
POI recommendation uses the geographical clustering phenomenon, which increases the weight of the unvisited POIs that surround the visited POIs. Also, an aggregation layer is added to the network to aggregate users’ representations from different aspects into one aspect. This means that a person who has visited a location is very likely to return to this location again in the future, so the user is recommended POIs surrounding this area. An example would be someone who has been to the UW plaza and bought Lazeez, is very likely to return to the plaza, therefore the person is recommended to try Mr. Panino's Beijing House.<br />
<br />
[[File:decoder.JPG|center]]<br />
<br />
=== Objective Function ===<br />
<br />
By minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with backpropagation. After that, the training is complete. By minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with backpropagation.<br />
<br />
[[File:objective_function.JPG|center]]<br />
<br />
== Comparative analysis ==<br />
<br />
=== Metrics introduction ===<br />
To obtain a comprehensive evaluation of the effectiveness of the model, the authors performed a thorough comparison between the proposed model and the existing major POI recommendation methods. These methods can be further broken down into three categories: traditional matrix factorization methods for implicit feedback, classical POI recommendation methods, and deep learning-based methods. Here, three key evaluation metrics were introduced as Precison@k, Recall@k, and MAP@k. Through comparing all models on three datasets using the above metrics, it is concluded that the proposed model achieved the best performance.<br />
<br />
To better understand the comparison results, it is critical to understand the meanings behind each evaluation metric. Suppose the proposed model generated k recommended POIs for the user. The first metric, Precison@k, measures the percentage of the recommended POIs which the user has visited. Recall@k is also associated with the user’s behavior. However, it will measure the percentage of recommended POIs in all POIs which have been visited by the user. Lastly, MAP@k represents the mean average precision at k, where average precision is the average of precision values at all k ranks, where relevant POIs are found. They are formally defined as follows, <br />
$$ Recall@k=\frac{1}{M} \sum_{i=1}^M \frac{S_i(k) \cap T_i}{|T_i|} \\<br />
Precision@k = \frac{1}{M} \sum_{i=1}^M \frac{S_i(k) \cap T_i}{k} \\<br />
MAP@k = \frac{1}{M} \sum_{i=1}^M \frac{\sum_{j=1}^k p(j) \times rel(j)}{|T_i|} \\$$<br />
<br />
where <math>S_i(k) </math> us a set of top- <math>k </math> unvisited locations recommended to user <math> i </math> excluding those locations in the training, and <math>T_i </math> is a set of locations that are visited by user <math> i</math> in the testing. <math> p(j)</math> is the precision of a cut-off rank list from 1 to <math> j</math>, and <math>rel(j) </math> is an indicator function that equals to 1 if the location is visited in the testing, otherwise equals to 0.<br />
<br />
=== Model Comparison ===<br />
Among all models in the comparison group, RankGeoFM, IRNMF, and PACE produced the best results. Nonetheless, these models are still incomparable to our proposed model. The reasons are explained in details as follows:<br />
<br />
Both RankGeoFM and IRNMF incorporate geographical influence into their ranking models, which is significant for generating POI recommendations. However, they are not capable of capturing non-linear interactions between users and POIs. In comparison, the proposed model, while incorporating geographical influence, adopts a deep neural structure which enables it to measure non-linear and complex interactions. As a result, it outperforms the two methods in the comparison group.<br />
<br />
Moreover, compared to PACE, which is a deep learning-based method, the proposed model offers a more precise measurement of geographical influence. Though PACE is able to capture complex interactions, it models the geographical influence by a context graph, which fails to incorporate user reachability into the modeling process. In contrast, the proposed model is able to capture geographical influence directly through its neighbor-aware decoder, which allows it to achieve better performance than the PACE model.<br />
<br />
[[File:model_comparison.JPG|center]]<br />
<br />
== Conclusion ==<br />
In summary, the proposed model, namely SAE-NAD, clearly showed its advantages compared to many state-of-the-art baseline methods. Its self-attentive encoder effectively discriminates user preferences on check-in POIs, and its neighbor-aware decoder measures geographical influence precisely through differentiating user reachability on unvisited POIs. By leveraging these two components together, it can generate recommendations that are highly relevant to its users.<br />
<br />
== Critiques ==<br />
Besides developing the model and conducting a detailed analysis, the authors also did very well in constructing this paper. The paper is well-written and has a highly logical structure. Definitions, notations, and metrics are introduced and explained clearly, which enables readers to follow through the analysis easily. Last but not least, both the abstract and the conclusion of this paper are strong. The abstract concisely reported the objectives and outcomes of the experiment, whereas the conclusion is succinct and precise.<br />
<br />
This idea would have many applications, such as suggesting new restaurants to customers in the food delivery service app. Would the improvement in accuracy outweigh the increased complexity of the model when it comes to use in industry?<br />
<br />
It would be nice if the authors could describe the extensive experiments on the real-world datasets, with different baseline methods and evaluation metrics, to demonstrate the effectiveness of the proposed model. Moreover, show the comparison result in tables vs other methodologies, both in terms of accuracy and time-efficiency. In addition, the drawbacks of this new methodology are unknown to the readers. In other words, how does this compare to the already established recommendation systems found in large scale applications utilize by companies like Netflix? Why use the proposed method over something simpler such as matrix factorization or collaborative filtering? <br />
<br />
It would also be nice if the authors provided some more ablation on the various components of the proposed method. Even after reading some of their experiments, we do not have a clear understanding of how important each component is to the recommendation quality.<br />
<br />
It is recommended that the author present how the encoder and the decoder help in the process by comparing different models with or without encoder and decoder.<br />
<br />
It would have been better if the author included all figures to explain the sensitivity of parameters more elaborately. In the paper he only included plots showing effects on Gowalla and Foursquare datasets but not other datasets.<br />
<br />
Some additional explanations can be inserted in the Objective Function part. From the paper, we can find <math>\lambda</math> is the regularization parameter and <math>W_a</math> and <math>w_t</math> are the learned parameters in the attention layer and aggregation layer.<br />
<br />
It would be better for the researchers to test the performance of the efficiency of this model, and compare it with other models to prove that this model is indeed better than other models.<br />
<br />
It would be more attractive if there is a section to introduce the applications that based on such algorithm in daily life. For instance, which application we use nowadays is based on this algorithm and what are the advantages of it compared to other similar algorithm?<br />
<br />
It would be useful to the readers if the authors briefly described covariates with which they are using to predict POIs. All that is detailed in this paper is that geographical information is used. Taking note of the characteristics that the data set the proposed method deals with is important, especially as phones become capabable of collecting more and more kinds of data. That is to say, the information available for predicting POI's in the future may be different to the information available today, so it is important to describe the data. (The information could even decrease in the future if privacy laws are enacted).<br />
<br />
The paper uses dataset from Gowalla, Yelp and Foursquare, which are all very well-known for providing crowd-sourced opinions about local restaurants. The provided example in the neighbour-aware decoder also uses Lazeez and Mr. Panino as an example in our daily lives. It would be useful to see how the POI recommendation system performs using datasets that are less geared towards restaurants and more towards other aspects of our lives.<br />
<br />
It would be a lot easier to understand the process if the pseudocode of the algorithm can be provided instead of just giving some equations.<br />
<br />
== References ==<br />
[1] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. 2014. GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation. In KDD. ACM, 831–840.<br />
<br />
[2] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In KDD. ACM, 1082–1090.<br />
<br />
[3] Yiding Liu, Tuan-Anh Pham, Gao Cong, and Quan Yuan. 2017. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. PVLDB 10, 10 (2017), 1010–1021.<br />
<br />
[4] Jie Bao, Yu Zheng, David Wilkie, and Mohamed F. Mokbel. 2015. Recommendations in location-based social networks: a survey. GeoInformatica 19, 3 (2015), 525–565.<br />
<br />
[5] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.<br />
<br />
[6] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM. IEEE Computer Society, 263–272.<br />
<br />
[7] Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In KDD. ACM, 659–667. [12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).<br />
<br />
[8] Yong Liu,WeiWei, Aixin Sun, and Chunyan Miao. 2014. Exploiting Geographical<br />
Neighborhood Characteristics for Location Recommendation. In CIKM. ACM,<br />
739–748<br />
<br />
[9] Xutao Li, Gao Cong, Xiaoli Li, Tuan-Anh Nguyen Pham, and Shonali Krishnaswamy.<br />
2015. Rank-GeoFM: A Ranking based Geographical Factorization<br />
Method for Point of Interest Recommendation. In SIGIR. ACM, 433–442.<br />
<br />
[10] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. 2017. Bridging<br />
Collaborative Filtering and Semi-Supervised Learning: A Neural Approach for<br />
POI Recommendation. In KDD. ACM, 1245–1254.</div>H48changhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Semantic_Relation_Classification%E2%80%94%E2%80%94via_Convolution_Neural_NetworkSemantic Relation Classification——via Convolution Neural Network2020-11-22T19:51:06Z<p>Z224jian: /* Critiques */</p>
<hr />
<div><br />
<br />
<br />
== Presented by ==<br />
Rui Gong, Xinqi Ling, Di Ma,Xuetong Wang<br />
<br />
== Introduction ==<br />
A Semantic Relation can imply a relation between different words and a relation between different sentences or phrases. For example, the pair of words "white" and "snowy" can be synonyms, while "white" and "black" can be antonyms. It can be used for recommendation systems like YouTube, and understanding sentiment analysis. The study of semantic analysis involves determining the exact meaning of a text. For example, the word "date", can have different meanings in different contexts, like a calendar "date", a "date" fruit, or a romantic "date". <br />
<br />
One of the emerging trends of natural language technologies is their use for the humanities and sciences (Gbor et al., 2018). SemEval 2018 Task 7 mainly solves the problem of relation extraction and classification of two entities in the same sentence into 6 potential relations. The 6 relations are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE.<br />
<br />
SemEval 2018 Task 7 extracted data from 350 scientific paper abstracts, which has 1228 and 1248 annotated sentences for two tasks, respectively. For each data, an example sentence was chosen with its right and left sentences, as well as an indicator showing whether the relation is reserved, then a prediction is made. <br />
<br />
Three models were used for the prediction: Linear Classifiers, Long Short-Term Memory(LSTM), and Convolutional Neural Networks (CNN). Linear Classifier achieves the goal of classification by making a classification decision based on the value of a linear combination of the characteristics. LSTM is an artificial recurrent neural network (RNN) architecture well suited to classifying, processing and making predictions based on time series data. In the end, the prediction based on the CNN model was ultimately submitted since it performed the best among all models. By using the learned custom word embedding function, the research team added a variant of negative sampling, thereby improving performance and surpassing ordinary CNN.<br />
<br />
== Previous Work ==<br />
SemEval 2010 Task 8 (Hendrickx et al., 2010) explored the classification of natural language relations and studied the 9 relations between word pairs. However, it is not designed for scientific text analysis, and their challenge differs from the challenge of this paper in its generalizability; this paper’s relations are specific to ACL papers (e.g. MODEL-FEATURE), whereas the 2010 relations are more general, and might necessitate more common-sense knowledge than the 2018 relations. Xu et al. (2015a) and Santos et al. (2015) both applied CNN with negative sampling to finish task7. The 2017 SemEval Task 10 also featured relation extraction within scientific publications.<br />
<br />
== Algorithm ==<br />
<br />
[[File:CNN.png|800px|center]]<br />
<br />
This is the architecture of CNN. We first transform a sentence via Feature embeddings. Word representations are encoded by the column vector in the embedding matrix <math> W^{word} \in \mathbb{R}^{d^w \times |V|}</math>, where <math>V</math> is the vocabulary of the dataset. Each colummn is the word embedding vector for the <math>i^{th}</math> word in the vocabulary. This matrix is trainale during the optimization process and initialized by pre-trained emmbedding vectors. Basically, we transform each sentence into continuous word embeddings:<br />
<br />
$$<br />
(e^{w_i})<br />
$$<br />
<br />
And word position embeddings:<br />
$$<br />
(e^{wp_i}): e_i = [e^{w_i}, e^{wp_i}]<br />
$$<br />
<br />
In the word embeddings, we generated a vocabulary <math> V </math>. We will then generate an embedding word matrix based on the position of the word in the vocabulary. This matrix is trainable and needs to be initialized by pre-trained embedding vectors such as through GloVe or Word2Vec.<br />
<br />
In the word position embeddings, we first need to input some words named ‘entities,’ and they are the key for the machine to determine the sentence’s relation. During this process, if we have two entities, we will use the relative position of them in the sentence to make the<br />
embeddings. We will output two vectors, and one of them keeps track of the first entity relative position in the sentence ( we will make the entity recorded as 0, the former word recorded as -1 and the next one 1, etc. ). And the same procedure for the second entity. Finally, we will get two vectors concatenated as the position embedding. For example, in the sentence "the black '''cat''' jumped", the position embedding of "'''cat'''" is -2,-1,0,1.<br />
<br />
<br />
<br />
After the embeddings, the model will transform the embedded sentence into a fix-sized representation of the whole sentence via the convolution layer. Finally, after the max-pooling to reduce the dimension of the output of the layers, we will get a score for each relation class via a linear transformation.<br />
<br />
<br />
After featurizing all words in the sentence. The sentence of length N can be expressed as a vector of length <math> N </math>, which looks like <br />
$$e=[e_{1},e_{2},\ldots,e_{N}]$$<br />
and each entry represents a token of the word. Also, to apply <br />
convolutional neural network, the subsets of features<br />
$$e_{i:i+j}=[e_{i},e_{i+1},\ldots,e_{i+j}]$$<br />
are given to a weight matrix <math> W\in\mathbb{R}^{(d^{w}+2d^{wp})\times k}</math> to <br />
produce a new feature, defiend as <br />
$$c_{i}=\text{tanh}(W\cdot e_{i:i+k-1}+bias)$$<br />
This process is applied to all subsets of features with length <math> k </math> starting <br />
from the first one. Then a mapped feature factor is produced:<br />
$$c=[c_{1},c_{2},\ldots,c_{N-k+1}]$$<br />
<br />
<br />
The max pooling operation is used, the <math> \hat{c}=max\{c\} </math> was picked.<br />
With different weight filter, different mapped feature vectors can be obtained. Finally, the original <br />
sentence <math> e </math> can be converted into a new representation <math> r_{x} </math> with a fixed length. For example, if there are 5 filters,<br />
then there are 5 features (<math> \hat{c} </math>) picked to create <math> r_{x} </math> for each <math> x </math>.<br />
<br />
Then, the score vector <br />
$$s(x)=W^{classes}r_{x}$$<br />
is obtained which represented the score for each class, given <math> x </math>'s entities' relation will be classified as <br />
the one with the highest score. The <math> W^{classes} </math> here is the model being trained.<br />
<br />
To improve the performance, “Negative Sampling" was used. Given the trained data point <br />
<math> \tilde{x} </math>, and its correct class <math> \tilde{y} </math>. Let <math> I=Y\setminus\{\tilde{y}\} </math> represent the <br />
incorrect labels for <math> x </math>. Basically, the distance between the correct score and the positive margin, and the negative <br />
distance (negative margin plus the second largest score) should be minimized. So the loss function is <br />
$$L=\log(1+e^{\gamma(m^{+}-s(x)_{y})})+\log(1+e^{\gamma(m^{-}-\mathtt{max}_{y'\in I}(s(x)_{y'}))})$$<br />
with margins <math> m_{+} </math>, <math> m_{-} </math>, and penalty scale factor <math> \gamma </math>.<br />
The whole training is based on ACL anthology corpus and there are 25,938 papers with 136,772,370 tokens in total, <br />
and 49,600 of them are unique.<br />
<br />
== Results ==<br />
Unlike traditional hyper-parameter optimization, there are some modifications to the model that are necessary in order to increase performance on the test set. There are 5 modifications that can be applied:<br />
<br />
'''1.''' Merged Training Sets. It combined two training sets to increase the data set<br />
size and it improves the equality between classes to get better predictions.<br />
<br />
'''2.''' Reversal Indicate Features. It added a binary feature.<br />
<br />
'''3.''' Custom ACL Embeddings. It embedded a word vector to an ACL-specific<br />
corps.<br />
<br />
'''4.''' Context words. Within the sentence, it varies in size on a context window<br />
around the entity-enclosed text.<br />
<br />
'''5.''' Ensembling. It used different early stop and random initializations to improve<br />
the predictions.<br />
<br />
These modifications performances well on the training data and they are shown<br />
in table 3.<br />
<br />
[[File:table3.PNG|center]]<br />
<br />
<br />
<br />
As we can see the best choice for this model is ensembling as the random initialization made the data more natural and avoided the overfit.<br />
During the training process, there are some methods such that they can only<br />
increase the score on the cross-validation test sets but hurt the performance on<br />
the overall macro-F1 score. Thus, these methods were eventually ruled out.<br />
<br />
<br />
[[File:table4.PNG|center]]<br />
<br />
There are six submissions in total. Three for each training set and the result<br />
is shown in figure 2.<br />
<br />
The best submission for the training set 1.1 is the third submission which does not<br />
use a separate cross-validation dataset. Instead, a constant number of<br />
training epochs are run with cross-validation based on the training data.<br />
<br />
This compares to the other submissions in which 10% of the training data are<br />
removed to form the validation set (both with and without stratification).<br />
Predictions for these are made when the model has the highest validation accuracy. <br />
<br />
The best submission for the training set 1.2 is the submission which<br />
extracted 10% of the training data to form the validation dataset. Predictions are<br />
made when maximum accuracy is reached on the validation data.<br />
<br />
All in all, early stopping cannot always be based on the accuracy of the validation set<br />
since it cannot guarantee to get better performance on the real test set. Thus,<br />
we have to try new approaches and combine them to see the prediction<br />
results. Also, doing stratification will certainly improve the performance of<br />
the test data.<br />
<br />
== Conclusions ==<br />
Throughout the process, we have experimented with linear classifiers, sequential random forest, LSTM, and CNN models with various variations applied, such as two models of attention, negative sampling, entity embedding or sentence-only embedding, etc. <br />
<br />
Among all variations, vanilla CNN with negative sampling and ACL-embedding, without attention, has significantly better performance than all others. Attention-based pooling, up-sampling, and data augmentation are also tested, but they barely perform positive increment on the behavior.<br />
<br />
== Critiques == <br />
<br />
- Applying this in news apps might be beneficial to improve readability by highlighting specific important sections.<br />
<br />
- The data set come from 350 scientist papers, this could be more explained by the author on how those paper are selected and why those paper are important to discuss.<br />
<br />
- In the section of previous work, the author mentioned 9 natural language relationships between the word pairs. Among them, 6 potential relationships are USAGE, RESULT, MODEL-FEATURE, PART WHOLE, TOPIC, and COMPARE. It would help the readers to better understand if all 9 relationships are listed in the summary.<br />
<br />
-This topic is interesting and this application might be helpful for some educational websites to improve their website to help readers focus on the important points. I think it will be nice to use Latex to type the equation in the sentence rather than center the equation on the next line. I think it will be interesting to discuss applying this way to other languages such as Chinese, Japanese, etc.<br />
<br />
- It would be a good idea if the authors can provide more details regarding ACL Embeddings and Context words modifications. Scores generated using these two modifications are quite close to the highest Ensembling modification generated score, which makes it a valid consideration to examine these two modifications in detail.<br />
<br />
- This paper is dealing with a similar problem as 'Neural Speed Reading Via Skim-RNN', num 19 paper summary. It will be an interesting approach to compare these two models' performance based on the same dataset.<br />
<br />
- I think it would be highly practical to implement this system as a page-rank system for search engines (such as google, bing, or other platforms like Facebook, Instagram, etc.) by finding the most prevalent information available in a search query and then matching the search to the related text which can be found on webpages. This could also be implemented in search bars on specific websites or locations as well.<br />
<br />
- It would be interesting to see in the future how the model would behave if data not already trained was used. This pre-trained data as mentioned in the paper had noise included. Using cleaner data would give better results maybe.<br />
<br />
- The selection of the training dataset, i.e. the abstracts of scientific papers, is an excellent idea since the abstracts usually contain more information than the body. But it may be also a good idea to train the model with the conclusions. Other than that, the result of applying the model to the body part of the scientific papers may show some interesting features of the model.<br />
<br />
- From Table 4 we find that comparing with using a fixed number of training periods, early stopping based on the accuracy of the validation set does not guarantee better test set performance. The label ratio of the validation set is layered according to the training set, which helps to improve the performance of the test set. Whether it is beneficial to add entity embedding as an additional feature could be an interesting point of discussion.<br />
<br />
- The author mentioned the use of CNNs for contextual understanding. NLP models based on CNNs have sometimes been inadequate for the use of understanding the context of words used in a body of text, being very prone to overfitting the context of the training data. It would help if more evidence was shown as to how the paper deals with this problem.<br />
<br />
- Deep neural network with a complex structure and huge parameter set is good at fitting the model. However, overfitting is a problem. Some strategies can be discussed like dropout strategies. Also, the choice of the most appropriate number of hidden layers is related to many factors like the scale of the training corpus. This can also be talked about in the paper.<br />
<br />
- It will be interesting to see: since it is CNN, applying dropout can improve the performance of the model like the one introduced in this paper [https://arxiv.org/pdf/1207.0580.pdf].<br />
<br />
- It would be better if the author could benchmark a variety of different regularization strategies to avoid overfitting, including dropout and data augmentation using generative networks. In addition, the author uses model averages to compute the ensembling network. It would also be interesting for the author to benchmark different ensembling approaches.<br />
<br />
== References ==<br />
[1] Diederik P Kingma and Jimmy Ba. 2014. Adam: A<br />
method for stochastic optimization. arXiv preprint<br />
arXiv:1412.6980.<br />
<br />
[2] DragomirR. Radev, Pradeep Muthukrishnan, Vahed<br />
Qazvinian, and Amjad Abu-Jbara. 2013. The ACL<br />
anthology network corpus. Language Resources<br />
and Evaluation, pages 1–26.<br />
<br />
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey<br />
Dean. 2013a. Efficient estimation of word<br />
representations in vector space. arXiv preprint<br />
arXiv:1301.3781.<br />
<br />
[4] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,<br />
and Jeff Dean. 2013b. Distributed representations<br />
of words and phrases and their compositionality.<br />
In Advances in neural information processing<br />
systems, pages 3111–3119.<br />
<br />
[5] Kata Gbor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Hafa Zargayouna,<br />
and Thierry Charnois. 2018. Semeval-2018 task 7:Semantic relation extraction and classification in scientific papers. <br />
In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval2018), New Orleans, LA, USA, June 2018.</div>R6gonghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-ExcitationOne-Shot Object Detection with Co-Attention and Co-Excitation2020-11-21T15:52:53Z<p>W28shen: /* Introduction */</p>
<hr />
<div>== Presented By ==<br />
Gautam Bathla<br />
<br />
== Background ==<br />
<br />
Object Detection is a technique where the model gets an image as an input and outputs the class and location of all the objects present in the image. The aim is to take a query image patch whose class label is not included in the training data and detect all instances of the same class in a target image.<br />
<br />
[[File:object_detection.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Object Detection on an image</div><br />
<br />
Figure 1 shows an example where the model identifies and locates all the instances of different objects present in the image successfully. It encloses each object within a bounding box and annotates each box with the class of the object present inside the box.<br />
<br />
State-of-the-art object detectors are trained on thousands of images for different classes before the model can accurately predict the class and spatial location for unseen images belonging to the classes the model has been trained on. When a model is trained with K labeled instances for each of N classes, then this setting is known as N-way K-shot classification. K = 0 for zero-shot learning, K = 1 for one-shot learning and K > 1 for few-shot learning.<br />
<br />
== Introduction ==<br />
<br />
This paper tackles the problem of one-shot object detection, where the model needs to find all the instances in the target image of the object in the query image for a given query image ''p''. The target and query image do not need to be exactly the same and are allowed to have variations as long as they share some attributes so that they can belong to the same category. In this paper, the authors have made contributions to three technical areas. First is the use of non-local operations to generate better region proposals for the target image based on the query image. This operation can be thought of as a co-attention mechanism. The second contribution is proposing a Squeeze and Co-Excitation mechanism to identify and give more importance to relevant features to filter out relevant proposals and hence the instances in the target image. Third, the authors designed a margin-based ranking loss which will be useful for predicting the similarity of region proposals with the given query image irrespective of whether the label of the class is seen or unseen during the training process. This paper introduces squeeze and co-excitation to extract the feature of classification object and proved that this architecture performs well to do object detection in both seen and never-seen classes.<br />
<br />
== Previous Work ==<br />
<br />
All state-of-the-art object detectors are variants of deep convolutional neural networks. There are two types of object detectors:<br />
<br />
1) Two-Stage Object Detectors: These types of detectors generate region proposals in the first stage whereas classify and refine the proposals in the second stage. Eg. FasterRCNN [1].<br />
<br />
2) One Stage Object Detectors: These types of detectors directly predict bounding boxes and their corresponding labels based on a fixed set of anchors. Eg. CornerNet [2].<br />
<br />
The work done to tackle the problem of few-shot object detection is based on transfer learning [3], meta-learning [4], and metric-learning.<br />
<br />
1) Transfer Learning: Chen et al. [3] proposed a regularization technique to reduce overfitting when the model is trained on just a few instances for each class belonging to unseen classes.<br />
<br />
2) Meta-Learning: Kang et al. [4] trained a meta-model to re-weight the learned weights of an image extracted from the base model.<br />
<br />
3) Metric-Learning: These frameworks replace the conventional classifier layer with the metric-based classifier layer.<br />
<br />
== Approach ==<br />
<br />
Let <math> C </math> be the set of classes for this object detection task. Since one-shot object detection task needs unseen classes during inference time, therefore we divide the set of classes into two categories as follows:<br />
<br />
<div style="text-align: center;"><math> C = C_0 \bigsqcup C_1,</math></div><br />
<br />
where <math>C_0</math> represents the classes that the model is trained on and <math>C_1</math> represents the classes on which the inference is done.<br />
<br />
[[File:architecture_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Architecture</div><br />
<br />
Figure 2 shows the architecture of the model proposed in this paper. The model architecture is based on FasterRCNN [1], and ResNet-50 [5] has been used as the backbone for extracting features from the images. The target image and the query image are first passed through the ResNet-50 module to extract the features from the same convolutional layer. The features obtained are next passed into the Non-local block as input and the output consists of weighted features for each of the images. The new weighted feature set for both images is passed through Squeeze and Co-excitation block which outputs the re-weighted features which act as an input to the Region Proposal Network (RPN) module. RCNN module also consists of a new loss that is designed by the authors to rank proposals in order of their relevance.<br />
<br />
==== Non-Local Object Proposals ====<br />
<br />
The need for non-local object proposals arises because the RPN module used in Faster R-CNN [1] has access to bounding box information for each class in the training dataset. The dataset used for training and inference in the case of Faster R-CNN [1] is not exclusive. In this problem, as we have defined above that we divide the dataset into two parts, one part is used for training and the other is used during inference. Therefore, the classes in the two sets are exclusive. If the conventional RPN module is used, then the module will not be able to generate good proposals for images during inference because it will not have any information about the presence of bounding-box for those classes.<br />
<br />
To resolve this problem, a non-local operation is applied to both sets of features. This non-local operation is defined as:<br />
\begin{align}<br />
y_i = \frac{1}{C(z)} \sum_{\forall j}^{} f(x_i, z_j)g(z_j) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
where ''x'' is a vector on which this operation is applied, ''z'' is a vector which is taken as an input reference, ''i'' is the index of output position, ''j'' is the index that enumerates over all possible positions, ''C(z)'' is a normalization factor, <math>f(x_i, z_j)</math> is a pairwise function like Gaussian, Dot product, concatenation, etc., <math>g(z_j)</math> is a linear function of the form <math>W_z \times z_j</math>, and ''y'' is the output of this operation.<br />
<br />
Let the feature maps obtained from the ResNet-50 model be <math> \phi{(I)} \in R^{N \times W_I \times H_I} </math> for target image ''I'' and <math> \phi{(p)} \in R^{N \times W_p \times H_p} </math> for query image ''p''. Taking <math> \phi{(p)} </math> as the input reference, the non-local operation is applied to <math> \phi{(I)} </math> and results in a non-local block, <math> \psi{(I;p)} \in R^{N \times W_I \times H_I} </math> . Analogously, we can derive the non-local block <math> \psi{(p;I)} \in R^{N \times W_p \times H_p} </math> using <math> \phi{(I)} </math> as the input reference. <br />
<br />
We can express the extended feature maps as:<br />
<br />
\begin{align}<br />
{F(I) = \phi{(I)} \oplus \psi{(I;p)} \in R^{N \times W_I \times H_I}} \&nbsp;\&nbsp;;\&nbsp;\&nbsp; {F(p) = \phi{(p)} \oplus \psi{(p;I)} \in R^{N \times W_p \times H_p}} \tag{2} \label{eq:o1}<br />
\end{align}<br />
<br />
where ''F(I)'' denotes the extended feature map for target image ''I'', ''F(p)'' denotes the extended feature map for query image ''p'' and <math>\oplus</math> denotes element-wise sum over the feature maps <math>\phi{}</math> and <math>\psi{}</math>.<br />
<br />
As can be seen above, the extended feature set for the target image ''I'' do not only contain features from ''I'' but also the weighted sum of the target image and the query image. The same can be observed for the query image. This weighted sum is a co-attention mechanism and with the help of extended feature maps, better proposals are generated when inputted to the RPN module.<br />
<br />
==== Squeeze and Co-Excitation ====<br />
<br />
The two feature maps generated from the non-local block above can be further related by identifying the important channels and therefore, re-weighting the weights of the channels. This is the basic purpose of this module. The Squeeze layer summarizes each feature map by applying Global Average Pooling (GAP) on the extended feature map for the query image. The Co-Excitation layer gives attention to feature channels that are important for evaluating the similarity metric. The whole block can be represented as:<br />
<br />
\begin{align}<br />
SCE(F(I), F(p)) = w \&nbsp;\&nbsp;;\&nbsp;\&nbsp; F(\tilde{p}) = w \odot F(p) \&nbsp;\&nbsp;;\&nbsp;\&nbsp; F(\tilde{I}) = w \odot F(I)\tag{3} \label{eq:op2}<br />
\end{align}<br />
<br />
where ''w'' is the excitation vector, <math>F(\tilde{p})</math> and <math>F(\tilde{I})</math> are the re-weighted features maps for query and target image respectively.<br />
<br />
In between the Squeeze layer and Co-Excitation layer, there exist two fully-connected layers followed by a sigmoid layer which helps to learn the excitation vector ''w''. The ''Channel Attention'' module in the architecture is basically these fully-connected layers followed by a sigmoid layer.<br />
<br />
==== Margin-based Ranking Loss ====<br />
<br />
The authors have defined a two-layer MLP network ending with a softmax layer to learn a similarity metric which will help rank the proposals generated by the RPN module. In the first stage of training, each proposal is annotated with 0 or 1 based on the IoU value of the proposal with the ground-truth bounding box. If the IoU value is greater than 0.5 then that proposal is labeled as 1 (foreground) and 0 (background) otherwise.<br />
<br />
Let ''q'' be the feature vector obtained after applying GAP to the query image patch obtained from the Squeeze and Co-Excitation block and ''r'' be the feature vector obtained after applying GAP to the region proposals generated by the RPN module. The two vectors are concatenated to form a new vector ''x'' which is the input to the two-layer MLP network designed. We can define ''x = [<math>r^T;q^T</math>]''. Let ''M'' be the model representing the two-layer MLP network, then <math>s_i = M(x_i)</math>, where <math>s_i</math> is the probability of <math>i^{th}</math> proposal being a foreground proposal based on the query image patch ''q''.<br />
<br />
The margin-based ranking loss is given by:<br />
<br />
\begin{align}<br />
L_{MR}(\{x_i\}) = \sum_{i=1}^{K}y_i \times max\{m^+ - s_i, 0\} + (1-y_i) \times max\{s_i - m^-, 0\} + \delta_{i} \tag{4} \label{eq:op3}<br />
\end{align}<br />
\begin{align}<br />
\delta_{i} = \sum_{j=i+1}^{K}[y_i = y_j] \times max\{|s_i - s_j| - m^-, 0\} + [y_i \ne y_j] \times max\{m^+ - |s_i - s_j|, 0\} \tag{5} \label{eq:op4}<br />
\end{align}<br />
<br />
where ''[.]'' is the Iversion bracket, i.e. the output will be 1 if the condition inside the bracket is true and 0 otherwise, <math>m^+</math> is the expected lower bound probability for predicting a foreground proposal, <math>m^-</math> is the expected upper bound probability for predicting a background proposal and <math>K</math> is the number of candidate proposals from RPN.<br />
<br />
The total loss for the model is given as:<br />
<br />
\begin{align}<br />
L = L_{CE} + L_{Reg} + \lambda \times L_{MR} \tag{6} \label{eq:op5}<br />
\end{align}<br />
<br />
where <math>L_{CE}</math> is the cross-entropy loss, <math>L_{Reg}</math> is the regression loss for bounding boxes of Faster R-CNN [1] and <math>L_{MR}</math> is the margin-based ranking loss defined above.<br />
<br />
For this paper, <math>m^+</math> = 0.7, <math>m^-</math> = 0.3, <math>\lambda</math> = 3, K = 128, C(z) in \eqref{eq:op} is the total number of elements in a single feature map of vector ''z'', and <math>f(x_i, z_j)</math> in \eqref{eq:op} is a dot product operation.<br />
\begin{align}<br />
f(x_i, z_j) = \alpha(x_i)^T \beta(z_j)\&nbsp;\&nbsp;;\&nbsp;\&nbsp;\alpha(x_i) = W_{\alpha} x_i \&nbsp;\&nbsp;;\&nbsp;\&nbsp; \beta(z_j) = W_{\beta} z_j \tag{7} \label{eq:op6}<br />
\end{align}<br />
<br />
== Results ==<br />
<br />
The model is trained and tested on two popular datasets, VOC and COCO. The ResNet-50 model was pre-trained on a reduced dataset by removing all the classes present in the COCO dataset, thus ensuring that the model has not seen any of the classes belonging to the inference images.<br />
<br />
==== Results on VOC Dataset ====<br />
<br />
[[File: voc_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 1:''' Results on VOC dataset</div><br />
<br />
For the VOC dataset, the model is trained on the union of VOC 2007 train and validation sets and VOC 2012 train and validation sets, whereas the model is tested on VOC 2007 test set. From the VOC results (Table 1), it can be seen that the model with pre-trained ResNet-50 on a reduced training set as the CNN backbone (Ours(725)) achieves better performance on seen and unseen classes than the baseline models. When the pre-trained ResNet-50 on the full training set (Ours(1K)) is used as the CNN backbone, then the performance of the model is increased significantly.<br />
<br />
==== Results on MSCOCO Dataset ====<br />
<br />
[[File: mscoco_splits.png|750px|center|Image: 500 pixels]]<br />
[[File: mscoco_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' Results on COCO dataset</div><br />
<br />
The model is trained on the COCO train2017 set and evaluated on the COCO val2017 set. The classes are divided into four groups and the model is trained with images belonging to three splits, whereas the evaluation is done on the images belonging to the fourth split. From Table 2, it is visible that the model achieved better accuracy than the baseline model. The bar chart value in the split figure shows the performance of the model on each class separately. The model is having some difficulties when predicting images belonging to classes like the book (split2), handbag (split3), and tie (split4) because of variations in their shape and textures.<br />
<br />
==== Overall Performance ====<br />
For VOC, the model that uses the reduced ImageNet model backbone with 725 classes achieves a better performance on both the seen and unseen classes. Remarkable improvements in the performance are seen with the backbone with 1000 classes. For COCO, the model achieves better accuracy than the Siamese Mask-RCNN model for both the seen and unseen classes.<br />
<br />
== Ablation Studies ==<br />
<br />
==== Effect of all the proposed techniques on the final result ====<br />
<br />
[[File: one_shot_detector_results.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 3:''' Effect of all thre techniques combined</div><br />
<br />
Figure 3 shows the effect of the three proposed techniques on the evaluation metric. The model performs worst when neither Co-attention nor Co-excitation mechanism is used. But, when either Co-attention or Co-excitation is used then the performance of the model is improved significantly. The model performs best when all the three proposed techniques are used.<br />
<br />
<br />
In order to understand the effect of the proposed modules, the authors analyzed each module separately.<br />
<br />
==== Visualizing the effect of Non-local RPN ====<br />
<br />
To demonstrate the effect of Non-local RPN, a heatmap of generated proposals is constructed. Each pixel is assigned the count of how many proposals cover that particular pixel and the counts are then normalized to generate a probability map.<br />
<br />
[[File: one_shot_non_local_rpn.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 4:''' Visualization of Non-local RPN</div><br />
<br />
From Figure 4, it can be seen that when a non-local RPN is used instead of a conventional RPN, the model is able to give more attention to the relevant region in the target image.<br />
<br />
==== Analyzing and Visualizing the effect of Co-Excitation ====<br />
<br />
To visualize the effect of excitation vector ''w'', the vector is calculated for all images in the inference set which are then averaged over images belonging to the same class, and a pair-wise Euclidean distance between classes is calculated.<br />
<br />
[[File: one_shot_excitation.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 5:''' Visualization of Co-Excitation</div><br />
<br />
From Figure 5, it can be observed that the Co-Excitation mechanism is able to assign meaningful weight distribution to each class. The weights for classes related to animals are closer to each other and the ''person'' class is not close to any other class because of the absence of common attributes between ''person'' and any other class in the dataset.<br />
<br />
[[File: analyzing_co_excitation_1.png|Analyzing Co-Exitation|500px|left|bottom|Image: 500 pixels]]<br />
<br />
[[File: analyzing_co_excitation_2.png|Analyzing Co-Excitation|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 6:''' Analyzing Co-Exitation</div><br />
<br />
To analyze the effect of Co-Excitation, the authors used two different scenarios. In the first scenario (Figure 6, left), the same target image is used for different query images. <math>p_1</math> and <math>p_2</math> query images have a similar color as the target image whereas <math>p_3</math> and <math>p_4</math> query images have a different color object as compared to the target image. When the pair-wise Euclidean distance between the excitation vector in the four cases was calculated, it can be seen that <math>w_2</math> was closer to <math>w_1</math> as compared to <math>w_4</math> and <math>w_3</math> was closer to <math>w_4</math> as compared to <math>w_1</math>. Therefore, it can be concluded that <math>w_1</math> and <math>w_2</math> give more importance to the texture of the object whereas <math>w_3</math> and <math>w_4</math> give more importance to channels representing the shape of the object.<br />
<br />
The same observation can be analyzed in scenario 2 (Figure 6, right) where the same query image was used for different target images. <math>w_1</math> and <math>w_2</math> are closer to <math>w_a</math> than <math>w_b</math> whereas <math>w_3</math> and <math>w_4</math> are closer to <math>w_b</math> than <math>w_a</math>. Since images <math>I_1</math> and <math>I_2</math> have a similar color object as the query image, we can say that <math>w_1</math> and <math>w_2</math> give more weightage to the channels representing the texture of the object, and <math>w_3</math> and <math>w_4</math> give more weightage to the channels representing shape.<br />
<br />
== Conclusion ==<br />
<br />
The resulting one-shot object detector outperforms all the baseline models on VOC and COCO datasets. The authors have also provided insights about how the non-local proposals, serving as a co-attention mechanism, can generate relevant region proposals in the target image and put emphasis on the important features shared by both target and query image.<br />
<br />
== Related Work ==<br />
<br />
'''Object detection''' SOTA object detectors are mostly deep convolutional neural networks. A popular pipeline is a two-stage approach, where detectors first generate a set of region proposals and then classify the proposals. The latest version is called Faster R-CNN [6], which works by replacing the grouping-based proposals originally found in R-CNN with a regional proposal network.<br />
<br />
'''Few-shot classification via metric learning''' The aim of this is to derive a similarity metric that can be used to infer unseen classes. One approach is to use Siamese networks, which learn a general similarity metric from using paired training data to decide whether the pair belongs to the same class. Then, during inference, the network matches unlabelled observations with a one-shot support set, where classification is done by asking which observed class is most similar [7].<br />
<br />
'''Few-shot object detection''' The problem of detection, like classification, can be assessed in a few-shot setting. However, the problem is quite novel so only preliminary results from transfer learning, meta learning, and metric learning exist.<br />
<br />
== Critiques ==<br />
<br />
The techniques proposed by the authors improve the performance of the model significantly as we saw that when either of Co-attention or Co-excitation is used along with Margin-based ranking loss then the model can detect the instances of query object in the target image. Also, the model trained is generic and does not require any training/fine-tuning to detect any unseen classes in the target image. The loss metric designed makes the learning process not to rely on only the labels of images since the proposed metric annotates each proposal as a foreground or a background which is then used to calculate the metric.<br />
Since it is exploiting many deep neural networks inside the main architecture, one critique that comes across is how time-consuming the proposed model is. The paper could have elucidated it more thoroughly whether the method is too time-consuming or not.<br />
<br />
== Source Code==<br />
[https://github.com/timy90022/One-Shot-Object-Detection link One-Shot-Object-Detection]<br />
<br />
== References ==<br />
<br />
[1] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.<br />
<br />
[2] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pages 765–781, 2018<br />
<br />
[3] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2836–2843, 2018.<br />
<br />
[4] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. CoRR, abs/1812.01866, 2018.<br />
<br />
[5] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.<br />
<br />
[6] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.<br />
<br />
[7] Gregory R. Koch. Siamese neural networks for one-shot image recognition. 2015.</div>Gbathlahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKSTHE LOGICAL EXPRESSIVENESS OF GRAPH NEURAL NETWORKS2020-11-19T19:17:57Z<p>Ssnaseem: </p>
<hr />
<div><br />
== Presented By ==<br />
Abhinav Jain<br />
<br />
== Background ==<br />
<br />
Graph neural networks (GNNs) (Merkwirth & Lengauer, 2005; Scarselli et al., 2009) are a class of neural network architectures that have recently become popular for a wide range of applications dealing with structured data such as molecule classification, knowledge graph completion, and Web page ranking (Battaglia et al., 2018; Gilmer et al., 2017; Kipf & Welling, 2017; Schlichtkrull et al., 2018). The main idea behind GNNs is that the connections between neurons are not arbitrary but reflect the structure of the input data. This approach is motivated by convolutional and recurrent neural networks and generalizes to both of them (Battaglia et al., 2018). Despite the fact that GNNs have recently been proven very efficient in many applications, their theoretical properties are not yet well-understood.<br />
<br />
The ability of graph neural networks (GNNs) for distinguishing nodes in graphs has been recently characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. The WL test works by constructing labeling of the nodes of the graph, in an incremental fashion, and then decides whether two graphs are isomorphic by comparing the labeling of each graph. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. To state the connection between GNNs and this test, consider the simple GNN architecture that updates the feature vector of each graph node by combining it with the aggregate of the feature vectors of its neighbors. Such GNNs are called aggregate-combine GNNs, or AC-GNNs. Moreover, there are AC-GNNs that can reproduce the WL labeling. This does not imply, however, that AC-GNNs can capture every node classifier—that is, a function assigning true or false to every node—that is refined by the WL test. This work aims to answer the question of what are the node classifiers that can be captured by GNN architectures such as AC-GNNs.<br />
<br />
== Introduction ==<br />
They tackle this problem by focusing on boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first-order logic. FOC2 is tightly related to the WL test, and hence to GNNs. They start by studying a popular class of GNNs called AC-GNNs in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. Given the connection between AC-GNNs and WL on the one hand, and that between WL and FOC2 on the other hand, one may be tempted to think that the expressivity of AC-GNNs coincides with that of FOC2. However, the reality is not as simple, and there are many FOC2 node classifiers (e.g., the trivial one above) that cannot be expressed by AC-GNNs. This leaves us with the following natural questions. First, what is the largest fragment of FOC2 classifiers that can be captured by AC-GNNs? Second, is there an extension of AC-GNNs that allows expressing all FOC2 classifiers? In this paper, they provide answers to these two questions. <br />
<br />
<br />
The following are the main contributions:<br />
<br />
1. They characterize exactly the fragment of FOC2 formulas that can be expressed as AC-GNNs. This fragment corresponds to graded modal logic (de Rijke, 2000) or, equivalently, to the description logic ALCQ, which has received considerable attention in the knowledge representation community (Baader et al., 2003; Baader & Lutz, 2007).<br />
<br />
2. Next, they extend the AC-GNN architecture in a very simple way by allowing global readouts, where in each layer they also compute a feature vector for the whole graph and combine it with local aggregations; they call these aggregate-combine-readout GNNs (ACR-GNNs). These networks are a special case of the ones proposed by Battaglia et al. (2018) for relational reasoning over graph representations. In this setting, they prove that an ACR-GNN can capture each FOC2 formula.<br />
<br />
They experimentally validate their findings showing that the theoretical expressiveness of ACR-GNNs, as well as the differences between AC-GNNs and ACR-GNNs, can be observed when they learn from examples. In particular, they show that on synthetic graph data conforming to FOC2 formulas, AC-GNNs struggle to fit the training data while ACR-GNNs can generalize even to graphs of sizes not seen during training.<br />
<br />
== Architecture ==<br />
This paper concentrates on the problem of boolean node classification: given a (simple, undirected) graph G = (V, E) in which each vertex v ∈ V has an associated feature vector xv, the authors aim to classify each graph node as true or false. This paper assumes that these feature vectors are one-hot encodings of node colors in the graph, from a finite set of colors. The neighborhood NG(v) of a node v ∈ V is the set {u | {v, u} ∈ E}. The basic architecture for GNNs, and the one studied in recent studies on GNN expressibility (Morris et al., 2019; Xu et al., 2019), consists of a sequence of layers that combine the feature vectors of every node with the multiset of feature vectors of its neighbors. Formally, let AGG and COM be two sets of aggregation and combination functions. An aggregate-combine GNN (AC-GNN) computes vectors <math>{x_v}^i</math> for every node v of the graph G, via the recursive formula<br />
<br />
[[File:a227-formula.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
Where each <math>{x_v}^0</math> is the initial feature vector <math>{x_v}</math> of v. Finally, each node v of G is classified according to a boolean classification function CLS applied to <math>{x_v}^{(L)}</math><br />
<br />
== Concepts ==<br />
=== 1. LOGICAL NODE CLASSIFIER ===<br />
Their study relates the power of GNNs to that of classifiers expressed in first-order (FO) predicate logic over (undirected) graphs where each vertex has a unique color (recall that they call these classifiers logical classifiers). For example,<br />
\[<br />
\alpha(x) := Red(x) \land \exists y (E(x,u) \land Blue(y)) \land \exists z (E(x,z) \land Green(z))<br />
\]<br />
has one free variable namely, <math> x </math> and two quantified variables <math> y </math> and <math> z </math>. Formally, the authors defined the following definition for a logical node calssifier.<br />
<br />
'''Definition 3.1''' A GNN classifier <math> \mathcal{A} </math> captures a logical classifier <math> \varphi (x) </math> if for every graph G and node v in G, it holds that <math> \mathcal{A}(G,v) = \textrm{true} </math> if and only if <math> (G,v) \models \varphi </math>.<br />
<br />
=== 2. LOGIC FOC2 ===<br />
The logic FOC2 allows for formulas using all FO constructs and counting quantifiers, but restricted to only two variables. Note that in terms of their logical expressiveness, FOC2 is strictly less expressive than FO (as counting quantifiers can always be mimicked in FO by using more variables and disequalities), but is strictly more expressive than FO2 - the fragment of FO that allows formulas to use only two variables (as β(x) belongs to FOC2 but not to FO2). The author gives the following proposition regarding the choice of logic FOC2 for measuring the expressiveness of AC-GNNs.<br />
<br />
'''Proposition 3.2''' For any graph G and nodes u,v in G, the WL test colors v and u the same after any number of rounds if and only if u and v are classified the same by all FOC2 classifiers.<br />
<br />
=== 3. FOC2 AND AC-GNN CLASSIFIER ===<br />
While it is true that two nodes are declared indistinguishable by the WL test if and only if they are indistinguishable by all FOC2 classifiers (Proposition 3.2), and if the former holds then such nodes cannot be distinguished by AC-GNNs (Proposition 2.1), this by no means tells us that every FOC2 classifier can be expressed as an AC-GNN. The answer to this problem is covered in the next section.<br />
<br />
=== THE EXPRESSIVE POWER OF AC-GNNS ===<br />
AC-GNNs capture any FOC2 classifier as long as they further restrict the formulas so that they satisfy such a locality property. This happens to be a well-known restriction of FOC2 and corresponds to graded modal logic (de Rijke, 2000), which is fundamental for knowledge representation. The idea of graded modal logic is to force all sub-formulas to be guarded by the edge predicate E. This means that one cannot express in graded modal logic arbitrary formulas of the form ∃yϕ(y), i.e., whether some node satisfies property ϕ. Instead, one is allowed to check whether some neighbor y of the node x where the formula is being evaluated satisfies ϕ. That is, they are allowed to express the formula ∃y (E(x, y) ∧ ϕ(y)) in the logic as in this case ϕ(y) is guarded by E(x, y).<br />
<br />
The relationship between AC-GNNs and graded modal logic goes further: they can show that graded modal logic is the “largest” class of logical classifiers captured by AC-GNNs. This means that the only FO formulas that AC-GNNs are able to learn accurately are those in graded modal logic.<br />
<br />
According to their theorem, a logical classifier is captured by AC-GNNs if and only if it can be expressed in graded modal logic. This holds no matter which aggregate and combines operators are considered, i.e., this is a limitation of the architecture for AC-GNNs, not of the specific functions that one chooses to update the features.<br />
<br />
The backward direction of this theorem is that a simple homogeneous AC-GNN captures each graded modal logic classifier.<br />
They point out that the forward direction holds no matter which aggregate and combine operators are considered, i.e., this is a limitation of the architecture for AC-GNNs, not of the specific functions that one chooses to update the features.<br />
<br />
=== ACR-GNNs ===<br />
The main shortcoming of AC-GNNs for expressing such classifiers is their local behavior. A natural way to break such a behavior is to allow for a global feature computation on each layer of the GNN. This is called a global attribute computation in the framework of Battaglia et al. (2018). Following the recent GNN literature (Gilmer et al., 2017; Morris et al., 2019; Xu et al., 2019), they refer to this global operation as a readout. Formally, an aggregate-combine-readout GNN (ACR-GNN) extends AC-GNNs by specifying readout functions READ(i), which aggregate the current feature vectors of all the nodes in a graph.<br />
Then, the vector <math>{x_v}^i</math> of each node v in G on each layer i is computed by the following formula:<br />
<br />
[[File:a227-formula-final.png|700px|center|Image: 700 pixels]]<br />
<br />
Intuitively, every layer in an ACR-GNN first computes (i.e., “reads out”) the aggregation over all the nodes in G; then, for every node v, it computes the aggregation over the neighbors of v; and finally, it combines the features of v with the two aggregation vectors.<br />
<br />
They know that AC-GNNs cannot capture this classifier. However, using a single readout plus local aggregations one can implement this classifier as follows. First, define by B the property as “having at least 2 blue neighbors”. Then an ACR-GNN that implements γ(x) can (1) use one aggregation to store in the local feature of every node if the node satisfies B, then (2) use a readout function to count how many nodes satisfying B exist in the whole graph, and (3) use another local aggregation to count how many neighbors of every node satisfy B.<br />
<br />
They then show that just one readout is enough. However, this reduction in the number of readouts comes at the cost of severely complicating the resulting GNN. Formally, an aggregate-combine GNN with final readout (AC-FR-GNN) results from using any number of layers as in the AC-GNN definition, together with a final layer uses a readout function.<br />
<br />
== Experiments ==<br />
The authors performed experiments with synthetic data to empirically validate their results. They perform two sets of experiments: experiments to show that ACR-GNNs can learn a very simple FOC2 node classifier that AC-GNNs cannot learn, and experiments involving complex FOC2 classifiers that need more intermediate readouts to be learned. Besides testing simple AC-GNNs, they also tested the GIN network proposed by Xu et al. (2019) (they consider the implementation by Fey & Lenssen (2019) and adapted it to classify nodes). Their experiments use synthetic graphs, with five initial colors encoded as one-hot features, divided into three sets: the train set with 5k graphs of size up to 50-100 nodes, the test set with 500 graphs of a size similar to the train set, and another test set with 500 graphs of size bigger than the train set. They tried several configurations for the aggregation, combination readout functions, and report the accuracy on the best configuration. In their experiments, accuracy is computed as the total number of nodes correctly classified among all nodes in all the graphs in the dataset. In every case, they run up to 20 epochs with the Adam optimizer. <br />
<br />
[[File:a227_table1.png|600px|center|Image: 600 pixels]]<br />
<br />
[[File:a227_table2.png|560px|center|Image: 600 pixels]]<br />
<br />
For both types of graphs, already single-layer ACR-GNNs showed perfect performance (ACR-1 in Table 1). This was what they expected given the simplicity of the property being checked. In contrast, AC-GNNs and GINs (shown in Table 1 as AC-L and GINL, representing AC-GNNs and GINs with L layers) struggle to fit the data. For the case of the line-shaped graph, they were not able to fit the train data even by allowing 7 layers. For the case of random graphs, the performance with 7 layers was considerably better.<br />
<br />
Table 2 above corresponds to the results E-R synthetic data for nodes labeled by the below classifier. ACR-GNNs performance up to 3 layers is reported. For the bigger test set, it was also observed that AC-GNNs and GINs are unable to substantially depart from a trivial baseline of 50%.<br />
<br />
[[File:a227eq6.png|400px|center|Image: 400 pixels]]<br />
<br />
'''Statistics of the datasets used for the above equation is shown below''' <br />
<br />
[[File:Paper13_Statistics_Dataset.png|center]]<br />
<br />
Results for Erdos-Renyi synthetic graphs with different connectivities are shown below<br />
<br />
[[File:CaptureLogical.PNG|center]]<br />
<br />
== Final Remarks ==<br />
<br />
The paper's results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. published as a conference paper at ICLR 2020 (2018) use global-context aware local descriptors to classify objects in 3D point clouds, You et al. (2019) construct node features by computing shortest-path distances to a set of distant anchor nodes, and Haonan et al. (2019) introduced the idea of a “star node” that stores global information of the graph. As mentioned before, their work is close in spirit to that of Xu et al. (2019) and Morris et al. (2019) establishing the correspondence between the WL test and GNNs.<br />
<br />
Regarding the results on the links between AC-GNNs and graded modal logic (Theorem 4.2), the very recent work of Sato et al. (2019) establishes close relationships between GNNs and certain classes of distributed local algorithms. These in turn have been shown to have strong correspondences with modal logics (Hella et al., 2015).<br />
== Source Code ==<br />
<br />
The code for this paper is freely available at [https://github.com/juanpablos/GNN-logic link GNN-logic]<br />
== Conclusion ==<br />
The authors were successful in establishing their claims with the help of ACR-GNNs. The results show the theoretical advantages of mixing local and global information when classifying nodes in a graph. Recent works have also observed these advantages in practice, e.g., Deng et al. published as a conference paper at ICLR 2020 (2018) use global-context aware local descriptors to classify objects in 3D point clouds.<br />
The authors would like to study how their results can be applied for extracting logical formulas from GNNs as possible explanations for their computations.<br />
<br />
== Critiques==<br />
<br />
The paper has been quite successful in solving the problem of binary classifiers in GNNs. The paper was released in 2019 and has already been cited 22 times. The content structure is very well organized, and the explanations are easy to understand for an average reader. They have also discussed future work and possibilities. They could have given more commentary about the performance difference across different classifiers.<br />
<br />
The fact that no actual difference in performance between AC-GNNs and ACR-GNNs was noticed in the only non-synthetic dataset used in the experiment should prompt the author to run experiments with more real-life datasets to verify the results empirically.<br />
<br />
== References ==<br />
[1] Franz Baader and Carsten Lutz. Description logic. In Handbook of modal logic, pp. 757–819. North-Holland, 2007.<br />
<br />
[2] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. PatelSchneider (eds.). The description logic handbook: theory, implementation, and applications. Cambridge University Press, 2003.<br />
<br />
[3] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vin´ıcius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, C¸ aglar Gulc¸ehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish ¨ Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL http://arxiv.org/abs/1806.01261.<br />
<br />
[4] Jin-Yi Cai, Martin Furer, and Neil Immerman. ¨ An optimal lower bound on the number of variables for graph identification. Combinatorica, 12(4):389–410, 1992.<br />
<br />
[5] Ting Chen, Song Bian, and Yizhou Sun. Are powerful graph neural nets necessary? A dissection on graph classification. CoRR, abs/1905.04579, 2019. URL https://arxiv.org/abs/1905.04579.</div>A227jainhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processesFunctional regularisation for continual learning with gaussian processes2020-11-18T02:38:25Z<p>P2torabi: </p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposed a new framework to regularise CL so that it does not forget previously learned tasks. This method, referred to as functional regularisation for Continual Learning, leverages the Gaussian process to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used in optimisation as a regulariser to prevent the model from completely deviating from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimisation of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularisation-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope to regularise the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularising weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in an increased forgetting of earlier tasks with long sequences of tasks.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimised using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularisation-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularises the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalisation of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimisation in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterised by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrised by the kernel function.<br />
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers the the distribution of functions, and <math>\mathcal{N}</math> refers to the distribution of finite random variables.<br />
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observation taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== A large class of examples of Gaussian processes ==<br />
<br />
The prima facie example of a Gaussian process in continuous time is a Brownian motion. It turns out that Brownian motion is a key ingredient in constructing a large class of Gaussian processes, which can be achieved through the Wiener integral. We write this down as a Proposition and prove it below.<br />
<br />
'''Proposition''' Let <math>B = \{ B_t \}_{t \geq 0}</math> be a Brownian motion on a filtered probability space and <math>f</math> a square integrable deterministic function. Then the process <math>X_t</math> given by the Wiener integral <math>X_t = \int_0^t f(s) dB_s</math> is a Gaussian process.<br />
<br />
For a reference to several elementary constructions of the Wiener integral, we refer the reader to the textbook by Kuo [6]. Intuitively, the stochastic process <math>X_t</math> can be thought as the gain or loss of the strategy <math>f(s)</math> when playing a fair game induced by tracking a Brownian motion.<br />
<br />
''Proof'' The Wiener integral of a simple function is the sum of independent centred normal random variables, which is in turn a centred Gaussian process. Since the space of Gaussian processes is closed in <math>L^2 (\Omega \times [0,T])</math>, as we take the limit as in the construction of the Wiener integral, the process converges and remains a centred Gaussian process. QED<br />
<br />
Using the above proposition, one simple example of a Gaussian process is a Brownian bridge, defined as <math>Y_t = (1-t) \int_0^t \frac{dB_s}{1-s}</math> for <math>0 \leq t < 1</math>. This process can be proved to be Gaussian using the proposition proved above.<br />
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German datasets. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in R^D</math>. However, in practice the following approxiation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\]<br />
Instead of having fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarise information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we are facing the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vector <math>X_i</math>. To see this, note that the <math>\Sigma_i</math> is a <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the Sparse Gaussian process with inducing points, which is introduced as follows.<br />
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrise <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrised by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimising the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximising the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_1, j)}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularise the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximised is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularisation from previous tasks}}\]<br />
<br />
As a result, we regularise the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimisation computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularisation term.<br />
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the author proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>. That is, performing inference over the current task in the weight space. Due to the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian viariational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularisation from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over their function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T</math>), where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random set <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix <math>K_{X_k}</math> can be constructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_K}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_K} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrised by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure where the final selected inducing points are spread out in different clusters of data. On the right side of the image, the round dots represent the data points and each colour corresponds to a different label. The left part of the image shows how optimised inducing images cover examples from all classes as opposed to the randomised inducing points where each example could have a skewed number of points from the same class.<br />
<br />
[[File:inducing-points-extended.png|centre]]<br />
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_kx_i,*}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_ix_i,*}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_ix_i,*}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasise that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete prediction.<br />
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in the practical setting. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density of a new task data and the prior GP density using symmetric KL divergence. We can measure the distance between the GP posterior density of a new task data and the prior GP density using symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math> - the smaller is <math>\ell_i</math> the more surprising is <math>x_i</math>. Before making any updates to the parameter, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot [7] benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimised using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes more and more important when the number of inducing points reduces.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularisation-based method and the replay/rehearsal method in continual learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way for remembering previous tasks by reducing the KL divergence of variational distribution: <math>q(\boldsymbol{u}_1)</math> and <math>p_\theta(u_1)</math>. The ideas in the paper are interesting and experiments support the effectiveness of this approach. After reading the summary, some points came to my mind as follows:<br />
<br />
The main problem with Gaussian Process is its test-time computational load where a Gaussian Process needs a data matrix and a kernel for each prediction. Although this seems to be natural as Gaussian Process is non-parametric and except for data, it has no source of knowledge, however, this comes with computational and memory costs which makes this difficult to employ them in practice. In this paper, the authors propose to employ a subset of training data namely "Inducing Points" to mitigate these challenges. They proposed to choose Inducing Points either at random or based on an optimisation scheme where Inducing Points should approximate the kernel. Although in the paper authors formulate the problem of Inducing Points in their formulation setting, this is not a new approach in the field and previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely the main is difference is that in the current paper kernel matrix and in the mentioned paper dissimilarities are employed to find Exemplars or induced points.<br />
<br />
Moreover, one unanswered question is how to determine the number of examplers as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice covariance matrices are positive semi-definite in general while to the best of my knowledge Cholesky decomposition can be used for positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.<br />
<br />
Finally, the number of regularisation terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> growth linearly with number of tasks, I am not sure how this algorithm works when number of tasks increases. Clearly, apart from computational cost, having many regularisation terms can make optimisation more difficult.<br />
<br />
The provided experiments seem interesting and quite enough and did a good job highlighting different facets of the paper but it would be better if these two additional results can be provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
The problem of balancing short term and long term memory of rewards is an important problem in reinforcement learning. It would be interesting to consider the applications of this algorithm and its variants to reinforcement learning problems. Additionally, for problems where the rewards may change over time, it would be interesting to compare this method to some of the existing ones.<br />
<br />
A useful resource for more continual learning in general is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.<br />
<br />
[6] Kuo, H. "Introduction to Stochastic Integration Springer." Berlin Heidelberg (2006).<br />
<br />
[7] Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338.</div>M372chenhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_RetrievalPre-Training Tasks For Embedding-Based Large-Scale Retrieval2020-11-17T23:53:48Z<p>P2torabi: </p>
<hr />
<div><br />
==Contributors==<br />
Pierre McWhannel wrote this summary with editorial and technical contributions from STAT 946 fall 2020 classmates. The summary is based on the paper "Pre-Training Tasks for Embedding-Based Large-Scale Retrieval" which was presented at ICLR 2020. The authors of this paper are Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar [1].<br />
<br />
==Introduction==<br />
The problem domain in which the paper is positioned is large-scale query-document retrieval. Given a query/question, the task is to collect relevant documents which contain the answer(s) and identify the span of words which contain the answer(s). This problem is often separated into two steps. First, a retrieval phase which reduces a large corpus to a much smaller subset. Variants of the classical probabilistic retrieval model BM-25 are usually used during this phase to rapidly filter a set of documents. BM-25 is considered as one of strongest baseline algorithms for IR and is known for being highly effective. It is build off of TF-IDF (term frequency, inverse document frequency) which accounts for document length and term frequency saturation. Second, a reader which is applied to read the shortened list of documents and score them based on their ability to answer the query and identify the span containing the answer, these spans can be ranked according to some scoring function. In other words, in the first step the goal is to use a model with high recall, and in the second one with high precision. In the setting of an open-domain question answering, the query, <math>q</math>, is a question and the documents <math>d</math> represent a passage which may contain the answer(s). The focus of this paper is on the retriever question answering (QA), in this setting a scoring function <math>f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} </math> is utilized to map <math>(q,d) \in \mathcal{X} \times \mathcal{Y}</math> to a real value which represents the score of how relevant the passage is to the query. We desire high scores for relevant passages and low scores otherwise and this is the job of the retriever. <br />
<br />
<br />
Certain characteristics desired of the retriever are high recall since the reader will only be applied to a subset meaning there is an opportunity to identify false positives but not false negatives. The other characteristic desired of a retriever is low latency meaning it is computationally efficient to be used for real-world applications. The popular approach to meet these heuristics has been the BM-25 [2] which relies on token-based matching between two high-dimensional and sparse vectors representing <math>q</math> and <math>d</math>. This approach performs well and retrieval can be performed in time sublinear to the number of passages with inverter indexing. The downfall of this type of algorithm is its inability to be optimized for a specific task. An alternate option is BERT [3] or transformer-based models [4] with cross attention between query and passage pairs which can be optimized for a specific task. For example, BERT assigned both masked language model and Next Sentence Prediction, pre-trained on various open resources data(BookCorpus + English WIKIPEDIA, CC NEWS, OpenWebtex, and Stories) with total of 160 GB. However, these models suffer from latency as the retriever needs to process all the pairwise combinations of a query with all passages. The next type of algorithm is an embedding-based model that jointly embeds the query and passage in the same embedding space and then uses measures such as cosine distance, inner product, or even the Euclidean distance in this space to get a score. The authors suggest the two-tower models can capture deeper semantic relationships in addition to being able to be optimized for specific tasks. This model is referred to as the "two-tower retrieval model", since it has two transformer-based models where one embeds the query <math>\phi{(\cdot)}</math> and the other the passage <math>\psi{(\cdot)}</math>, then the embeddings can be scored by <math>f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. This model can avoid the latency problems of a cross-layer model by pre-computing <math>\psi{(d)}</math> beforehand and then by utilizing efficient approximate nearest neighbor search algorithms in the embedding space to find the nearest documents.<br />
<br />
<br />
These two-tower retrieval models often use pre-trained weights from BERT. In BERT pre-trained model, the weight is obtained by the pre-training tasks are masked-LM (MLM) and next sentence prediction (NSP). The authors note that the research of pre-training tasks that improve the performance of two-tower retrieval models is an unsolved research problem. This is exactly what the authors have done in this paper. That is, to develop pre-training tasks for the two-tower model based on heuristics and then validated by experimental results. The pre-training tasks suggested are Inverse Cloze Task (ICT), Body First Search (BFS), Wiki Link Prediction (WLP), and combining all three. The authors used the Retrieval Question-Answering ('''ReQA''') benchmark [5] and used two datasets '''SQuAD'''[6] and '''Natural Questions'''[7] for training and evaluating their models.<br />
<br />
<br />
<br />
===Contributions of this paper===<br />
*The two-tower transformer model with proper pre-training can significantly outperform the widely used BM-25 Algorithm.<br />
*Paragraph-level pre-training tasks such as ICT, BFS, and WLP hugely improve the retrieval quality, whereas the most widely used pre-training task (the token-level masked-LM) gives only marginal gains.<br />
*The two-tower models with deep transformer encoders benefit more from paragraph-level pre-training compared to its shallow bag-of-word counterpart (BoW-MLP)<br />
<br />
==Background on Two-Tower Retrieval Models==<br />
This section is to present the two-tower model in more detail. As seen previously <math>q \in \mathcal{X}</math>, <math> d \in \mathcal{y}</math> are the query and document or passage respectively. The two-tower retrieval model consists of two encoders: <math>\phi: \mathcal{X} \rightarrow \mathbb{R}^k</math> and <math>\psi: \mathcal{Y} \rightarrow \mathbb{R}^k</math>. The tokens in <math>\mathcal{X},\mathcal{Y}</math> are a sequence of tokens representing the query and document respectively. The scoring function is <math> f(q,d) = \langle \phi{(q)},\psi{(d)} \rangle \in \mathbb{R} </math>. The cross-attention BERT-style model would have <math> f_{\theta,w}(q,d) = \psi_{\theta}{(q \oplus d)^{T}}w </math> as a scoring function and <math> \oplus </math> signifies the concatenation of the sequences of <math>q</math> and <math>d</math>, <math> \theta </math> are the parameters of the cross-attention model. The architectures of these two models can be seen below in figure 1.<br />
<br />
<br/><br />
[[File:two-tower-vs-cross-layer.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Comparisons===<br />
<br />
====Inference====<br />
In terms of inference, two-tower architecture has a distinct advantage as being easily seen from the architecture. Since the query and document are embedded independently, <math> \psi{(d)}</math> can be pre-computed. This is an important advantage as it makes the two-tower method feasible. As an example, the authors state the inner product between 100 query embeddings and 1 million document embeddings only takes hundreds of milliseconds on CPUs, however, for the equivalent calculation for a cross-attention-model it takes hours on GPUs. In addition, they remark the retrieval of the relevant documents can be done in sublinear time with respect <math> |\mathcal{Y}| </math> using a maximum inner product (MIPS) algorithm with little loss in the recall.<br />
<br />
====Learning====<br />
In terms of learning the two-tower model compared to BM-25 has a unique advantage of being possible to be fine-tuned for downstream tasks. This problem is within the paradigm of metric learning as the scoring function can be passed as input to an objective function of the log likelihood <math> \max_{\theta} \space log(p_{\theta}(d|q)) </math>. The conditional probability is represented by a SoftMax and then can be rewritten as:<br />
<br />
<br />
<math> p_{\theta}(d|q) = \frac{exp(f_{\theta}(q,d))}{\sum_{d' \in D} exp(f_{\theta}(q,d'))} \tag{1} \label{eq:sampled_SoftMax}</math><br />
<br />
<br />
Here <math> \mathcal{D} </math> is the set of all possible documents, the denominator is a summation over <math>\mathcal{D} </math>. As is customary, they utilize a sampled SoftMax technique to approximate the full-SoftMax. The technique they have employed is to use a small subset of documents in the current training batch, while also using a proper correcting term to ensure the unbiasedness of the partition function.<br />
<br />
====Why do we need pre-training then?====<br />
Although the two-tower model can be fine-tuned for a downstream task getting sufficiently labeled data is not always possible. This is an opportunity to look for better pre-training tasks as the authors have done. The authors suggest getting a set of pre-training task <math> \mathcal{T} </math> with labeled positive pairs of queries of documents/passages and then afterward fine-tuning on the limited amount of data for the downstream task.<br />
<br />
==Pre-Training Tasks==<br />
The authors suggest three pre-training tasks for the encoders of the two-tower retriever, these are ICT, BFS, and WLP. Keep in mind that ICT [8] is not a newly proposed task. These pre-training tasks are developed in accordance with two heuristics they state and our paragraph-level pre-training tasks in contrast to more common sentence-level or word-level tasks:<br />
<br />
# The tasks should capture different resolutions of semantics between query and document. The semantics can be the local context within a paragraph, global consistency within a document, and even semantic relation between two documents.<br />
# Collecting the data for the pre-training tasks should require as little human intervention as possible and be cost-efficient.<br />
<br />
All three of the following tasks can be constructed in a cost-efficient manner and with minimal human intervention by utilizing Wikipedia articles. In figure 2 below the dashed line surrounding the block of text is to be regarded as the document <math> d </math> and the rest of the circled portions of text make up the three queries <math>q_1, q_2, q_3 </math> selected for each task.<br />
<br />
<br/><br />
[[File:pre-train-tasks.PNG|center|Figure 1: end-to-end model]]<br />
<br/><br />
<br />
===Inverse Cloze Task (ICT)===<br />
This task is given a passage <math> p </math> consisting of <math> n </math> sentences <math>s_{i}, i=1,..,n </math>. <math> s_{i} </math> is randomly sampled and used as the query <math> q </math> and then the document is constructed to be based on the remaining <math> n-1 </math> sentences. This is utilized to meet the local context within a paragraph of the first heuristic. In figure 2 <math> q_1 </math> is an example of a potential query for ICT based on the selected document <math> d </math>.<br />
<br />
===Body First Search (BFS)===<br />
BFS is <math> (q,d) </math> pairs are created by randomly selecting a sentence <math> q_2 </math> and setting it as the query from the first section of a Wikipedia page which contains the selected document <math> d </math>. This is utilized to meet the global context consistency within a document as the first section in Wikipedia generally is an overview of the whole page and they anticipate it to contain information central to the topic. In figure 2 <math> q_2 </math> is an example of a potential query for BFS based on the selected document <math> d </math>.<br />
<br />
===Wiki Link Prediction (WLP)===<br />
The third task is selecting a hyperlink that is within the selected document <math> d </math>, as can be observed in figure 2 "machine learning" is a valid option. The query <math> q_3 </math> will then be a random sentence on the first section of the page where the hyperlinks redirects to. This is utilized to provide the semantic relation between documents.<br />
<br />
<br />
Additionally, Masked LM (MLM) is considered during the experimentations which again is one of the primary pre-training tasks used in BERT.<br />
<br />
==Experiments and Results==<br />
<br />
===Training Specifications===<br />
<br />
*Each tower is constructed from the 12 layers BERT-base model.<br />
*Embedding is achieved by applying a linear layer on the [CLS] token output to get an embedding dimension of 512.<br />
*Sequence lengths of 64 and 288 respectively for the query and document encoders.<br />
*Pre-trained on 32 TPU v3 chips for 100k step with Adam optimizer learning rate of <math> 1 \times 10^{-4} </math> with 0.1 warm-up ratio, followed by linear learning rate decay and batch size of 8192. (~2.5 days to complete)<br />
*Fine tuning on downstream task <math> 5 \times 10^{-5} </math> with 2000 training steps and batch size 512.<br />
*Tokenizer is WordPiece.<br />
<br />
===Pre-Training Tasks Setup===<br />
<br />
*MLM, ICT, BFS, and WLP are used.<br />
*Various combinations of the aforementioned tasks are tested.<br />
*Tasks define <math> (q,d) </math> pairings.<br />
*ICT's document <math> d </math> has the article title and passage separated by a [SEP] symbol as input to the document encoder.<br />
<br />
Below table 1 shows some statistics for the datasets constructed for the ICT, BFS, and WLP tasks. Token counts considered after the tokenizer is applied.<br />
<br />
[[File: wiki-pre-train-statistics.PNG | center]]<br />
<br />
===Downstream Tasks Setup===<br />
Retrieval Question-Answering ('''ReQA''') benchmark used.<br />
<br />
*Datasets: '''SQuAD''' and '''Natural Questions'''.<br />
*Each entry of data from datasets is <math> (q,a,p) </math>, where each element is, query <math> q </math>, answer <math> a </math>, and passage <math> p </math> containing the answer, respectively.<br />
*Authors split the passage to sentences <math>s_i</math> where <math> i=1,...n </math> and <math> n </math> is the number of sentences in the passage.<br />
*The problem is then recast for a given query to retrieve the correct sentence-passage <math> (s,p) </math> from all candidates for all passages split in this fashion.<br />
*They remark their reformulation makes it a more difficult problem.<br />
<br />
Note: Experiments done on '''ReQA''' benchmarks are not entirely open-domain QA retrieval as the candidate <math> (s,p) </math> only cover the training set of QA dataset instead of entire Wikipedia articles. There are some experiments they did with augmented the dataset as will be seen in table 6.<br />
<br />
Below table 2 shows statistics of the datasets used for the downstream QA task.<br />
<br />
<br />
[[File:data-statistics-30.PNG | center]]<br />
<br />
===Evaluation===<br />
<br />
*3 train/test splits: 1%/99%, 5%/95%, and 80%/20%.<br />
*10% of the training set is used as a validation set for hyper-parameter tuning.<br />
*Training never has seen the test queries in the test pairs of <math>(q,d).</math><br />
*Evaluation metric is recall@k since the goal is to capture positives in the top-k results.<br />
<br />
<br />
===Results===<br />
<br />
Table 3 shows the results on the '''SQuAD''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance, with the exception of R@1 for 1%/99% train/test split. BoW-MLP is used to justify the use of a transformer as the encoder since it has more complexity, BoW-MLP here looks up uni-grams from an embedding table, aggregates the embeddings with average pooling, and passes them through a shallow two-layer ML network with <math> tanh </math> activation to generate the final 512-dimensional query/document embeddings.<br />
<br />
[[File:table3-30.PNG | center]]<br />
<br />
Table 4 shows the results on the '''Natural Questions''' dataset, as observed the performance using the combination of all 3 pre-training tasks has the best performance in all testing scenarios.<br />
<br />
[[File:table4-30.PNG | center]]<br />
<br />
====Ablation Study====<br />
The embedding-dimensions, # of layers in BERT, and pre-training tasks are altered for this ablation study as seen in Table 5.<br />
<br />
All three tasks, ICT, BFS and WLP, perform significantly better than MLM when used individually. When compared with each other, ICT performs better than the other two tasks, followed by BFS and then WLP. Interestingly ICT+BFS+WLP is only an absolute of 1.5% better than just ICT in the low-data regime. The authors infer this could be representative of when there is insufficient downstream data for training that more global pre-training tasks become increasingly beneficial since BFS and WLP offer this.<br />
<br />
[[File:table5-30.PNG | center]]<br />
<br />
====Evaluation of Open-Domain Retrieval====<br />
Augment ReQA benchmark with large-scale (sentence, evidence passage) pairs extracted from general Wikipedia. This is added as one million external candidate pairs into the existing retrieval candidate set of the ReQA benchmark. <br />
<br />
[[File:table6-30.PNG | center]]<br />
<br />
The 3 talking points about the above results are as follows <br><br />
• The two-tower Transformer models pretrained with ICT+BFS+WLP and ICT substantially outperform the BM-25 baseline. <br><br />
• ICT+BFS+WLP pre-training method consistently improves the ICT pre-training method <br><br />
• Results here are fairly consistent with previous results.<br />
<br />
==Conclusion==<br />
<br />
The authors performed a comprehensive study and have concluded that properly designing paragraph-level pre-training tasks including ICT, BFS, and WLP for a two-tower transformer model can significantly outperform the BM-25 algorithm, which is an unsupervised method using weighted token-matching as a scoring function. The findings of this paper suggest that a two-tower transformer model with proper pre-training tasks can replace the BM25 algorithm used in the retrieval stage to further improve end-to-end system performance. The authors plan to test the pre-training tasks with a larger variety of encoder architectures and trying other corpora than Wikipedia and additionally trying pre-training in contrast to different regularization methods.<br />
<br />
==Critique==<br />
<br />
The authors of this paper suggest from their experiments that the combination of the 3 pretraining tasks ICT, BFS, and WLP yields the best results in general. However, they do acknowledge why two of the three pre-training tasks may not produce equally significant results as the three. From the ablation study and the small margin (1.5%) in which the three tasks achieved outperformed only using ICT in combination with observing ICT outperforming the three as seen in table 6 in certain scenarios. It seems plausible that a subset of two tasks could also potentially outperform the three or reach a similar performance. I believe this should be addressed in order to conclude three pre-training tasks are needed over two, though they never address this comparison I think it would make the paper more complete.<br />
<br />
The two-tower transformer model has two encoders, one for query encoding, and one for passage encoding. But in theory, queries and passages are all textual information. Do we really need two different encoders? Maybe using one encoder to encode both queries and passages can achieve competitive results. I'd like to see if any paper has ever covered this.<br />
<br />
As this paper falls under the family of approaches that involves the dual encoder architecture where the query and document are encoded independently of each other and efficient retrieval is achieved using approximate nearest neighbor search as they have used and they have just shown that such dilemma can be resolved with matching-oriented pretraining that imitate the matching between the question and paragraph in open-domain QA. However, the existing pretraining strategies could be highly sample-inefficient and typically require a very large batch size (up to thousands) such that diverse and effective negative question-paragraph pairs could be included in each batch. As mentioned in Equation \eqref{eq:sampled_SoftMax}, they utilized a sampled SoftMax technique to approximate the full-SoftMax, which has the main advantage which saves a lot of memory during training instead of load the whole dataset to the memory. Another thing that might be interesting to implement is to use dense representation for the 3 pretraining tasks ICT, BFS, and WLP as mention in this paper [9], which also used a different way of in-batch sampling during the training of the encoders and last and not least, instead of using the ICT in training, I would like to compare it with synthetic queries by considering two decoding techniques, which are beam search and nucleus sampling illustrated in this paper [10].<br />
It is important to note that pre-training tasks such as BFS and WLP require a specific structure of the document. BFS randomly selects a sentence in the first section of a Wikipedia page to be used as a query and a random paragraph for that same page as the document. WLP behaves the same way as BFS for the query, but samples the paragraph for another Wikipedia page linked to the current one using a hyperlink. These constraints may slow down the adoption of these methods on general text corpora.<br />
<br />
== References ==<br />
[1] Wei-Cheng Chang, Felix X Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.<br />
<br />
[2] Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.<br />
<br />
[3] Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2019.<br />
<br />
[4] Ashish Vaswani and Noam Shazeer, et al. Attention Is All You Need. arXiv:1706.03762, 2017.<br />
<br />
[5] Ahmad, Amin and Constant, Noah and Yang, Yinfei and Cer, Daniel. ReQA: An Evaluation for End-to-End Answer Retrieval Models. arXiv:1907.04780, 2019.<br />
<br />
[6] Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250, 2016.<br />
<br />
[7] Google. NQ: Natural Questions. https://ai.google.com/research/NaturalQuestions<br />
<br />
[8] Kenton Lee and Ming-Wei Chang and Kristina Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. arXiv:1906.00300, 2019.<br />
<br />
[9] Karpukhin, V., O˘guz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. CoRR (04 2020)<br />
<br />
[10] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). (2020)</div>Pmcwhannhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training-Tasks-For-Embedding-Based-Large-Scale-RetrievalPre-Training-Tasks-For-Embedding-Based-Large-Scale-Retrieval2020-11-17T23:52:34Z<p>Pmcwhann: Created page with "This page is a summary for NIPS 2016 paper <i>Dialog-based Language Learning</i> [1]. ==Introduction== One of the ways humans learn language, especially second language or la..."</p>
<hr />
<div>This page is a summary for NIPS 2016 paper <i>Dialog-based Language Learning</i> [1].<br />
==Introduction==<br />
One of the ways humans learn language, especially second language or language learning by students, is by communication and getting its feedback. However, most existing research in Natural Language Understanding has focused on supervised learning from fixed training sets of labeled data. This kind of supervision is not realistic of how humans learn, where language is both learned by, and used for, communication. When humans act in dialogs (i.e., make speech utterances) the feedback from other human’s responses contain very rich information. This is perhaps most pronounced in a student/teacher scenario where the teacher provides positive feedback for successful communication and corrections for unsuccessful ones. <br />
<br />
This paper is about dialog-based language learning, where supervision is given naturally and implicitly in the response of the dialog partner during the conversation. This paper is a step towards the ultimate goal of being able to develop an intelligent dialog agent that can learn while conducting conversations. Specifically, this paper explores whether we can train machine learning models to learn from dialog.<br />
<br />
===Contributions of this paper===<br />
*Introduce a set of tasks that model natural feedback from a teacher and hence assess the feasibility of dialog-based language learning. <br />
*Evaluated some baseline models on this data and compared them to standard supervised learning. <br />
*Introduced a novel forward prediction model, whereby the learner tries to predict the teacher’s replies to its actions, which yields promising results, even with no reward signal at all<br />
<br />
Code for this paper can be found on Github:https://github.com/facebook/MemNN/tree/master/DBLL<br />
<br />
==Background on Memory Networks==<br />
<br/><br />
[[File:ershad_dialognetwork.png|center|center|thumb|Figure 2: end-to-end model]]<br />
<br/><br />
A memory network combines learning strategies from the machine learning literature with a memory component that can be read and written to.</div>Pmcwhannhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUESuperGLUE2020-11-16T21:43:05Z<p>Ssnaseem: </p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based model such as ELMo [2], and Transformer [1] based models such as OpenAI GPT [3] and BERT[4], have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over some diverse tasks. However, the transformer-based models outperform the non-expert humans in several tasks. With transformer-based models achieving near-perfect scores on almost all tasks in GLUE and outperforming humans in some, there is a need for a new benchmark that involves harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark that has a more rigorous set of language understanding tasks.<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding tests since its release. In fact, the benchmark has promoted growth in language models with all the transformer-based models started with attempting to achieve high scores on GLUE. Original GPT and BERT models scored 72.8 and 80.2 on GLUE. However, the latest GPT and BERT models far outperform these benchmarks and strike a need for a more robust and difficult benchmark.<br />
<br />
Limits to current approaches are also apparent via the GLUE suite. Performance on the GLUE diagnostic entailment dataset falls far below the average human performance of 0.80 R3 reported in the original GLUE publication, with models performing near, or even below, chance on some linguistic phenomena.<br />
<br />
== Motivation ==<br />
Transformer based NLP models allow NLP models to train using transfer learning which was previously only seen in Computer Vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer Learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be finetuned for downstream tasks such as sentiment classification and fake news detection. The fine-tuned models beat many of the human labelers who weren’t experts in the domain. Thus, it creates a need for a newer, more robust baseline that can stay relevant with the rapid improvements in the field of NLP. <br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== Improvements to GLUE ==<br />
SuperGLUE follows the design principles of GLUE but seeks to improve on its predecessor in many ways:<br />
<br />
'''More challenging tasks:''' SuperGLUE contains the two hardest tasks in GLUE and open tasks that are difficult to current NLP approaches.<br />
<br />
'''More diverse task formats:''' SuperGLUE expands GLUE task formats to include coreference resolution and question answering.<br />
<br />
'''Comprehensive human baselines:''' Human performance estimates are provided for all benchmark tasks.<br />
<br />
'''Improved code support:''' SuperGLUE is built around the widely used tools including PyTorch and AllenNLP.<br />
<br />
'''Refined usage rules:''' SuperGLUE leaderboard ensures fair competition and full credit to creators.<br />
<br />
== Design Process ==<br />
<br />
SuperGLUE is designed to be widely applicable to many different NLP tasks. That being said, in designing SuperGLUE, certain criteria needed to be established to determine whether an NLP task can be completed. The authors specified six such requirements, which are listed below.<br />
<br />
#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.<br />
#'''Task difficulty:''' Tasks should be solvable by those who graduated from an English postsecondary institution.<br />
#'''Evaluability:''' Tasks are required to have an automated performance metric that aligns to human judgments of the output quality.<br />
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.<br />
#'''Task format:''' Preference for tasks with simpler input and output formats to steer users of the benchmark away from tasks specific architectures.<br />
#'''License:''' Task data must be under a license that allows the redistribution and use for research.<br />
<br />
To select tasks included in the benchmarks, the authors put a public request for NLP tasks and received many. From this, they filtered the tasks according to the criteria above and eliminated any tasks that could not be used due to licensing issues or other problems.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has 8 language understanding tasks. They test a model’s understanding of texts in English. The tasks are built to be equivalent to most college-educated English speakers' capabilities and are beyond the capabilities of most state-of-the-art systems today. <br />
<br />
'''BoolQ''' (Boolean Questions [9]): QA task consisting of short passage and related questions to the passage as either a yes or a no answer. <br />
<br />
'''CB''' (CommitmentBank [10]): Corpus of text where sentences have embedded clauses and sentences are written to keep the clause accurate. <br />
<br />
'''COPA''' (Choice of plausible Alternatives [11]): Reasoning tasks in which given a sentence the system must be able to choose the cause or effect of the sentence from two potential choices. <br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): QA task in which given a passage and potential answers, the model should label the answers as true or false. The Passages are taken from seven domains including news, fiction, and historical text etc.<br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice, question answering task, were given a passage with a masked entity, the model should be able to predict the masked out entity from the available choices. The articles are extracted from CNN and Daily Mail.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a text can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
technologies<br />
'''WSC''' (Winograd Schema Challenge, [16]): A conference resolution task where sentences include pronouns and noun phrases from the sentence. The goal is to identify the correct reference to a noun phrase corresponding to the pronoun.<br />
<br />
The table below briefly corresponds to the different tasks included in SuperGLUE along with the task type and size of the datasets. In the table, WSD stands for word sense disambiguation, NLI is natural language inference, coref. is coreference resolution, and QA is question answering.<br />
<br />
[[File: supergluetasks.png]]<br />
<br />
In the following chart[18], you can see the differences between the different benchmarks.<br />
[[File: superglue.JPG]]<br />
<br />
Development set examples from the tasks in SuperGLUE are shown below<br />
<br />
[[File: CaptureGlue.PNG]]<br />
<br />
<br />
===Scoring===<br />
With GLUE, they seek to give a sense of aggregate system performance overall tasks by averaging all tasks scores. Lacking a fair criterion to weigh the contributions of each task to the overall score, they opt for the simple approach of weighing each task equally and for tasks with multiple metrics, first averaging those metrics to get a task score.<br />
<br />
== Model Analysis ==<br />
SuperGLUE includes two tasks for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperGLUE are required to include predictions of sentence pair relation (entailment, not_entailment) on the resulting set for RTE task. As for gender bias, SuperGLUE includes a diagnostic dataset Winogender, which measures gender bias in co-reference resolution systems. A poor bias score indicates gender bias, however, a good score does not necessarily mean a model is unbiased. This is one limitation of the dataset.<br />
<br />
== Results ==<br />
<br />
Table 1 offers a summary of the results from SuperGLUE across different models. CBOW baselines are generally close to roughly chance performance. BERT, on the other hand, increased the SuperGLUE score by 25 points and had the highest improvement on most tasks, especially MultiRCC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size. <br />
<br />
BERT++[8] increases BERT’s performance even further. However, achieving the goal of the benchmark, the best model/score still lags behind compared to human performance. The human results for WiC, MltiRC, RTE, and ReCoRD were already available on [15], [12], [17], and [13] respectively. However, for the remaining tasks, the authors employed crowd workers to reannotate a sample of each test set according to the methods used in [17]. The large gaps should be relatively tricky for models to close in on. The biggest margin is for WSC with 35 points and CV, RTE, BoolQ, WiC all have 10 point margins.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Source Code ==<br />
<br />
The source code is available at https://github.com/nyu-mll/jiant .<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap that GLUE has created owing to its inability to keep up with the SOTA in NLP. The new language tasks that the benchmark offers are built to be more robust and difficult to solve for NLP models. With the difference in model accuracy being around 10-35 points across all tasks, SuperGLUE is definitely going to be around for some time before the models catch up to it, as well. Overall, this is a significant contribution to improve general-purpose natural language understanding. <br />
<br />
== Critique == <br />
This is quite a fascinating read where the authors of the gold-standard benchmark have essentially conceded to the progress in NLP. Bowman’s team resorting to creating a new benchmark altogether to keep up with the rapid pace of increase in NLP makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein’s Ruler, are we measuring the performance of models using the benchmark, or the quality of benchmarks using the models? <br />
<br />
I’m curious how long SuperGLUE would stay relevant owing to advances in NLP. GPT-3, released in June 2020, has outperformed GPT-2 and BERT by a huge margin, given the 100x increase in parameters (175B Parameters over ~600GB for GPT-3, compared to 1.5B parameters over 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern Exploiting Training) managed to train a Transformer NLP model with 223M parameters (roughly 0.01% parameters of GPT-3) and outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think superGLUE is nothing but a bandaid for the benchmarking tasks that will turn obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.<br />
<br />
[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.<br />
<br />
[18] Storks, Shane, Qiaozi Gao, and Joyce Y. Chai. "Recent advances in natural language inference: A survey of benchmarks, resources, and approaches." arXiv preprint arXiv:1904.01172 (2019).</div>S2sakhujhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Streaming_Bayesian_Inference_for_Crowdsourced_ClassificationStreaming Bayesian Inference for Crowdsourced Classification2020-11-16T21:03:02Z<p>Wq3zhao: /* Critique */</p>
<hr />
<div>Group 4 Paper Presentation Summary<br />
<br />
By Jonathan Chow, Nyle Dharani, Ildar Nasirov<br />
<br />
== Motivation ==<br />
Crowdsourcing can be a useful tool for data generation in classification projects by collecting annotations of large groups. Typically, this takes the form of online questions which many respondents will manually answer for payment. One example of this is Amazon's Mechanical Turk, a website where businesses (or "requesters") will remotely hire individuals(known as "Turkers" or "crowdsourcers") to perform human intelligence tasks (tasks that computers cannot do). In theory, it is effective in processing high volumes of small tasks that would be expensive to achieve in other methods.<br />
<br />
The primary limitation of this method to acquire data is that respondents can submit incorrect responses so that we couldn't ensure the data quality.<br />
<br />
Therefore, the success of crowd sourcing is limited by how well ground-truth can be determined. The primary method for doing so is probabilistic inference. However, current methods are computationally expensive, lack theoretical guarantees, or are limited to specific settings. In the meantime, there are some approaches to focus on how we sample the data from the crowd, rather than how we aggregate it to improve the accuracy of the system.<br />
<br />
== Dawid-Skene Model for Crowdsourcing ==<br />
The one-coin Dawid-Skene model is popular for contextualizing crowdsourcing problems. For task <math>i</math> in set <math>M</math>, let the ground-truth be the binary <math>y_i = {\pm 1}</math>. We get labels <math>X = {x_{ij}}</math> where <math>j \in N</math> is the index for that worker.<br />
<br />
We assume that we interact with workers in sequential fashion. At each time step <math>t</math>, a worker <math>j = a(t) </math> provides their label for an assigned task <math>i</math> and provides the label <math>x_{ij} = {\pm 1}</math>. We denote responses up to time <math>t</math> via superscript.<br />
<br />
We let <math>x_{ij} = 0</math> if worker <math>j</math> has not completed task <math>i</math>. We assume that <math>P(x_{ij} = y_i) = p_j</math>. This implies that each worker is independent and has equal probability of correctly labelling regardless of task. In crowdsourcing the data, we must determine how workers are assigned to tasks. We introduce two methods.<br />
<br />
Under uniform sampling, workers are allocated to tasks such that each task is completed by the same number of workers, rounded to the nearest integer, and no worker completes a task more than once. This policy is given by <center><math>\pi_{uni}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ | N_i^t | \}.</math></center><br />
<br />
Under uncertainty sampling, we assign more workers to tasks that are less certain. Assuming, we are able to estimate the posterior probability of ground-truth, we can allocate workers to the task with the lowest probability of falling into the predicted class. This policy is given by <center><math>\pi_{us}(t) = {\rm argmin}_{i \notin M_{a(t)}^t}\{ (max_{k \in \{\pm 1\}} ( P(y_i = k | X^t) ) \}.</math></center><br />
<br />
We then need to aggregate the data. The simple method of majority voting makes predictions for a given task based on the class the most workers have assigned it, <math>\hat{y}_i = \text{sign}\{\sum_{j \in N_i} x_{ij}\}</math>.<br />
<br />
== Streaming Bayesian Inference for Crowdsourced Classification (SBIC) ==<br />
The aim of the SBIC algorithm is to estimate the posterior probability, <math>P(y, p | X^t, \theta)</math> where <math>X^t</math> are the observed responses at time <math>t</math> and <math>\theta</math> is our prior. We can then generate predictions <math>\hat{y}^t</math> as the marginal probability over each <math>y_i</math> given <math>X^t</math>, and <math>\theta</math>.<br />
<br />
We factor <math>P(y, p | X^t, \theta) \approx \prod_{i \in M} \mu_i^t (y_i) \prod_{j \in N} \nu_j^t (p_j) </math> where <math>\mu_i^t</math> corresponds to each task and <math>\nu_j^t</math> to each worker.<br />
<br />
We then sequentially optimize the factors <math>\mu^t</math> and <math>\nu^t</math>. We begin by assuming that the worker accuracy follows a beta distribution with parameters <math>\alpha</math> and <math>\beta</math>. Initialize the task factors <math>\mu_i^0(+1) = q</math> and <math>\mu_i^0(-1) = 1 – q</math> for all <math>i</math>.<br />
<br />
When a new label is observed at time <math>t</math>, we update the <math>\nu_j^t</math> of worker <math>j</math>. We then update <math>\mu_i</math>. These updates are given by<br />
<br />
<center><math>\nu_j^t(p_j) \sim \text{Beta}(\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha, \sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(-x_{ij}) + \beta) </math></center><br />
<br />
<center><math>\mu_i^t(y_i) \propto \begin{cases} \mu_i^{t - 1}(y_i)\overline{p}_j^t & x_{ij} = y_i \\ \mu_i^{t - 1}(y_i)(1 - \overline{p}_j^t) & x_{ij} \ne y_i \end{cases}</math></center><br />
where <math>\hat{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \mu_i^{t - 1}(x_{ij}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta }</math>.<br />
<br />
We choose our predictions to be the maximum <math>\mu_i^t(k) </math> for <math>k= \{-1,1\}</math>.<br />
<br />
Depending on our ordering of labels <math>X</math>, we can select for different applications.<br />
<br />
== Fast SBIC ==<br />
The pseudocode for Fast SBIC is shown below.<br />
<br />
<center>[[Image:FastSBIC.png|800px|]]</center><br />
<br />
As the name implies, the goal of this algorithm is speed. To facilitate this, we leave the order of <math>X</math> unchanged.<br />
<br />
We express <math>\mu_i^t</math> in terms of its log-odds<br />
<center><math>z_i^t = \log(\frac{\mu_i^t(+1)}{ \mu_i^t(-1)}) = z_i^{t - 1} + x_{ij} \log(\frac{\overline{p}_j^t}{1 - \overline{p}_j^t })</math></center><br />
where <math>z_i^0 = \log(\frac{q}{1 - q})</math>.<br />
<br />
The product chain then becomes a summation and removes the need to normalize each <math>\mu_i^t</math>. We use these log-odds to compute worker accuracy,<br />
<br />
<center><math>\overline{p}_j^t = \frac{\sum_{i \in M_j^{t - 1}} \sigma(x_{ij} z_i^{t-1}) + \alpha}{|M_j^{t - 1}| + \alpha + \beta}</math></center><br />
where <math>\sigma(z_i^{t-1}) := \frac{1}{1 + exp(-z_i^{t - 1})} = \mu_i^{t - 1}(+1) </math><br />
<br />
The final predictions are made by choosing class <math>\hat{y}_i^T = \text{sign}(z_i^T) </math>. We see later that Fast SBIC has similar computational speed to majority voting.<br />
<br />
== Sorted SBIC ==<br />
To increase the accuracy of the SBIC algorithm in exchange for computational efficiency, we run the algorithm in parallel giving labels in different orders. The pseudocode for this algorithm is given below.<br />
<br />
<center>[[Image:SortedSBIC.png|800px|]]</center><br />
<br />
From the general discussion of SBIC, we know that predictions on task <math>i</math> are more accurate toward the end of the collection process. This is a result of observing more data points and having run more updates on <math>\mu_i^t</math> and <math>\nu_j^t</math> to move them further from their prior. This means that task <math>i</math> is predicted more accurately when its corresponding labels are seen closer to the end of the process.<br />
<br />
We take advantage of this property by maintaining a distinct “view” of the log-odds for each task. When a label is observed, we update views for all tasks except the one for which the label was observed. At the end of the collection process, we process skipped labels. When run online, this process must be repeated at every timestep.<br />
<br />
We see that sorted SBIC is slower than Fast SBIC by a factor of M, the number of tasks. However, we can reduce the complexity by viewing <math>s^k</math> across different tasks in an offline setting when the whole dataset is known in advance.<br />
<br />
== Theoretical Analysis ==<br />
The authors prove an exponential relationship between the error probability and the number of labels per task. This allows for an upper bound to be placed on the error, in the asymptotic regime, where it can be assumed the SBIC predictions are close to the true values. The two theorems, for the different sampling regimes, are presented below.<br />
<br />
<center>[[Image:Theorem1.png|800px|]]</center><br />
<br />
<center>[[Image:Theorem2.png|800px|]]</center><br />
<br />
== Empirical Analysis ==<br />
The purpose of the empirical analysis is to compare SBIC to the existing state of the art algorithms. The SBIC algorithm is run on five real-world binary classification datasets. The results can be found in the table below. Other algorithms in the comparison are, from left to right, majority voting, expectation-maximization, mean-field, belief-propagation, Monte-Carlo sampling, and triangular estimation. <br />
<br />
First of all, the algorithms are run on a synthetic data that meets the assumptions of an underlying one-coin Dawid-Skene model, which allows the authors to compare SBIC's performance empirically with the theoretical results previously shown. <br />
<br />
Second, the algorithms are run on real-world data. The performance of each algorithm is shown in the table below.<br />
<br />
<center>[[Image:RealWorldResults.png|800px|]]</center><br />
<br />
In bold are the best performing algorithms for each dataset. We see that both versions of the SBIC algorithm are competitive, having similar prediction errors to EM, AMF, and MC. All are considered state-of-the-art Bayesian algorithms.<br />
<br />
The figure below shows the average time required to simulate predictions on synthetic data under an uncertainty sampling policy. We see that Fast SBIC is comparable to majority voting and significantly faster than the other algorithms. This speed improvement, coupled with comparable accuracy, makes the Fast SBIC algorithm powerful.<br />
<br />
<center>[[Image:TimeRequirement.png|800px|]]</center><br />
<br />
== Conclusion and Future Research ==<br />
In conclusion, we have seen that SBIC is computationally efficient, accurate in practice, and has theoretical guarantees. The authors intend to extend the algorithm to the multi-class case in the future.<br />
<br />
== Critique ==<br />
In crowdsourcing data, the cost associated with collecting additional labels is not usually prohibitively expensive. As a result, if there is concern over ground-truth, paying for additional data to ensure X is sufficiently dense may be the desired response as opposed to sacrificing ground-truth accuracy. This may result in the SBIC algorithm being less practically useful than intended. Perhaps, a study can be done to compare the cost of acquiring additional data and checking how much it improves the accuracy of ground-truth. This can help us decide the usefulness of this algorithm. Second, SBIC should be used on multiple types of data such as audio, text, and images to ensure that it delivers consistent results.<br />
<br />
The paper is tackling the classic problem of aggregating labels in a crowdsourced application, with a focus on speed. The algorithms proposed are fast and simple to implement and come with theoretical guarantees on the bounds for error rates. However, the paper starts with an objective of designing fast label aggregation algorithms for a streaming setting yet doesn’t spend any time motivating the applications in which such algorithms are needed. All the datasets used in the empirical analysis are static datasets. Therefore for this paper to be useful, the problem considered should be well motivated. It also appears that the output from the algorithm depends on the order in which the data is processed, which may need to be clarified. Finally, the theoretical results are presented under the assumption that the predictions of the FBI converge to the ground truth; however, the reasoning behind this assumption is not explained.<br />
<br />
The paper assumes that crowdsourcing from human beings is systematic: that is, respondents to the problems would act in similar ways that can be classified into some categories. There are lots of other factors that need to be considered for a human respondent, such as fatigue effects and conflicts of interest. Those factors would seriously jeopardize the validity of the results and the model if they were not carefully designed and taken care of. For example, one formally accurate subject reacts badly to the subject one day generating lots of faulty data, and it would take lots of correct votes to even out the effects. Even in lots of medical experiment that involves human subjects, with rigorous standards and procedures, the results could still be invalid. The trade-off for speed while sacrificing the validity of results is not wise.<br />
<br />
When introducing the Streaming Bayesian Inference for Crowdsourced Classification method, it was explicitly mentioned that one can select for different applications based on the ordering of labels X. However, "applications" mentioned here were not further explained or explored to support the effectiveness and meaning of developing such an algorithm. Thus, it would be sufficient for the author to build a connection between the proposed algorithm and its real-world application to make this summary more purposeful and engaging.<br />
<br />
In the Dawid-Skene Model for Crowdsourcing part, the summary just indicated the simplest part of the data aggregation which is unusual in the real world. Instead, we can use Bayesian methods which infer the value of the latent variables y and p by estimating their posterior probability P(y,p|X,θ) given the observed data X and prior θ.<br />
<br />
As a mathematical model, which type of classification dataset(such as audio or text) Dawid-Skene Model has potential advantages, can be discussed in detail in "future research" to give readers a more intuitive experience.<br />
<br />
== References ==<br />
[1] Manino, Tran-Thanh, and Jennings. Streaming Bayesian Inference for Crowdsourced Classification. 33rd Conference on Neural Information Processing Systems, 2019</div>Jdhchowhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_ProcessesInfluenza Forecasting Framework based on Gaussian Processes2020-11-16T04:06:27Z<p>M59jiang: /* Conclusion */</p>
<hr />
<div><br />
== Abstract ==<br />
<br />
This paper presents a novel framework for seasonal epidemic forecasting using Gaussian process regression. Compared with other state-of-the-art models, the new novel framework introduced in this paper shows accurate retrospective forecasts through training a subset of the CDC influenza-like-illness (ILI) dataset using the official CDC scoring rule (log-score).<br />
<br />
== Background ==<br />
<br />
Each year, the seasonal influenza epidemic affects public health systems across the globe on a massive scale. In 2019-2020 alone, the United States reported 38 million cases, 400 000 hospitalizations, and 22 000 deaths [1]. A direct result of this is the need for reliable forecasts of future influenza development because they allow for improved public health policies and informed resource development and allocation. Many statistical methods have been developed to use data from the CDC and other real-time data sources, such as Google Trends to forecast influenza activities.<br />
<br />
Given the process of data collection and surveillance lag, accurate statistics for influenza warning systems are often delayed by some margin of time, making early prediction imperative. However, there are challenges in long-term epidemic forecasting. First, the temporal dependency is hard to capture with short-term input data. Without manually added seasonal trends, most statistical models fail to provide high accuracy. Second, the influence from other locations has not been exhaustively explored with limited data input. Spatio-temporal effects would therefore require adequate data sources to achieve good performance.<br />
<br />
== Related Work ==<br />
<br />
Given the value of epidemic forecasts, the CDC regularly publishes ILI data and has funded a seasonal ILI forecasting challenge. This challenge has led to four state-of-the-art models in the field: Historical averages, a method using the counts of past seasons to build a kernel density estimate; MSS, a physical susceptible-infected-recovered model with assumed linear noise [5]; SARIMA, a framework based on seasonal auto-regressive moving average models [2]; and LinEns, an ensemble of three linear regression models.<br />
<br />
One disadvantage of forecasting is that the publication of CDC official ILI data is usually delayed by 1-3 weeks. Therefore, several attempts to use indirect web-based data to gain more timely estimates of CDC’s ILI, which is also known as nowcasting. The forecasting framework introduced in this paper can use those nowcasting approaches to receive a more timely data stream.<br />
<br />
== Motivation ==<br />
<br />
It has been shown that LinEns forecasts outperform the other frameworks on the ILI data-set. However, this framework assumes a deterministic relationship between the epidemic week and its case count, which does not reflect the stochastic nature of the trend. Therefore, it is natural to ask whether a similar framework that assumes a stochastic relationship between these variables would provide better performance. This motivated the development of the proposed Gaussian process regression framework and the subsequent performance comparison to the benchmark models.<br />
<br />
== Gaussian Process Regression ==<br />
<br />
A Gaussian process is specified with a mean function and a kernel function. Besides, the mean and the variance function of the Gaussian process is defined by:<br />
\[\mu(x) = k(x, X)^T(K_n+\sigma^2I)^{-1}Y\]<br />
\[\sigma(x) = k^{**}(x,x)-k(x,X)^T(K_n+\sigma^2I)^{-1}k(x,X)\]<br />
where <math>K_n</math> is the covariance matrix and for the kernel, it is being specified that it will use the Gaussian kernel.<br />
<br />
Consider the following set up: let <math>X = [\mathbf{x}_1,\ldots,\mathbf{x}_n]</math> <math>(d\times n)</math> be your training data, <math>\mathbf{y} = [y_1,y_2,\ldots,y_n]^T</math> be your noisy observations where <math>y_i = f(\mathbf{x}_i) + \epsilon_i</math>, <math>(\epsilon_i:i = 1,\ldots,n)</math> i.i.d. <math>\sim \mathcal{N}(0,{\sigma}^2)</math>, and <math>f</math> is the trend we are trying to model (by <math>\hat{f}</math>). Let <math>\mathbf{x}^*</math> <math>(d\times 1)</math> be your test data point, and <math>\hat{y} = \hat{f}(\mathbf{x}^*)</math> be your predicted outcome.<br />
<br />
<br />
Instead of assuming a deterministic form of <math>f</math>, and thus of <math>\mathbf{y}</math> and <math>\hat{y}</math> (as classical linear regression would, for example), Gaussian process regression assumes <math>f</math> is stochastic. More precisely, <math>\mathbf{y}</math> and <math>\hat{y}</math> are assumed to have a joint prior distribution. Indeed, we have <br />
<br />
$$<br />
(\mathbf{y},\hat{y}) \sim \mathcal{N}(0,\Sigma(X,\mathbf{x}^*))<br />
$$<br />
<br />
where <math>\Sigma(X,\mathbf{x}^*)</math> is a matrix of covariances dependent on some kernel function <math>k</math>. In this paper, the kernel function is assumed to be Gaussian and takes the form <br />
<br />
$$<br />
k(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\exp(-\frac{1}{2}(\mathbf{x}_i-\mathbf{x}_j)^T\Sigma(\mathbf{x}_i-\mathbf{x}_j)).<br />
$$<br />
<br />
It is important to note that this gives a joint prior distribution of '''functions''' ('''Fig. 1''' left, grey curves). <br />
<br />
By restricting this distribution to contain only those functions ('''Fig. 1''' right, grey curves) that agree with the observed data points <math>\mathbf{x}</math> ('''Fig. 1''' right, solid black) we obtain the posterior distribution for <math>\hat{y}</math> which has the form<br />
<br />
$$<br />
p(\hat{y} | \mathbf{x}^*, X, \mathbf{y}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}),\sigma(\mathbf{x}^*,X))<br />
$$<br />
<br />
<br />
<div style="text-align:center;"> [[File:GPRegression.png|500px]] </div><br />
<br />
<div align="center">'''Figure 1. Gaussian process regression''': <br />
<div align="center"><br />
Select the functions from your joint prior distribution (left, grey curves) with mean <math>0</math> (left, bold line) that agree with the observed data points (right, black bullets). </div><br />
<div align="center">These form your posterior distribution (right, grey curves) with mean <math>\mu(\mathbf{x})</math> (right, bold line). Red triangle helps compare the two images (location marker) [4]. </div><br />
</div><br />
<br />
== Data-set ==<br />
<br />
Let <math>d_j^i</math> denote the number of epidemic cases recorded in week <math>j</math> of season <math>i</math>, and let <math>j^*</math> and <math>i^*</math> denote the current week and season, respectively. The ILI data-set contains <math>d_j^i</math> for all previous weeks and seasons, up to the current season with a 1-3 week publishing delay. Note that a season refers to the time of year when the epidemic is prevalent (e.g. an influenza season lasts 30 weeks and contains the last 10 weeks of year k, and the first 20 weeks of year k+1). The goal is to predict <math>\hat{y}_T = \hat{f}_T(x^*) = d^{i^*}_{j* + T}</math> where <math>T, \;(T = 1,\ldots,K)</math> is the target week (how many weeks into the future that you want to predict).<br />
<br />
To do this, a design matrix <math>X</math> is constructed where each element <math>X_{ji} = d_j^i</math> corresponds to the number of cases in week (row) j of season (column) i. The training outcomes <math>y_{i,T}, i = 1,\ldots,n</math> correspond to the number of cases that were observed in target week <math>T,\; (T = 1,\ldots,K)</math> of season <math>i, (i = 1,\ldots,n)</math>.<br />
<br />
== Proposed Framework ==<br />
<br />
To compute <math>\hat{y}</math>, the following algorithm is executed: <br />
<br />
<ol><br />
<br />
<li> Let <math>J \subseteq \{j^*-4 \leq j \leq j^*\}</math> (subset of possible weeks).<br />
<br />
<li> Assemble the Training Set <math>\{X_J, \mathbf{y}_{T,J}\}</math> .<br />
<br />
<li> Train the Gaussian process.<br />
<br />
<li> Calculate the '''distribution''' of <math>\hat{y}_{T,J}</math> using <math>p(\hat{y}_{T,J} | \mathbf{x}^*, X_J, \mathbf{y}_{T,J}) \sim \mathcal{N}(\mu(\mathbf{x}^*,X,\mathbf{y}_{T,J}),\sigma(\mathbf{x}^*,X_J))</math>.<br />
<br />
<li> Set <math>\hat{y}_{T,J} =\mu(x^*,X_J,\mathbf{y}_{T,J})</math>.<br />
<br />
<li> Repeat steps 2-5 for all sets of weeks <math>J</math>.<br />
<br />
<li> Determine the best 3 performing sets J (on the 2010/11 and 2011/12 validation sets).<br />
<br />
<li> Calculate the ensemble forecast by averaging the 3 best performing predictive distribution densities i.e. <math>\hat{y}_T = \frac{1}{3}\sum_{k=1}^3 \hat{y}_{T,J_{best}}</math>.<br />
<br />
</ol><br />
<br />
== Results ==<br />
<br />
To demonstrate the accuracy of their results, retrospective forecasting was done on the ILI data-set. Retrospective forecasting involves using information from a specific time period in the past for example a certain week of the previous year and tries to forecast future values, for example, the next week's cases. Then, the forecasting accuracy can be evaluated on the actual observed value. In other words, the Gaussian process model was trained to assume a previous season (2012/13) was the current season. In this fashion, the forecast could be compared to the already observed true outcome. <br />
<br />
To produce a forecast for the entire 2012/13 season, 30 Gaussian processes were trained (each influenza season has 30 test points <math>\mathbf{x^*}</math>) and a curve connecting the predicted outputs <math>y_T = \hat{f}(\mathbf{x^*)}</math> was plotted ('''Fig.2''', blue line). As shown in '''Fig.2''', this forecast (blue line) was reliable for both 1 (left) and 3 (right) week targets, given that the 95% prediction interval ('''Fig.2''', purple shaded) contained the true values ('''Fig.2''', red x's) 95% of the time.<br />
<br />
<div style="text-align:center;"> [[File:ResultsOne.png|600px]] </div><br />
<br />
<div align="center">'''Figure 2. Retrospective forecasts and their uncertainty''': <br />
<div align="center">One week retrospective influenza forecasting for two targets (T = 1, 3). Red x’s are the true observed values, and blue lines and purple shaded areas represent point forecasts and 95% prediction intervals, respectively. </div><br />
</div><br />
<br />
Moreover, as shown in '''Fig.3''', the novel Gaussian process regression framework outperformed all state-of-the-art models, included LinEns, for four different targets <math>(T = 1,\ldots, 4)</math>, when compared using the official CDC scoring criterion ''log-score''. Log-score describes the logarithmic probability of the forecast being within an interval around the true value. <br />
<br />
<div style="text-align:center;"> [[File:ComparisonNew.png|600px]] </div><br />
<br />
<div align="center">'''Figure 3. Average log-score gain of proposed framework''': <br />
<div align="center"> Each bar shows the mean seasonal log-score gain of the proposed framework vs. the given state-of-the-art model, and each panel corresponds to a different target week <math> T = 1,...4 </math>. </div><br />
</div><br />
<br />
== Conclusion ==<br />
<br />
This paper presented a novel framework for forecasting seasonal epidemics using Gaussian process regression that outperformed multiple state-of-the-art forecasting methods on the CDC's ILI data-set. We predict seasonal influenza and show that it can produce accurate point forecasts and accurate uncertainty quantification. Since influenza data has only been collected from 2003, therefore as more data is obtained: the training models can be further improved upon with more robust model optimizations. Hence, this work may play a key role in future influenza forecasting and as result, the improvement of public health policies and resource allocation.<br />
<br />
== Critique ==<br />
<br />
The proposed framework provides a computationally efficient method to forecast any seasonal epidemic count data that is easily extendable to multiple target types. In particular, one can compute key parameters such as the peak infection incidence <math>\left(\hat{y} = \text{max}_{0 \leq j \leq 52} d^i_j \right)</math>, the timing of the peak infection incidence <math>\left(\hat{y} = \text{argmax}_{0 \leq j \leq 52} d^i_j\right)</math> and the final epidemic size of a season <math>\left(\hat{y} = \sum_{j=1}^{52} d^i_j\right)</math>. However, given it is not a physical model, it cannot provide insights on parameters describing the disease spread. Moreover, the framework requires training data and hence, is not applicable for non-seasonal epidemics.<br />
<br />
Consider the Gaussian Process Regression, we can apply it on the COVID-19. Hence, we can add the data obtained from COVID-19 to the data-set part. Then apply the algorithm that was outlined from the project. We can predict the high wave period to better protect ourselves. But the thing is how can we obtain the training data from COVID-19 since there will be many missing values since many people may not know whether they have been affected or not. Thus, it will be a big problem to obtain the best training data.<br />
<br />
There is one Gaussian property which makes most of its models extremely tractable. A linear combination of two or more Gaussian distributions is still a Gaussian distribution. It's very easy to determine the mean and variance of the newly constructed Gaussian distribution. But there's a constraint of applying this property to several Gaussian distributions. All Gaussian distributions should be linearly independent of others, otherwise, the linear combination distribution might not be Gaussian anymore.<br />
<br />
This paper provides a state of the art approach for forecasting epidemics. It would have been interesting to see other types of kernels being used, such as a periodic kernel <math>k(x, x') = \sigma^2 \exp\left({-\frac{2 \sin^2 (\pi|x-x'|)/p}{l^2}}\right) </math>, as intuitively epidemics are known to have waves within seasons. This may have resulted in better-calibrated uncertainty estimates as well.<br />
<br />
It is mentioned that the the framework might not be good for non-seasonal epidemics because it requires training data, given that the COVID-19 pandemic comes in multiple waves and we have enough data from the first wave and second wave, we might be able to use this framework to predict the third wave and possibly the fourth one as well. It'd be interesting to see this forecasting framework being trained using the data from the first and second wave of COVID-19.<br />
<br />
Gaussian Process Regression (GPR) can be extended to COVID-19 outbreak forecast [3]. A Multi-Task Gaussian Process Regression (MTGP) model is a special case of GPR that yields multiple outputs. It was first proposed back in 2008, and its recent application was in the battery capacity prediction by Richardson et. al. The proposed MTGP model was applied on Worldwide, China, India, Italy, and USA data that covers COVID-19 statistics from Dec 31 2019 to June 25 2020. Results show that MTGP outperforms linear regression, support vector regression, random forest regression, and long short-term memory model when it comes to forecasting accuracy.<br />
<br />
A critique for GPR is that it does not take into account the physicality of the epidemic process. So it would be interesting to investigate if we can incorporate elements of physical processes in this such as using Ordinary Differential Equations (ODE) based modeling and seeing if that improves prediction accuracy or helps us explain phenomena.<br />
<br />
It would be better to have a future work section to discuss the current shortage and explore the possible improvement and applications in the future.<br />
<br />
It will be interesting to provide a comparison of methods described in this paper with optimal control theory for modelling infectious disease.<br />
<br />
This is a very nice summary and everything was discussed clearly. I think it will be interesting that apply GPR by using the data in COVID-19, maybe this will be helpful to give a sense that how likely that 3rd wave of COVID-19 will happen. Maybe it will be better to discuss more that a training data set from a location will not train a good model for other countries. (p.s. I think <math> \sigma </math> is the standard deviation, sometimes saying variance function with the notation <math> \sigma </math> confused me.).<br />
<br />
This is a very interesting paper of using GPR to forecast influenza, and that is definitely something applicable to the situation we have right now. From the benchmarking section, it seems like GPR outperforms LinEns by a narrow margin, so it would be interesting to spend some time discussing the performance difference between GPR and LinEns to see if the benefit outweighs the cost.<br />
<br />
== References ==<br />
<br />
[1] Estimated Influenza Illnesses, Medical visits, Hospitalizations, and Deaths in the United States - 2019–2020 Influenza Season. (2020). Retrieved November 16, 2020, from https://www.cdc.gov/flu/about/burden/2019-2020.html<br />
<br />
[2] Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A.,and Reich, N. G. (2017).Infectious disease prediction with kernel conditional densityestimation.Statistics in Medicine, 36(30):4908–4929.<br />
<br />
[3] Ketu, S., Mishra, P.K. Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection. Appl Intell (2020). https://doi.org/10.1007/s10489-020-01889-9<br />
<br />
[4] Schulz, E., Speekenbrink, M., and Krause, A. (2017).A tutorial on gaussian process regression with a focus onexploration-exploitation scenarios.bioRxiv.<br />
<br />
[5] Zimmer, C., Leuba, S. I., Cohen, T., and Yaesoubi, R.(2019).Accurate quantification of uncertainty in epidemicparameter estimates and predictions using stochasticcompartmental models.Statistical Methods in Medical Research,28(12):3591–3608.PMID: 30428780.<br />
<br />
[6] Bussell E. H., Dangerfield C. E., Gilligan C. A. and Cunniffe N. J. 2019Applying optimal control theory to complex epidemiological models to inform real-world disease managementPhil. Trans. R. Soc. B37420180284 http://doi.org/10.1098/rstb.2018.0284</div>D62smithhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Influenza_Forecasting_Framework_based_on_Gaussian_processes_SummaryInfluenza Forecasting Framework based on Gaussian processes Summary2020-11-16T04:02:40Z<p>D62smith: Created page with "Test"</p>
<hr />
<div>Test</div>D62smithhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_FeaturesModel Agnostic Learning of Semantic Features2020-11-15T21:45:09Z<p>P2torabi: </p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research which focuses on storing knowledge from one domain (source domain) to solve a similar problem in another domain (target domain). In addition to regular transfer learning, one can use "transfer metric learning" by which similarity relationship between samples [1], [2] are used to learn a metric space in which robust and discriminative data representation are formed. However, both of these kinds of techniques work insofar as the domain shift between source and target domains is negligible. Domain shift is defined as the deviation in the distribution of the source domain and the target domain and it would cause the deep neural network (DNN) model to completely fail. Multi-domain learning (MDL) is the solution when the assumption of "source domain and target domain come from an almost identical distribution" may not hold. There are two variants of MDL in the literature that can be confused, i.e. domain generalization, and domain adaptation; however in domain adaptation, we have access to the target domain data somehow, while that is not the case in domain generalization. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by the model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originated from model-agnostic meta-learning (MAML), episodic training has been widely leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. Meta-Learning for domain generalization (MLDG) [4] closely follows MAML in terms of back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation as the use of the task objective might be sub-optimal since it only uses class probabilities. Most of the works [3,7] in the literature lack notable guidance from the semantics of feature space, which contains crucial domain-independent ‘general knowledge’ that can be useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
Also known as "learning to learn", Model-agnostic meta-learning is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test sets, the updated weights are calculated using the algorithm below.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
<math> F_{\psi}: X \rightarrow Z</math> where <math> Z </math> is a feature space<br />
<math> T_{\theta}: X \rightarrow \mathbf {R}^{C}</math> where <math> C </math> is the number of classes in <math> Y </math><br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized with minimizing a cross-entropy loss namely <math> \mathbf{L}_{task} </math> formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y}) = - \sum_{c}1[y=C]log(\hat{y}_{c})</math><br />
</div><br />
<br />
Although the task loss is a decent predictor, nothing prevents the model from overfitting to the source domains and suffering from degradation on unseen test domains. This issue is considered in other loss terms.<br />
<br />
===Model-Agnostic Learning with Episodic Training===<br />
The key of their learning procedure is an episodic training scheme, originated from model-agnostic meta-learning, to expose the model optimization to distribution mismatch. In line with their goal of domain generalization, the model is trained on a sequence of simulated episodes with domain shift. Specifically, at each iteration, the available domains <math>D</math> are randomly split into sets of meta-train <math>D_{tr}</math> rand meta-test <math>D_{te}</math> domains. The model is trained to semantically perform well on held-out <math>D_{te}</math> after being optimized with one or more steps of gradient descent with <math>D_{tr}</math> domains. In our case, the feature extractor’s and task network’s parameters,ψ and θ, are first updated from the task-specific supervised loss <math>L</math> task(e.g. cross-entropy for classification), computed on meta-train:<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts. These relationships are invariant to changes in observation domains. Capturing and preserving such class relationships can help models generalize well on unseen data. To achieve this, a global layout of extracted features are imposed such that the relative locations of extracted features reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, the inter-class alignment across domains is disregarded. Hence, minimizing symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes has been used:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
The authors stated that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference with KL over symmetry.<br />
<br />
=== Local cluster sampling ===<br />
<math> L_{global} </math> captures inter-class relationships, we also want to make semantic features close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, have been used to ensure that the semantic features, locally cluster according to only class labels, regardless of the domain. Contrastive loss takes two samples as input and makes samples of the same class closer while pushing away samples of different classes.<br />
[[File: contrastive.png | 400px]]<br />
<br />
Conversely, triplet loss takes three samples as input: one anchor, one positive, and one negative. Triplet loss tries to make relevant samples closer than irrelevant ones.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme showed in the below figure:<br />
[[File:algo2.PNG|600px|center]]<br />
<br />
The training architecture and three losses are also illustrated as below:<br />
<br />
[[File:Ashraf99.png|800px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method has been demonstrated using two common benchmark datasets for domain generalization, i.e. VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, the AlexNet with ImageNet pre-trained weights has been utilized. <br />
<br />
=== VLCS ===<br />
VLCS[12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16] <br />
leave-one-domain-out validation with randomly dividing each domain into 70% training and 30% test.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG[4], in the table below on this dataset, indicating that semantic properties would provide superior performance with respect to purely highly-abstracted task loss on meta-test. "DeepAll" in the table is the case in which there is no domain generalization. In DeepAll case the class labels have been used only, regardless of the domain each sample would lie in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. This dataset contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As you can see in the table below, MASF outperforms state of the art JiGen[18], MLDG[4], MetaReg[3], significantly. In addition, the best improvement has achieved (6.20%) when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it is different from other domains significantly.<br />
<br />
[[File:Figure2 image.png|600px|center]]<br />
<div align="center">t-SNE visualizations of extracted features.</div> <br />
<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, the authors have performed additional experiments using advanced deep residual architectures like ResNet-18 and ResNet-50. The below table shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of the MASF has been also demonstrated using a segmentation task of MRI images gathering from four different clinical centers denoted as (Set-A, Set-B, Set-C, and Set-D). The domain shift, in this case, would occur due to differences in hardware, acquisition protocols, and many other factors, hindering translating learning-based methods to real clinical practice. The authors attempted to segment the brain images into four classes: background, grey matter, white matter, and cerebrospinal fluid. Tasks such as these have enormous impact in clinical diagnosis and aiding in treatment. For example, designing a similar net to segment between healthy brain tissue and tumorous brain tissue could aid surgeons in brain tumour resection.<br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results showed the effectiveness of the MASF in comparison to not use domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
The work proposes a new domain generalization technique by taking the advantage of global and local constraints for learning semantic feature spaces, which outperforms the state-of-the-art. The power and effectiveness of this method has been demonstrated using two domain generalization benchmarks and a real clinical dataset (MRI image segmentation). The code is publicly available at [19]. As future work, it would be interesting to integrate the proposed loss functions with other methods as they are orthogonal to each other and evaluate the benefit of doing so. Also, investigating the usage of the current learning procedure in the context of generative models would be an interesting research direction.<br />
<br />
== Critiques ==<br />
<br />
The purpose of this paper is to help guide learning in semantic feature space by leveraging local similarity. The authors argument may contain essential domain-independent general knowledge for domain generalization to solve this issue. In addition to adopting constructive loss and triplet loss to encourage the clustering for solving this issue. Extracting robust semantic features regardless of domains can be learned by leveraging from the across-domain class similarity information, which is important information during learning. The learner would suffer from indistinct decision boundaries if it could not separate the samples from different source domains with separation on the domain invariant feature space and in-dependent class-specific cohesion. The major problem that will be revealed with large datasets is that these indistinct decision boundaries might still be sensitive to the unseen target domain.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>Msikarouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Describtion_of_Text_MiningDescribtion of Text Mining2020-11-15T17:13:14Z<p>J27ni: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Yawen Wang, Danmeng Cui, Zijie Jiang, Mingkang Jiang, Haotian Ren, Haris Bin Zahid<br />
<br />
== Introduction ==<br />
This paper focuses on the different text mining techniques and the applications of text mining in the healthcare and biomedical domain. The text mining field has been popular as a result of the amount of text data that is available in different forms by devising patterns and trends by means such as statistical pattern learning. The text data is bound to grow even more in 2020, indicating a 50 times growth since 2010. Text is unstructured information, which is easy for humans to construct and understand but difficult for machines. Hence, there is a need to design algorithms to effectively process this avalanche of text. To further explore the text mining field, the related text mining approaches can be considered. The different text mining approaches relate to two main methods: knowledge delivery and traditional data mining methods. <br />
<br />
The authors note that knowledge delivery methods involve the application of different steps to a specific data set to create specific patterns. Research in knowledge delivery methods has evolved over the years due to advances in hardware and software technology. On the other hand, data mining has experienced substantial development through the intersection of three fields: databases, machine learning, and statistics. As brought out by the authors, text mining approaches focus on the exploration of information from a specific text. The information explored is in the form of structured, semi-structured, and unstructured text. It is important to note that text mining covers different sets of algorithms and topics that include information retrieval. The topics and algorithms are used for analyzing different text forms.<br />
<br />
==Text Representation and Encoding ==<br />
The authors review multiple methods of preprocessing text, including 4 methods to preprocess and recognize influence and frequency of individual group of words in a document. In many text mining algorithms, one of the key components is preprocessing. Preprocessing consists of different tasks that include filtering, tokenization, stemming, and lemmatization. The first step is tokenization, where a character sequence is broken down into different words or phrases. After the breakdown, filtering is carried out to remove some words. The various word inflected forms are grouped together through lemmatization, and later, the derived roots of the derived words are obtained through stemming.<br />
<br />
'''1. Tokenization'''<br />
<br />
This process splits text (i.e. a sentence) into a single unit of words, known as tokens while removing unnecessary characters. Tokenization relies on identifying word boundaries, that is ending of a word and beginning of another word, usually separated by space. Characters such as punctuation are removed and the text is split at space characters. An example of this would be converting the string "This is my string" to "This", "is", "my", "string".<br />
<br />
'''2. Filtering'''<br />
<br />
Filtering is a process by which unnecessary words or characters are removed. Often these include punctuation, prepositions, and conjugations. The resulting corpus then contains words with maximal importance in distinguishing between classes. This also includes removing words occurring very rarely, possibly of no significant relevance, as the the words with relatively high frequency may have little information to do classification.<br />
<br />
'''3. Lemmatization'''<br />
<br />
Lemmatization is the task that considers the morphological analysis of the words, i.e where the various inflected forms of a word are converted to a single form. However, unlike in stemming (see below), we must specify the part of speech (POS) of each word, i.e its intended meaning in the given sentence or document, which can prone to human error. For example, "geese" and "goose" have the same lemma "goose", as they have the same meaning.<br />
<br />
'''4. Stemming'''<br />
<br />
Stemming extracts the roots of words. It is a language dependent process. The goal of both stemming is to reduce inflectional and related (definition wise) forms of a word to a common base form. An example of this would be changing "am", "are", or "is" to "be".<br />
<br />
'''Vector Space Model'''<br />
In this section of the paper, the authors explore the different ways in which the text can be represented on a large collection of documents. One common way of representing the documents is in the form of a bag of words. The bag of words considers the occurrences of different terms.<br />
In different text mining applications, documents are ranked and represented as vectors so as to display the significance of any word. <br />
The authors note that the three basic models used are vector space, inference network, and the probabilistic models. The vector space model is used to represent documents by converting them into vectors. In the model, a variable is used to represent each model to indicate the importance of the word in the document. <br />
<br />
The weights have 2 main models used Boolean model and TF-IDF model: <br />
'''Boolean model'''<br />
terms are assigned with a positive wij if the term appears in the document. otherwise, it will be assigned a weight of 0. <br />
<br />
'''Term Frequency - inverse document frequency (TF-IDF)'''<br />
The words are weighted using the TF-IDF scheme computed as <br />
<br />
$$<br />
q(w)=f_d(w)*\log{\frac{|D|}{f_D(w)}}<br />
$$<br />
<br />
The frequency of each term is normalized by the inverse of document frequency, which helps distinct words with low frequency is recognized its importance. Each document is represented by a vector of term weights, <math>\omega(d) = (\omega(d, w_1), \omega(d,w_2),...,\omega(d,w_v))</math>. The similarity between two documents <math>d_1, d_2</math> is commonly measured by cosine similarity:<br />
$$<br />
S(d_1,d_2) = \cos(\theta) = \frac{d_1\cdot d_2}{\sum_{i=1}^vw^2_{1i}\cdot\sum_{i=1}^vw^2_{2i}}<br />
$$<br />
<br />
== Classification ==<br />
Classification in Text Mining aims to assign predefined classes to text documents. For a set <math>\mathcal{D} = {d_1, d_2, ... d_n}</math> of documents, each <math>d_i</math> is mapped to a label <math>l_i</math> from the set <math>\mathcal{L} = {l_1, l_2, ... l_k}</math>. The goal is to find a classification model <math>f</math> such that: <math>\\</math><br />
$$<br />
f: \mathcal{D} \rightarrow \mathcal{L} \quad \quad \quad f(\mathcal{d}) = \mathcal{l}<br />
$$<br />
The author illustrates 4 different classifiers that are commonly used in text mining.<br />
<br />
<br />
'''1. Naive Bayes Classifier''' <br />
<br />
Bayes rule is used to classify new examples and select the class that has the generated result that occurs most often. <br />
Naive Bayes Classifier models the distribution of documents in each class using a probabilistic model assuming that the distribution<br />
of different terms is independent of each other. The models commonly used in this classifier tried to find the posterior probability of a class based on the distribution and assumes that the documents generated are based on a mixture model parameterized by <math>\theta</math> and compute the likelihood of a document using the sum of probabilities over all mixture component. In addition, the Naive Bayes Classifier can help get around the curse of dimensionality, which may arise with high-dimensional data, such as text. <br />
<br />
'''2. Nearest Neighbour Classifier'''<br />
<br />
Nearest Neighbour Classifier uses distance-based measures to perform the classification. The documents which belong to the same class are more likely "similar" or close to each other based on the similarity measure. The classification of the test documents is inferred from the class labels of similar documents in the training set. K-Nearest Neighbor classification is well known to suffer from the "curse of dimensionality", as the proportional volume of each $d$-sphere surrounding each datapoint compared to the volume of the sample space shrinks exponentially in $d$. <br />
<br />
'''3. Decision Tree Classifier'''<br />
<br />
A hierarchical tree of the training instances, in which a condition on the attribute value is used to divide the data hierarchically. The decision tree recursively partitions the training data set into smaller subdivisions based on a set of tests defined at each node or branch. Each node of the tree is a test of some attribute of the training instance, and each branch descending from the node corresponds to one of the values of this attribute. The conditions on the nodes are commonly defined by the terms in the text documents.<br />
<br />
'''4. Support Vector Machines'''<br />
<br />
SVM is a form of Linear Classifiers which are models that makes a classification decision based on the value of the linear combinations of the documents features. The output of a linear predictor is defined to the <math> y=\vec{a} \cdot \vec{x} + b</math> where <math>\vec{x}</math> is the normalized document word frequency vector, <math>\vec{a}</math> is a vector of coefficient and <math>b</math> is a scalar. Support Vector Machines attempts to find a linear separators between various classes. An advantage of the SVM method is it is robust to high dimensionality.<br />
<br />
== Clustering ==<br />
Clustering has been extensively studied in the context of the text as it has a wide range of applications such as visualization and document organization.<br />
<br />
Clustering algorithms are used to group similar documents and thus aid in information retrieval. Text clustering can be in different levels of granularities, where clusters can be documents, paragraphs, sentences, or terms. Since text data has numerous distance characteristics that demand the design of text-specific algorithms for the task, using a binary vector to represent the text document is simply not enough. Here are some unique properties of text representation:<br />
<br />
1. Text representation has a large dimensionality, in which the size of the vocabulary from which the documents are drawn is massive, but a document might only contain a small number of words.<br />
<br />
2. The words in the documents are usually correlated with each other. Need to take the correlation into consideration when designing algorithms.<br />
<br />
3. The number of words differs from one another of the document. Thus the document needs to be normalized first before the clustering process.<br />
<br />
Three most commonly used text clustering algorithms are presented below.<br />
<br />
<br />
'''1. Hierarchical Clustering algorithms''' <br />
<br />
Hierarchical Clustering algorithms builds a group of clusters that can be depicted as a hierarchy of clusters. The hierarchy can be constructed in top-down (divisive) or bottom-up (agglomeration). Hierarchical clustering algorithms are one of the Distanced-based clustering algorithms, i.e., using a similarity function to measure the closeness between text documents.<br />
<br />
In the top-down approach, the algorithm begins with one cluster which includes all the documents. we recursively split this cluster into sub-clusters.<br />
Here is an example of a Hierarchical Clustering algorithm, the data is to be clustered by the euclidean distance. This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step determines which elements to merge in a cluster by taking the two closest elements, according to the chosen distance.<br />
<br />
<br />
[[File:418px-Hierarchical clustering simple diagram.svg.png| 300px | center]]<br />
<br />
<br />
<div align="center">Figure 1: Hierarchical Clustering Raw Data</div><br />
<br />
<br />
<br />
[[File:250px-Clusters.svg (1).png| 200px | center]]<br />
<br />
<br />
<div align="center">Figure 2: Hierarchical Clustering Clustered Data</div><br />
<br />
A main advantage of hierarchical clustering is that the algorithm only needs to be done once for any number of clusters (ie. if an individual wishes to use a different number of clusters than originally intended, they do not need to repeat the algorithm)<br />
<br />
'''2. k-means Clustering'''<br />
<br />
k-means clustering is a partitioning algorithm that partitions n documents in the context of text data into k clusters.<br />
<br />
Input: Document D, similarity measure S, number k of cluster<br />
Output: Set of k clusters<br />
Select randomly ''k'' datapoints as starting centroids<br />
While ''not converged'' do <br />
Assign documents to the centroids based on the closest similarity<br />
Calculate the cluster centroids for all clusters<br />
return ''k clusters''<br />
<br />
The main disadvantage of k-means clustering is that it is indeed very sensitive to the initial choice of the number of k. Also, since the function is run until clusters converges, k-means clustering tends to take longer to perform than hierarchical clustering. On the other hand, advantages of k-means clustering are that it is simple to implement, the algorithm scales well to large datasets, and the results are easily interpretable.<br />
<br />
<br />
'''3. Probabilistic Clustering and Topic Models'''<br />
<br />
Topic modeling is one of the most popular probabilistic clustering algorithms in recent studies. The main idea is to create a *probabilistic generative model* for the corpus of text documents. In topic models, documents are a mixture of topics, where each topic represents a probability distribution over words.<br />
<br />
There are two main topic models:<br />
* Probabilistic Latent Semantic Analysis (pLSA)<br />
* Latent Dirichlet Allocation (LDA)<br />
<br />
The paper covers LDA in more detail. LDA is a state-of-the-art unsupervised algorithm for extracting topics from a collection of documents.<br />
<br />
Given <math>\mathcal{D} = \{d_1, d_2, \cdots, d_{|\mathcal{D}|}\}</math> is the corpus and <math>\mathcal{V} = \{w_1, w_2, \cdots, w_{|\mathcal{V}|}\}</math> is the vocabulary of the corpus. <br />
<br />
A topic is <math>z_j, 1 \leq j \leq K</math> is a multinomial probability distribution over <math>|\mathcal{V}|</math> words. <br />
<br />
The distribution of words in a given document is:<br />
<br />
<math>p(w_i|d) = \Sigma_{j=1}^K p(w_i|z_j)p(z_j|d)</math><br />
<br />
The LDA assumes the following generative process for the corpus of <math>\mathcal{D}</math><br />
* For each topic <math>k\in \{1,2,\cdots, K\}</math>, sample a word distribution <math>\phi_k \sim Dir(\beta)</math><br />
* For each document <math>d \in \{1,2,\cdots,D\}</math><br />
** Sample a topic distribution <math>\theta_d \sim Dir(\alpha)</math><br />
** For each word <math>w_n, n \in \{1,2,\cdots,N\}</math> in document <math>d</math><br />
*** Sample a topic <math>z_i \sim Mult(\theta_d)</math><br />
*** Sample a word <math>w_n \sim Mult(\phi_{z_i})</math><br />
<br />
Inference and Parameter estimation can be done by computing the posterior of the hidden variables. <br />
<br />
$$P(\varphi_{1:L}, \theta_{1:D}, z_{1:D}|w_{1:D})= \dfrac{P(\varphi_{1:L}, \theta_{1:D}, z_{1:D}}{P(w_{1:D}}$$<br />
A Gibbs sampler may be used to approximate the posterior over the topic assignments for every word. <br />
<br />
$$P(z_i = k| w_i=w, z_{-i}, w_{-i}, \alpha, \beta) = \frac{n_{k, -i}^{(d)}+\alpha}{\sum_{k'=1}^K n_{k, -i}^{(d)}+Ka} \times \frac{n_{w, -i}^{(d)} + \beta}{\sum_{w'=1}^W n_{w, -i}^{(d)}+W\beta}$$<br />
In practice, LDA is often used as a module in more complicated models and has already been applied to a wide variety of domains. In addition, many variations of LDA has been created, including supervised LDA (sLDA) and hierarchical LDA (hLDA)<br />
<br />
== Information Extraction ==<br />
Information Extraction (IE) is the process of extracting useful, structured information from unstructured or semi-structured text. It automatically extracts based on our command. <br />
<br />
For example, from the sentence “XYZ company was founded by Peter in the year of 1950”, we can identify the two following relations:<br />
<br />
1. Founderof(Peter, XYZ)<br />
<br />
2. Foundedin(1950, XYZ)<br />
<br />
IE is a crucial step in data mining and has a broad variety of applications, such as web mining and natural language processing. Among all the IE tasks, two have become increasingly important, which are name entity recognition and relation extraction.<br />
<br />
The author mentioned 4 parts that are important for Information Extraction<br />
<br />
'''1. Named Entity Recognition(NER)'''<br />
<br />
This is the process of identifying real-world entity from free text, such as "Apple Inc.", "Donald Trump", "PlayStation 5" etc. Moreover, the task is to identify the category of these entities, such as "Apple Inc." is in the category of the company, "Donald Trump" is in the category of the USA president, and "PlayStation 5" is in the category of the entertainment system. <br />
<br />
'''2. Hidden Markov Model'''<br />
<br />
Since traditional probabilistic classification does not consider the predicted labels of neighbor words, we use the Hidden Markov Model when doing Information Extraction. This model is different because it considers that the label of one word depends on the previous words that appeared. The Hidden Markov model allows us to model the situation, given a sequence of labels <math> Y= (y_1, y_2, \cdots, y_n) </math>and sequence of observations <math> X= (x_1, x_2, \cdots, x_n) </math> we get<br />
<br />
<center><br />
<math><br />
y_i \sim p(y_i|y_{i-1}) \qquad x_i \sim p(x_i|x_{i-1})<br />
</math><br />
</center><br />
<br />
'''3. Conditional Random Fields'''<br />
<br />
This is a technique that is widely used in Information Extraction. The definition of it is related to graph theory. <br />
let G = (V, E) be a graph and Yv stands for the index of the vertices in G. Then (X, Y) is a conditional random field, when the random variables Yv, conditioned on X, obey Markov property with respect to the graph, and:<br />
<math>p(Y_v |X, Y_w ,w , v) = p(Y_v |X, Y_w ,w ∼ v)</math>, where w ∼ v means w and v are neighbors in G.<br />
<br />
'''4. Relation Extraction'''<br />
<br />
This is a task of finding semantic relationships between word entities in text documents, for example in a sentence such as "Seth Curry is the brother of Stephen Curry". If there is a document including these two names, the task is to identify the relationship of these two entities. There are currently numerous techniques to perform relation extraction, but the most common is to consider it a classification problem. The problem is structured as, given two entities in that occur in a sentence classify their relation into fixed relation types.<br />
<br />
== Biomedical Application ==<br />
<br />
Text mining has several applications in the domain of biomedical sciences. The explosion of academic literature in the field has made it quite hard for scientists to keep up with novel research. This is why text mining techniques are ever so important in making the knowledge digestible.<br />
<br />
The text mining techniques are able to extract meaningful information from large data by making use of biomedical ontology, which is a compilation of a common set of terms used in an area of knowledge. The Unified Medical Language System (UMLS) is the most comprehensive such resource, consisting of definitions of biomedical jargon. Several information extraction algorithms rely on the ontology to perform tasks such as Named Entity Recognition (NER) and Relation Extraction.<br />
<br />
NER involves locating and classifying biomedical entities into meaningful categories and assigning semantic representation to those entities. The NER methods can be broadly grouped into Dictionary-based, Rule-based, and Statistical approaches. NER tasks are challenging in the biomedical domain due to three key reasons: (1) There is a continuously growing volume of semantically related entities in the biomedical domain due to continuous scientific progress, so NER systems depend on dictionaries of terms which can never be complete; (2) There are often numerous names for the same concept in the biomedical domain, such as "heart attack" and "myocardial infarction"; and (3) Acronyms and abbreviations are frequently used which makes it complicated to identify the concepts these terms express. Note that Dictionary-based approaches are therefore reserved for the most advanced NER methods. <br />
<br />
Relation extraction, on the other hand, is the process of determining relationships between the entities. This is accomplished mainly by identifying the correlation between entities through analyzing the frequency of terms, as well as rules defined by domain experts. Moreover, modern algorithms are also able to summarize large documents and answer natural language questions posed by humans.<br />
<br />
Summarization is a common biomedical text mining task that largely utilizes information extraction tasks. The idea is the automatically identify significant aspects of documents and represent them in a coherent fashion. However, evaluating summarization methods becomes very difficult since deciding whether a summary is "good" is often subjective, although there are some automatic evaluation techniques for summaries such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares automatically generated summaries with those created by humans.<br />
<br />
== Conclusion ==<br />
<br />
This paper gave a holistic overview of the methods and applications of text mining, particularly its relevance in the biomedical domain. It highlights several popular algorithms and summarizes them along with their advantages, limitations and some potential situations where they could be used. Because of ever-growing data, for example, the very high volume of scientific literature being produced every year, the interest in this field is massive and is bound to grow in the future.<br />
<br />
== Critiques==<br />
<br />
This is a very detailed approach to introduce some different algorithms on text mining. Since many algorithms are given, it might be a good idea to compare their performances on text mining by training them on some text data and compare them to the former baselines, to see if there exists any improvement.<br />
<br />
it is a detailed summary of the techniques used in text mining. It would be more helpful if some dataset can be included for training and testing. The algorithms were grouped by different topics so that different datasets and measurements are required.<br />
<br />
It would be better for the paper to include test accuracy for testing and training sets to support text mining is a more efficient and effective algorithm compared to other techniques. Moreover, this paper mentioned Text Mining approach can be used to extract high-quality information from videos. It is to believe that extracting from videos is much more difficult than images and texts. How is it possible to retain its test accuracy at a good level for videos?<br />
<br />
Text mining can no only impact the organizational processes, but also the ability to be competitive. Some common examples of the applications are risk management, cybercrime prevention, customer care service and contextual advertising/<br />
<br />
Preprocessing an important step to analyze text, so it might be better to have the more details about that. For example, what types of words are usually removed and show we record the relative position of each word in the sentence. If one close related sentences were split into two sentences, how can we capture their relations?<br />
<br />
The authors could give more details on the applications of text mining in the healthcare and biomedical domain. For example, how could preprocessing, classification, clustering, and information extraction process be applied to this domain. Other than introduction of existing algorithms (e.g. NER), authors can provide more information about how they performs (with a sample dataset), what are their limitations, and comparisons among different algorithms.<br />
<br />
In the preprocessing section, it seems like the authors incorrectly describe what stemming is - stemming just removes the last few letters of a word (ex. studying -> study, studies -> studi). What the authors actually describe is lemmatization which is much more informative than stemming. The down side of lemmatization is that it takes more effort to build a lemmatizer than a stemmer and even once it is built it is slow in comparison with a stemmer.<br />
<br />
One of the challenges of text mining in the biomedical field is that a lot of patient data are still in the form of paper documents. Text mining can speed up the digitization of patient data and allow for the development of disease diagnosis algorithms. It'll be interesting to see how text mining can be integrated with healthcare AI such as the doppelganger algorithm to enhance question answering accuracy. (Cresswell et al, 2018)<br />
<br />
It might be helpful if the authors discuss more about the accuracy-wise performances of some text mining techniques, especially in the healthcare and biomedical domain, given the focus. It would be interesting if more information were provided about the level of accuracy needed in order to produce reliable and actionable information in such fields. Also, in these domains, sometimes a false negative could be more harmful than a false positive, such as a clinical misdiagnosis. It might be helpful to discuss a bit more about how to combats such issues in text mining.<br />
<br />
This is a survey paper that talks about many general aspects about text mining, without going into any specific one in detail. Overall it's interesting. My first feedback is on the "Information Retrieval" section of the paper. Hidden markov model is mentioned as one of the algorithms used for IR. Yet, hidden markov makes the strong assumption that given the current state, next state is independent of all the previous states. This is a very strong assumption to make in IR, as words in a sentence usually have a very strong connection to each other. This limitation should be discussed more extensively in the paper. Also, the overall structure of the paper seems to be a bit imbalanced. It solely talks about IR's application in biomedical sciences. Yet, IR has application in many different areas and subjects.<br />
<br />
This paper surveys through multiple methods and algorithms on test mining, more specifically, information extraction, test classification, and clustering. In the Information Extraction section, four possible methods are mentioned to deal with different examples of semantic texts. In the latest studies of machine learning, it is ubiquitous to see multiple methods or algorithms are combined together to achieve better performances. For a survey paper, it will be more interesting to see some connections between the four methods, and some insights such as how we can boost the accuracy of extracting precise information by combining 2 of the 4 methods together.<br />
<br />
It would be better discuss more applications and SoTA algorithms on each tasks. It just give an application in biomedical with NER, it is too simple.<br />
<br />
The summary is well-organized and gives enough information to first-time readers about text mining and different algorithms to model the data and predict using different classifiers. However, it would be better to add comparison between each classifier since the performance is important to know.<br />
<br />
This is a great informational summary, I don't have much critiques to give. But, I wanted to point out that many modern techniques ignore so many of these interesting data transformations and preprocessing steps, since the text in its raw form provides the most information for deep models to extract features from. Specifically, we can look at ULM-Fit (https://arxiv.org/abs/1801.06146) and BERT (https://arxiv.org/abs/1810.04805) and observe very little text preprocessing outside of tokenization, and simply allowing the model to learn the necessary features from a huge corpus.<br />
<br />
It might be better to explain more about Knowledge Discovery and Data Mining in the Introduction part, such as giving the definition and the comparison between them, so that the audience can understand text mining clearer.<br />
<br />
The paper and corresponding summary seems to be more breadth-focused and extremely high-level. I think this paper could've been taken a step further by including applications of the various algorithms. For example, the task of topic modelling which is highly customizable, and preprocessing which is dependent on the domain, can be accomplished using many approaches: transforming the processed texts to feature vectors using BERT or pre-trained Word2Vec; and then applying unsupervised learning methods such as LDA and clustering. Within the biomedical application mentioned, it might be of interest to look into BioBERT, which is trained on more domain-specific texts (https://koreauniv.pure.elsevier.com/en/publications/biobert-a-pre-trained-biomedical-language-representation-model-fo). <br />
<br />
The paper summary is great as it describes very important topic in today's world and how technology must adapt to the vast increase in data creation. Particularly, it is really cool to see how machine learning is used in a multi-disciplinary manner.<br />
<br />
Text-mining from this paper is described as a very compute-intensive process. Due to "a 50 times growth since 2010" to 2020, it would be nice to have a method of scaling the data in this domain to be more prepared for an even bigger growth of data in the next decade. Thus it would have been nice if the researchers included performance metrics (both computation and classification performances) of the system with different classifiers described in this paper. Lastly, it would be nice to see comparisons of ROUGE metrics for summarization that the researchers were able to achieve using the Text Mining technique they introduced.<br />
<br />
The topic is detailed. There are many factors that could affect the results of text mining since the writing habit in terms of vocabulary and the way a person construct his/her sentence could be quite different. Also, since now is the Internet era, there are always new words invented from the Internet, and some of them might be recorded in the dictionary as a casual vocabulary. A good machine learning model should be able to keep learning new words as human and correctly distinguish text through the literature structures like metaphors，slangs and folklore.<br />
<br />
This is a very detailed survey paper. It described many classical models related to text mining. However, it might be good to at least mention some new models for further reading to be a direction of readers who are new to this area.<br />
<br />
The paper does a great job of introducing new models/classification techniques. Aside from performance metrics, what is even more interesting is how well it scales, since data mining is very often done to deal with large, unfiltered data. The paper has very little mention to how well these algorithms scale, and no mention of the performance on large scale data. Perhaps a performance metric detailing exactly that is needed.<br />
<br />
The paper is introducing a lot of very common and starting points of machine learning algorithms that are used for text mining and went into a shallow depth of the mathematics behind each algorithm. However, it does not show any real concrete tests or example to demonstrate each's ability or fitness/effectiveness of each algorithm and thus leaving it a rather incomplete introduction. <br />
<br />
The paper was well-organized to keep everything with similar intuition together within one section of the paper, however, visualization of simple examples would have been helpful. For example, for SVM, a 2d illustration of margins and decision boundary would have worked far better than explaining the projection of vectors onto hyperplanes to find the maximum distance/margin.<br />
<br />
The paper is very math heavy, so it won't hurt to include some graphs to explain the model a little more. For instance, in the Probabilistic Clustering and Topic Models section, the LDA method was introduced. In the original paper, a LDA graphical model graph was given. It vividly shows the whole structure of the model and how each component interact with one another.<br />
<br />
== References ==<br />
<br />
[1] Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). A brief survey of text mining: Classification, clustering, and extraction techniques. arXiv preprint arXiv:1707.02919.<br />
<br />
[2] Cresswell, Kathrin & Cunningham-Burley, Sarah & Sheikh, Aziz. (2018). Healthcare robotics - a qualitative exploration of key challenges and future directions (Preprint). Journal of Medical Internet Research. 20. 10.2196/10410.</div>Y2872wanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_LearningAdversarial Fisher Vectors for Unsupervised Representation Learning2020-11-15T14:57:04Z<p>Jlavilez: /* Methodology */</p>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. Based on the original GAN paper, when the training is finished and Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting discriminator is nothing more than a tool to train the generator. Furthermore, the generator in traditional GAN models the data density in an implicit manner, while in some applications we need to have an explicit generative model of data. Recently, it has been shown that training an energy-based model (EBM) with a parameterised variational distribution is also a minimax game similar to the one in GAN. Although they are similar, an advantage of this EBM view is that unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Considering some remarks, the authors in this paper show that an energy-based model can be trained using a similar minimax formulation in GANs. After training the energy-based model, they use Fisher Score and Fisher Information (which are calculated based on derivative of the generative models w.r.t its parameters) to evaluate the power of discriminator in modeling the data distribution. More precisely, they calculate normalised Fisher Vectors and Fisher Distance measure using the discriminator's derivative to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). In fact, Fisher vector is a powerful representation that can be calculated using EBMs thanks to the fact that in this EBM model, the discriminator itself is an explicit density model of the data. Fisher vector can be used for setting representation problems which is a challenging task. In fact, as we will see, we can use the Fisher kernel to calculate the distance between two sets of images which is not a trivial task. The authors find several applications and attractive characteristics for AFV as pre-trained features such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimisation problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
Where <math> p_{data(\mathbf{x})} </math>, <math> D(x) </math>, and <math> G(x) </math> are distribution of data, discriminator, and generator respectively. To optimise the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math>, is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution which is called the variational distribution and <math>H(q) </math> defines its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL given <math>H(q) </math>. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns to the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where in the problem, the variational lower bound is maximised w.r.t. <math> p_{G}</math>; the energy model then is updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure1). [[File:Fig1.png|centre]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both taking the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularisation term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularisation methods such as batch normalisation are used).<br />
* The order of optimising <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As it was mentioned, one of the most important advantages of an EBM GAN compared with traditional ones is that discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way to evaluate the quality of the generator and inspect the quality of produced samples. However, when it comes to discriminator, this is not clear how to evaluate or use a discriminator trained in minimax scheme. To evaluate and also employ discriminator of the GAN, the authors in this paper propose to employ the theory of Fisher Information. This theory was proposed with the motivation of making connections between two different types of models used in machine learning i.e, generative and discriminator models. Given a density model <math> p_{\theta}(\mathbf{x})</math> where <math> \mathbf{x} \in R^d </math> and <math> \theta </math> are input and model parameters, the fisher score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, This gradient <math> U_\mathbf{x} </math> can be used to define the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, The Fisher<br />
Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim} p_{\theta}(\mathbf{x}) [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math>; <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called Fisher distance and easily can be generalised to measure distance between two sets. Finally, The adversarial Fisher Distance (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, Fisher Distance is equivalent to the Euclidean distance with AFVs. The fisher vector theory has been using simple generative models like gmms.<br />
In the domain of the EBMs, where the density model is parameterised as <math> p_\theta(\mathbf{x})= \frac{e^{-D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{-D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are parameters of <math> D</math>, the fisher score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN, the generator is updated during the training to match the distribution of <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling form generator's distribution which let us to compute the Fisher Information and Fisher Score in EBM GAN as follow:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using Fisher Score, we transform data from feature space to the parameter space which means that the dimensionality of the vectors can easily be up to millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating final AFV.<br />
<br />
=== On the Fisher metric ===<br />
<br />
The Fisher metric between two points <math>x,y \in \mathbb{R}^d</math> is defined as <math> d(x,y) = (U_x - Y_y)^T \mathcal{I}^{-1} (U_x - U_y) </math>; being a quadratic form, it is elementary to see that this distance is equivalent to the usual norm in <math>\mathbb{R}^d</math>.<br />
A notion of a Fisher distance between finite subsets <math>X,Y \subset \mathbb{R}^d</math> is: <br />
\begin{align}<br />
dist(X,Y) = \left( \frac{1}{|X|} \sum_{x \in X} U_x - \frac{1}{|Y|} \right)^T \mathcal{I}^{-1} \left( \frac{1}{|X|} \sum_{x \in X} U_x - \frac{1}{|Y|} \right)<br />
\end{align}<br />
<br />
Note, however, that this is not a metric. An alternative to extending the Fisher distance to compact subsets of <math>\mathbb{R}^d</math> is to use the Hausdorff metric from elementary real analysis. A discussion of the Hausdorff metric can be found on page 5 of Ken Davidson's Real Analysis notes: http://www.math.uwaterloo.ca/~krdavids/PM351/PMath351Notes.pdf. For completeness, given two compact subsets <math>K, L \subset \mathbb{R}^d</math> the Hausdorff metric induced by the Fisher distance is given by:<br />
\begin{align}<br />
d_H(K,L) = \max \left\lbrace \sup_{a \in K} d(a, L), \sup_{b \in L} d(b,K) \right\rbrace<br />
\end{align}<br />
<br />
It is readily seen that this is a complete metric on the metric space of compact subsets of <math>\mathbb{R}^d</math>. The analysis performed in the paper might be able to be extended to the choice of this metric.<br />
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, they provide a different treatment of G, borrowing inspirations from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBM's, which can be used to sample from an unnormalised density and approximate the partition function. Stochastic gradient MCMC is of particular interest as it uses the gradient of the log probability w.r.t. the input, and performs gradient ascent to incrementally update the samples(while adding noise to the gradients). See for a recent application of this technique to deepEBMs. We speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
<br />
== Related Work ==<br />
There are many variants of GAN method that use a discriminator as a critic to differentiate given distributions. Examples of such variants are Wasserstein GAN, f-GAN and MMD-GAN. There is a resemblance between the training procedure of GAN and deep EBM (with variational inference) but the work present in the paper is different as its discriminator directly learns the target distribution. The implementation of EBM presented in the paper directly learns the parametrised sampler. In some works, regularisation (by noise addition, penalising gradients, spectral normalisation) has been introduced to make GAN more stable. But these additions do not have any formal justification. This paper connects the MCMC based G update rule with the gradient penalty line of work. The following equation show how this method does not always sample from the generator but a small proportion (with probability p) of the samples come from real examples.<br />
<br />
<div align="center">[[File:related_work_equations.png]]</div><br />
<br />
Early works showed incorporation of Fisher Information to measure similarity and this was extended to use Fisher Vector representations in case of images. Recently, Fisher Information has been used for meta learning as well. This paper explores the possibility of using Fisher Information in deep learning generative models. By using the generator as a sampler, Fisher Information can be computed even from an unnormalised density model.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As it was pointed out, the main advantage of the EBM GANs is their powerful discriminator, which can learn a density function that characterises the data manifold of the training data. To evaluate how good the discriminator learns the data distribution, authors proposed to use Fisher Information theory. To do this, authors trained some models under different models and employed the discriminator to extract AFVs and then use these vectors for unsupervised pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-art performance in unsupervised pretraining classification tasks and comparable with the supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure distance between a set of data points. Authors took advantage of this point and calculate the semantic distance between classes (all data points of every class) in CIFAR 10. As shown in Figure 2, although the training has been unsupervised, the semantic relation between classes is well estimated. For example, in Figure 2 cars are similar to trucks, dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They use the AFV representation to calculate distances between image patches and compare with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculate AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. AFVs are 3 orders of magnitude higher than those of the existing methods, which would typically bring a higher propensity to overfitting. However, AFVs still show great generalisation ability, demonstrating that they are indeed encoding a meaningful low dimensional subspace of original data. Figure 6 shows the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task which is partly because of the lack of reliable metrics. Although recently some domain specific metrics such as Inception Scores and Fréchet Inception Distance have been proposed, they are mainly relied on a discriminative model trained on ImageNet, and thus have limited<br />
applicability to datasets that are drastically different. In this paper, authors the Fisher Distance between the set of real and generated examples to monitor and diagnose the training process. To do this, conducted a set of experiments on CIFAR10 by varying the number of training examples from the set {1000; 5000; 25000; 50000}. Figure 3 shows batch-wise estimate of Inception Score and the "Fisher Similarity". This is clear that for higher number of training examples, the validation Fisher Similarity steadily increases, in the similar trend to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
<br />
<br />
===Interpreting G update as parameterised MCMC===<br />
AFC can only be applied if a generator approximates EBM during the training process. Model is trained on Imagenet with 64X64 along with modification of default architecture with the addition of residual blocks to discriminator and generator. Following figure shows training stats over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. Authors also showed that in an EBM GAN, discriminator can explicitly learn data distribution and capture the intrinsic manifold of data with low error rate. This is especially different from regular GANs where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN<br />
training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantisation.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv link Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper has an excellent contribution in feature representation exploiting information theory and GANs. Although it lacked intuitive explanation of the defined formula and how this representation is performing well in classification tasks. Therefore, an "Analysis" section would help the paper to be more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>Shematihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Neural_ODEsNeural ODEs2020-11-14T19:56:36Z<p>M59jiang: /* Conclusions and Critiques */</p>
<hr />
<div>== Introduction ==<br />
Chen et al. propose a new class of neural networks called neural ordinary differential equations (ODEs) in their 2018 paper under the same title. Neural network models, such as residual or recurrent networks, can be generalized as a set of transformations through hidden states (a.k.a layers) <math>\mathbf{h}</math>, given by the equation <br />
<br />
<div style="text-align:center;"><math> \mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t,\theta_t) </math> (1) </div><br />
<br />
where <math>t \in \{0,...,T\}</math> and <math>\theta_t</math> corresponds to the set of parameters or weights in state <math>t</math>. It is important to note that it has been shown (Lu et al., 2017)(Haber<br />
and Ruthotto, 2017)(Ruthotto and Haber, 2018) that Equation 1 can be viewed as an Euler discretization. Given this Euler description, if the number of layers and step size between layers are taken to their limits, then Equation 1 can instead be described continuously in the form of the ODE, <br />
<br />
<div style="text-align:center;"><math> \frac{d\mathbf{h}(t)}{dt} = f(\mathbf{h}(t),t,\theta) </math> (2). </div><br />
<br />
Equation 2 now describes a network where the output layer <math>\mathbf{h}(T)</math> is generated by solving for the ODE at time <math>T</math>, given the initial value at <math>t=0</math>, where <math>\mathbf{h}(0)</math> is the input layer of the network. <br />
<br />
With a vast amount of theory and research in the field of solving ODEs numerically, there are a number of benefits to formulating the hidden state dynamics this way. One major advantage is that a continuous description of the network allows for the calculation of <math>f</math> at arbitrary intervals and locations. The authors provide an example in section five of how the neural ODE network outperforms the discretized version i.e. residual networks, by taking advantage of the continuity of <math>f</math>. A depiction of this distinction is shown in the figure below. <br />
<br />
<div style="text-align:center;"> [[File:NeuralODEs_Fig1.png|350px]] </div><br />
<br />
In section four the authors show that the single-unit bottleneck of normalizing flows can be overcome by constructing a new class of density models that incorporates the neural ODE network formulation.<br />
The next section on automatic differentiation will describe how utilizing ODE solvers allows for the calculation of gradients of the loss function without storing any of the hidden state information. This results in a very low memory requirement for neural ODE networks in comparison to traditional networks that rely on intermediate hidden state quantities for backpropagation.<br />
<br />
== Reverse-mode Automatic Differentiation of ODE Solutions ==<br />
Like most neural networks, optimizing the weight parameters <math>\theta</math> for a neural ODE network involves finding the gradient of a loss function with respect to those parameters. Differentiating in the forward direction is a simple task, however, this method is very computationally expensive and unstable, as it introduces additional numerical error. Instead, the authors suggest that the gradients can be calculated in the reverse-mode with the adjoint sensitivity method (Pontryagin et al., 1962). This "backpropagation" method solves an augmented version of the forward ODE problem but in reverse, which is something that all ODE solvers are capable of. Section 3 provides results showing that this method gives very desirable memory costs and numerical stability. <br />
<br />
The authors provide an example of the adjoint method by considering the minimization of the scalar-valued loss function <math>L</math>, which takes the solution of the ODE solver as its argument.<br />
<br />
<div style="text-align:center;">[[File:NeuralODEs_Eq1.png|700px]],</div> <br />
This minimization problem requires the calculation of <math>\frac{\partial L}{\partial \mathbf{z}(t_0)}</math> and <math>\frac{\partial L}{\partial \theta}</math>.<br />
<br />
The adjoint itself is defined as <math>\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}</math>, which describes the gradient of the loss with respect to the hidden state <math>\mathbf{z}(t)</math>. By taking the first derivative of the adjoint, another ODE arises in the form of,<br />
<br />
<div style="text-align:center;"><math>\frac{d \mathbf{a}(t)}{dt} = -\mathbf{a}(t)^T \frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \mathbf{z}}</math> (3).</div> <br />
<br />
Since the value <math>\mathbf{a}(t_0)</math> is required to minimize the loss, the ODE in equation 3 must be solved backwards in time from <math>\mathbf{a}(t_1)</math>. Solving this problem is dependent on the knowledge of the hidden state <math>\mathbf{z}(t)</math> for all <math>t</math>, which an neural ODE does not save on the forward pass. Luckily, both <math>\mathbf{a}(t)</math> and <math>\mathbf{z}(t)</math> can be calculated in reverse, at the same time, by setting up an augmented version of the dynamics and is shown in the final algorithm. Finally, the derivative <math>dL/d\theta</math> can be expressed in terms of the adjoint and the hidden state as, <br />
<br />
<div style="text-align:center;"><math> \frac{dL}{d\theta} -\int_{t_1}^{t_0} \mathbf{a}(t)^T\frac{\partial f(\mathbf{z}(t),t,\theta)}{\partial \theta}dt</math> (4).</div><br />
<br />
To obtain very inexpensive calculations of <math>\frac{\partial f}{\partial z}</math> and <math>\frac{\partial f}{\partial \theta}</math> in equation 3 and 4, automatic differentiation can be utilized. The authors present an algorithm to calculate the gradients of <math>L</math> and their dependent quantities with only one call to an ODE solver and is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Algorithm1.png|850px]]</div><br />
<br />
If the loss function has a stronger dependence on the hidden states for <math>t \neq t_0,t_1</math>, then Algorithm 1 can be modified to handle multiple calls to the ODESolve step since most ODE solvers have the capability to provide <math>z(t)</math> at arbitrary times. A visual depiction of this scenario is shown below. <br />
<br />
<div style="text-align:center;">[[File:NeuralODES Fig2.png|350px]]</div><br />
<br />
Please see the [https://arxiv.org/pdf/1806.07366.pdf#page=13 appendix] for extended versions of Algorithm 1 and detailed derivations of each equation in this section.<br />
<br />
== Replacing Residual Networks with ODEs for Supervised Learning ==<br />
Section three of the paper investigates an application of the reverse-mode differentiation described in section two, for the training of neural ODE networks on the MNIST digit data set. To solve for the forward pass in the neural ODE network, the following experiment used Adams-Moulton (AM) method, which is an implicit ODE solver. Although it has a marked improvement over explicit ODE solvers in numerical accuracy, integrating backward through the network for backpropagation is still not preferred and the adjoint sensitivity method is used to perform efficient weight optimization. The network with this "backpropagation" technique is referred to as ODE-Net in this section. <br />
<br />
=== Implementation ===<br />
A residual network (ResNet), studied by He et al. (2016), with six standard residual blocks was used as a comparative model for this experiment. The competing model, ODE-net, replaces the residual blocks of the ResNet with the AM solver. As a hybrid of the two models ResNet and ODE-net, a third network was created called RK-Net, which solves the weight optimization of the neural ODE network explicitly through backward Runge-Kutta integration. The following table shows the training and performance results of each network. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Table1.png|400px]]</div><br />
<br />
Note that <math>L</math> and <math>\tilde{L}</math> are the number of layers in ResNet and the number of function calls that the AM method makes for the two ODE networks and are effectively analogous quantities. As shown in Table 1, both of the ODE networks achieve comparable performance to that of the ResNet with a notable decrease in memory cost for ODE-net.<br />
<br />
<br />
Another interesting component of ODE networks is the ability to control the tolerance in the ODE solver used and subsequently the numerical error in the solution. <br />
<br />
<div style="text-align:center;">[[File:NeuralODEs Fig3.png|700px]]</div><br />
<br />
The tolerance of the ODE solver is represented by the color bar in Figure 3 above and notice that a variety of effects arise from adjusting this parameter. Primarily, if one was to treat the tolerance as a hyperparameter of sorts, you could tune it such that you find a balance between accuracy (Figure 3a) and computational complexity (Figure 3b). Figure 3c also provides further evidence for the benefits of the adjoint method for the backward pass in ODE-nets since there is a nearly 1:0.5 ratio of forward to backward function calls. In the ResNet and RK-Net examples, this ratio is 1:1.<br />
<br />
Additionally, the authors loosely define the concept of depth in a neural ODE network by referring to Figure 3d. Here it's evident that as you continue to train an ODE network, the number of function evaluations the ODE solver performs increases. As previously mentioned, this quantity is comparable to the network depth of a discretized network. However, as the authors note, this result should be seen as the progression of the network's complexity over training epochs, which is something we expect to increase over time.<br />
<br />
== Continuous Normalizing Flows ==<br />
<br />
Section four tackles the implementation of continuous-depth Neural Networks, but to do so, in the first part of section four the authors discuss theoretically how to establish this kind of network through the use of normalizing flows. The authors use a change of variables method presented in other works (Rezende and Mohamed, 2015), (Dinh et al., 2014), to compute the change of a probability distribution if sample points are transformed through a bijective function, <math>f</math>.<br />
<br />
<div style="text-align:center;"><math>z_1=f(z_0) \Rightarrow \log(p(z_1))=\log(p(z_0))-\log|\det\frac{\partial f}{\partial z_0}|</math></div><br />
<br />
Where p(z) is the probability distribution of the samples and <math>det\frac{\partial f}{\partial z_0}</math> is the determinant of the Jacobian which has a cubic cost in the dimension of '''z''' or the number of hidden units in the network. The authors discovered however that transforming the discrete set of hidden layers in the normalizing flow network to continuous transformations simplifies the computations significantly, due primarily to the following theorem:<br />
<br />
'''''Theorem 1:''' (Instantaneous Change of Variables). Let z(t) be a finite continuous random variable with probability p(z(t)) dependent on time. Let dz/dt=f(z(t),t) be a differential equation describing a continuous-in-time transformation of z(t). Assuming that f is uniformly Lipschitz continuous in z and continuous in t, then the change in log probability also follows a differential equation:''<br />
<br />
<div style="text-align:center;"><math>\frac{\partial \log(p(z(t)))}{\partial t}=-tr\left(\frac{df}{dz(t)}\right)</math></div><br />
<br />
The biggest advantage of using this theorem is that the trace function is a linear function, so if the dynamics of the problem, f, is represented by a sum of functions, then so is the log density. This essentially means that you can now compute flow models with only a linear cost with respect to the number of hidden units, <math>M</math>. In standard normalizing flow models, the cost is <math>O(M^3)</math>, so they will generally fit many layers with a single hidden unit in each layer.<br />
<br />
Finally the authors use these realizations to construct Continuous Normalizing Flow networks (CNFs) by specifying the parameters of the flow as a function of ''t'', ie, <math>f(z(t),t)</math>. They also use a gating mechanism for each hidden unit, <math>\frac{dz}{dt}=\sum_n \sigma_n(t)f_n(z)</math> where <math>\sigma_n(t)\in (0,1)</math> is a separate neural network which learns when to apply each dynamic <math>f_n</math>.<br />
<br />
===Implementation===<br />
<br />
The authors construct two separate types of neural networks to compare against each other, the first neural network is the standard planar Normalizing Flow network (NF) with 64 layers of single hidden units, and the second neural network is their new CNF with 64 hidden units. The NF model was trained over 500,000 iterations using RMSprop, and the CNF network was trained over 10,000 iterations using Adam(algorithm for first-order gradient-based optimization of stochastic objective functions). The loss function is <math>KL(q(x)||p(x))</math> where <math>q(x)</math> is the flow model and <math>p(x)</math> is the target probability density.<br />
<br />
One of the biggest advantages when implementing CNF is that you can train the flow parameters just by performing maximum likelihood estimation on <math>\log(q(x))</math> given <math>p(x)</math>, where <math>q(x)</math> is found via the theorem above, and then reversing the CNF to generate random samples from <math>q(x)</math>. This reversal of the CNF is done with about the same cost of the forward pass which is not able to be done in an NF network. The following two figures demonstrate the ability of CNF to generate more expressive and accurate output data as compared to standard NF networks.<br />
<br />
<div style="text-align:center;"><br />
[[Image:CNFcomparisons.png]]<br />
<br />
[[Image:CNFtransitions.png]]<br />
</div><br />
<br />
Figure 4 clearly shows that the CNF structure exhibits significantly lower loss functions than NF. In figure 5 both networks were tasked with transforming a standard Gaussian distribution into a target distribution, not only was the CNF network more accurate on the two moons target, but also the steps it took along the way were much more intuitive than the output from NF.<br />
<br />
== A Generative Latent Function Time-Series Model ==<br />
<br />
One of the largest issues at play in terms of Neural ODE networks is the fact that in many instances, data points are either very sparsely distributed, or irregularly-sampled. The latent dynamics are discretized and the observations are in the bins of fixed duration. This creates issues with missing data and ill-defined latent variables. An example of this is medical records which are only updated when a patient visits a doctor or the hospital. To solve this issue the authors had to create a generative time-series model which would be able to fill in the gaps of missing data. The authors consider each time series as a latent trajectory stemming from the initial local state <math>z_{t_0 }</math> and determined from a global set of latent parameters. Given a set of observation times and initial state, the generative model constructs points via the following sample procedure:<br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_0}∼p(z_{t_0}) <br />
</math><br />
</div> <br />
<br />
<div style="text-align:center;"><br />
<math><br />
z_{t_1},z_{t_2},\dots,z_{t_N}={\rm ODESolve}(z_{t_0},f,θ_f,t_0,...,t_N)<br />
</math><br />
</div><br />
<br />
<div style="text-align:center;"><br />
each <br />
<math><br />
x_{t_i}∼p(x│z_{t_i},θ_x)<br />
</math><br />
</div><br />
<br />
<math>f</math> is a function which outputs the gradient <math>\frac{\partial z(t)}{\partial t}=f(z(t),θ_f)</math> which is parameterized via a neural net. In order to train this latent variable model, the authors had to first encode their given data and observation times using an RNN encoder, construct the new points using the trained parameters, then decode the points back into the original space. The following figure describes this process:<br />
<br />
<div style="text-align:center;"><br />
[[Image:EncodingFigure.png]]<br />
</div><br />
<br />
Another variable which could affect the latent state of a time-series model is how often an event actually occurs. The authors solved this by parameterizing the rate of events in terms of a Poisson process. They described the set of independent observation times in an interval <math>\left[t_{start},t_{end}\right]</math> as:<br />
<br />
<div style="text-align:center;"> <br />
<math><br />
{\rm log}(p(t_1,t_2,\dots,t_N ))=\sum_{i=1}^N{\rm log}(\lambda(z(t_i)))-\int_{t_{start}}^{t_{end}}λ(z(t))dt<br />
</math><br />
</div><br />
<br />
where <math>\lambda(*)</math> is parameterized via another neural network.<br />
<br />
===Implementation===<br />
<br />
To test the effectiveness of the Latent time-series ODE model (LODE), they fit the encoder with 25 hidden units, parametrize function f with a one-layer 20 hidden unit network, and the decoder as another neural network with 20 hidden units. They compare this against a standard recurrent neural net (RNN) with 25 hidden units trained to minimize Gaussian log-likelihood. The authors tested both of these network systems on a dataset of 2-dimensional spirals which either rotated clockwise or counter-clockwise and sampled the positions of each spiral at 100 equally spaced time steps. They can then simulate irregularly timed data by taking random amounts of points without replacement from each spiral. The next two figures show the outcome of these experiments:<br />
<br />
<div style="text-align:center;"><br />
[[Image:LODEtestresults.png]] [[Image:SpiralFigure.png|The blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model]]<br />
</div><br />
<br />
In the figure on the right the blue lines represent the test data learned curves and the red lines represent the extrapolated curves predicted by each model. It is noted that the LODE performs significantly better than the standard RNN model, especially on smaller sets of data points.<br />
<br />
== Scope and Limitations ==<br />
<br />
This part mainly discusses the scope and limitations of the paper. Firstly, while "batching" the training data is a useful step in standard neural nets and can still be applied here by combining the ODEs associated with each batch, the authors found that controlling the error, in this case, may increase the number of calculations required. In practice, however, the number of calculations did not increase significantly.<br />
<br />
So long as the model proposed in this paper uses finite weights and Lipschitz nonlinearities, Picard's existence theorem (Coddington and Levinson, 1955) applies, which guarantees that the solution to the IVP exists and is unique. This theorem holds for the model presented above when the network has finite weights and uses nonlinearities in the Lipshitz class.<br />
<br />
In controlling the error amount in the model, the authors could only reduce tolerances to approximately 10−3 and 10−5 in classification and density estimation, respectively, without also degrading the computational performance.<br />
<br />
The authors believe that reconstructing state trajectories by running the dynamics backward can introduce extra numerical error. They address a possible solution to this problem by checkpointing specific time steps and storing intermediate values of z on the forward pass. Then while reconstructing, it does each part individually between checkpoints. The authors acknowledged that they informally checked this method's validity since they do not consider it a practical problem.<br />
<br />
There remain, however, areas where standard neural networks may perform better than Neural ODEs. Firstly, conventional nets can fit non-homeomorphic functions. Examples of non-homeomorphic functions are functions whose output has a smaller dimension than their input or that change the input space's topology. However, this could be handled by composing ODE nets with standard network layers. In addition, conventional nets that can be evaluated precisely with a fixed amount of computation are typically faster to train. Also, they do not require an error tolerance for a solver.<br />
<br />
== Conclusions and Critiques ==<br />
<br />
We covered the use of black-box ODE solvers as a model component and their application to initial value problems constructed from real applications. We developed a new model based on the black box for time series modeling, supervised learning, and density estimation. These models are evaluated and adaptive and allow explicit control of the trade-off between computational speed and accuracy. Finally, we exported an instant version of the variable change formula and developed a continuous-time normalized stream scaled to a larger layer size. Neural ODE Networks show promising gains in computational cost without large sacrifices in accuracy when applied to certain problems. A drawback of some of these implementations is that the ODE Neural Networks are limited by the underlying distributions of the problems they are trying to solve (requirement of Lipschitz continuity, etc.). There are plenty of further advances to be made in this field as hundreds of years of ODE theory and literature is available, so this is currently an important area of research.<br />
<br />
ODEs indeed represent an important area of applied mathematics where neural networks can be used to solve them numerically. Perhaps, a parallel area of investigation can be PDEs (Partial Differential Equations). PDEs are also widely encountered in many areas of applied mathematics, physics, social sciences, and many other fields. It will be interesting to see how neural networks can be used to solve PDEs.<br />
<br />
== More Critiques ==<br />
Table 1 shows a comparison between different implementations which is very helpful. We can see from the table that the 1-Layer MLP has the largest test error and the one with the best performance should be ODE-Net. Although it doesn't have the lowest test error (the test error of ODE-Net is 0.42% and the lowest test error is 0.41% for ResNet), it still has the least number of parameters, memory, and time. This convinced us that it can be widely used in other applications. <br />
<br />
For the last paragraph in the scope and limitations section, I guess the author wants to use the word "than" instead of using "that" in the sentence "for example, functions whose output has a smaller dimension that their input, or that change the topology of the input space."<br />
<br />
This paper covers the memory efficiency of Neural ODE Networks, but does not address runtime. In practice, most systems are bound by latency requirements more-so than memory requirements (except in edge device cases). Though it may be unreasonable to expect the authors to produce a performance-optimized implementation, it would be insightful to understand the computational bottlenecks so existing frameworks can take steps to address them. This model looks promising and practical performance is the key to enabling future research in this.<br />
<br />
The above critique also questions the need for a neural network for such a problem. This problem was studied by Brunel et al. and they presented their solution in their paper ''Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions''. While this solution also requires iteratively solving a complex optimization problem, they did not require the massive memory and runtime overhead of a neural network. For the neural network solution to demonstrate its potential, it should be including experimental comparisons with specialized ordinary differential equation algorithms instead of simply comparing with a general recurrent neural network.<br />
<br />
Table 2 shows that potential ODEs have lower predicted RMSE, and more relevant information should be provided. For example, the reason of setting the n to 30/50/100 can be simply described. And it is good to talk more about the performance of Latent ODE since the change of n does not have much impact on its RMSE.<br />
<br />
The author should have discussed memory efficiency and the complexity of computation of ODE and make comparisons with other algorithms.<br />
<br />
== References ==<br />
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. ''arXiv preprint arXiv'':1710.10121, 2017.<br />
<br />
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. ''Inverse Problems'', 34 (1):014004, 2017.<br />
<br />
Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. ''arXiv preprint arXiv'':1804.04272, 2018.<br />
<br />
Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. ''The mathematical theory of optimal processes''. 1962.<br />
<br />
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ''European conference on computer vision'', pages 630–645. Springer, 2016b.<br />
<br />
Earl A Coddington and Norman Levinson. ''Theory of ordinary differential equations''. Tata McGrawHill Education, 1955.<br />
<br />
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ''arXiv preprint arXiv:1505.05770'', 2015.<br />
<br />
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. ''arXiv preprint arXiv:1410.8516'', 2014.<br />
<br />
Brunel, N. J., Clairon, Q., & d’Alché-Buc, F. (2014). Parametric estimation of ordinary differential equations with orthogonality conditions. ''Journal of the American Statistical Association'', 109(505), 173-185.<br />
<br />
A. Iserles. A first course in the numerical analysis of differential equations. Cambridge Texts in Applied Mathematics, second edition, 2009</div>Jchalatuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=IPBoostIPBoost2020-11-14T05:36:36Z<p>Wq3zhao: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Casey De Vera, Solaiman Jawad<br />
<br />
== Introduction == <br />
Boosting is an important and a fairly standard technique in a classification that combines several base learners (learners which have a “low accuracy”), into one boosted learner (learner with a “high accuracy”). Pioneered by the AdaBoost approach of Freund & Schapire, in recent decades there has been extensive work on boosting procedures and analyses of their limitations.<br />
<br />
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:<br />
<br />
Let <math>T</math> represents the number of iteration and <math>t</math> represents the iterator. <br />
<br />
for <math> t= 1, \cdots, T </math> do the following:<br />
<br />
::1. Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math><br />
<br />
::2. Evaluate the performance of <math> \mu_t</math> by computing its loss.<br />
<br />
::3. Push weight of the data distribution <math> \mathcal D_t</math> towards the misclassified examples leading to <math> \mathcal D_{t+1}</math>, usually a relatively good model will have a higher weight on those data that misclassified.<br />
<br />
Finally, the learners are combined with some form of voting (e.g., soft or hard voting, averaging, thresholding).<br />
<br />
<br />
[[File:boosting.gif|200px|thumb|right]] A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function using coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. However, they can be trounced by a small number of incorrect labels, which can be quite challenging to fix. We try to solve misclassified examples by moving the weights around, which results in bad performance on unseen data (Shown by zoomed example to the right). In fact, in theory, providing the class of base learners is rich enough, a perfect strong learner can be constructed with accuracy 1; however, clearly, such a learner might not necessarily generalize well. Boosted learners can generate some quite complicated decision boundaries, much more complicated than the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line, and XGBoost was used to learn a classifier:<br />
<br />
<br />
Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper, the non-convex boosting in classification using integer programming has been explored, and the real-world practicability of the approach while circumventing shortcomings of convex boosting approaches is also shown. The paper reports results that are comparable to or better than current state-of-the-art approaches.<br />
<br />
== Motivation ==<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when the writer trained a model with excessive amounts of noisy labels, the performance and accuracy deteriorate significantly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, the model cannot correctly classify these examples, and the procedure shifts more and more weight towards these bad examples. It eventually leads to a strong learner that accurately predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS], who construct a "hard" training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in these boosted learners' performance; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can use all types of "tricks," such as early stopping, but this is not solving the fundamental problem at the end of the day.<br />
<br />
== IPBoost: Boosting via Integer Programming ==<br />
<br />
<br />
===Integer Program Formulation===<br />
Let <math>(x_1,y_1),\cdots, (x_N,y_N) </math> be the training set with points <math>x_i \in \mathbb{R}^d</math> and two-class labels <math>y_i \in \{\pm 1\}</math> <br />
* class of base learners: <math> \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} </math> and <math>\rho \ge 0</math> be given. <br />
* error function <math> \eta </math><br />
Our boosting model is captured by the integer programming problem. We can call this our primal problem: <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i=1,\cdots, N \\ <br />
&\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
For the error function <math>\eta</math>, three options were considered:<br />
<br />
(i) <math> \pm 1 </math> classification from learners<br />
$$ \eta_{ij} := 2\mathbb{I}[h_j(x_i) = y_i] - 1 = y_i \cdot h_j(x_i) $$<br />
(ii) class probabilities of learners<br />
$$ \eta_{ij} := 2\mathbb{P}[h_j(x_i) = y_i] - 1$$<br />
(iii) SAMME.R error function for learners<br />
$$ \frac{1}{2}y_i\log\left(\frac{\mathbb{P}[h_j(x_i) = 1]}{\mathbb{P}[h_j(x_i) = -1]}\right)$$<br />
<br />
===Solution of the IP using Column Generation===<br />
<br />
The goal of column generation is to provide an efficient way to solve the primal's linear programming relaxation by allowing the zi variables to assume fractional values. Moreover, columns, i.e., the base learners, <math> \mathcal L \subseteq [L]. </math> . are left out because there are too many to handle efficiently, and most of them will have their associated weight equal to zero in the optimal solution. A branch and bound framework is used To generate columns. Columns are created within a branch-and-bound framework leading effectively to a branch-and-bound-and-price algorithm being used; this is significantly more involved compared to column generation in linear programming. To check the optimality of an LP solution, a subproblem, called the pricing problem, is solved to identify columns with a profitable reduced cost. If such columns are found, the LP is re-optimized. Branching occurs when no profitable columns are found, but the LP solution does not satisfy the integrality conditions. Branch and price apply column generation at every node of the branch and bound tree.<br />
<br />
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. Noisy labels are an issue because when the writer trained a model with excessive amounts of noisy labels, the performance and accuracy deteriorate significantly. However, if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case, the model cannot correctly classify these examples, and the procedure shifts more and more weight towards these bad examples. It eventually leads to a strong learner that accurately predicts the (flawed) training data; however, that does not generalize well anymore. This intuition has been formalized by [LS], who construct a "hard" training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in these boosted learners' performance; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can use all types of "tricks," such as early stopping, but this is not solving the fundamental problem at the end of the day.<br />
<br />
<br />
<br />
The restricted master primal problem is <br />
<br />
$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j\in \mathcal L} \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \ \ \ <br />
\forall i \in [N]\\ <br />
&\sum_{j\in \mathcal L}\lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$<br />
<br />
<br />
Its restricted dual problem is:<br />
<br />
$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i<br />
\\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ <br />
&(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$<br />
<br />
Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. With this pricing problem, we check whether the solution to the restricted dual is feasible. This pricing problem can be expressed as follows:<br />
<br />
$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$<br />
<br />
The optimal misclassification values are determined by a branch-and-price process that branches on the variables <math> z_i </math> and solves the intermediate LPs using column generation.<br />
<br />
===Algorithm===<br />
<div style="margin-left: 3em;"><br />
<math> D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} </math>, class of base learners <math>Ω </math>, margin <math> \rho </math> <br><br />
'''Output:''' Boosted learner <math> \sum_{j∈L^∗}h_jλ_j^* </math> with base learners <math> h_j </math> and weights <math> λ_j^* </math> <br><br />
<br />
<ol><br />
<br />
<li margin-left:30px> <math> T ← \{([0, 1]^N, \emptyset)\} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // set of local bounds and learners for open subproblems </li><br />
<li> <math> U ← \infty, L^∗ ← \emptyset </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Upper bound on optimal objective </li><br />
<li> '''while''' <math>\ T \neq \emptyset </math> '''do''' </li><br />
<li> &emsp; Choose and remove <math>(B,L) </math> from <math>T </math> </li><br />
<li> &emsp; '''repeat''' </li><br />
<li> &emsp; &emsp; Solve the primal IP using the local bounds on <math> z </math> in <math>B</math> with optimal dual solution <math> (w^∗, v^∗, u^∗) </math> </li><br />
<li> &emsp; &emsp; Find learner <math> h_j ∈ Ω </math> satisfying the pricing problem. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Solve pricing problem. </li><br />
<li> &emsp; '''until''' <math> h_j </math> is not found </li> <br />
<li> &emsp; Let <math> (\widetilde{λ} , \widetilde{z}) </math> be the final solution of the primal IP with base learners <math> \widetilde{L} = \{j | \widetilde{λ}_j > 0\} </math> </li><br />
<li> &emsp; '''if''' <math> \widetilde{z} ∈ \mathbb{Z}^N </math> and <math> \sum^{N}_{i=1}\widetilde{z}_i < U </math> '''then''' </li><br />
<li> &emsp; &emsp; <math> U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} </math> &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Update best solution. </li><br />
<li> &emsp; '''else''' </li><br />
<li> &emsp; &emsp; Choose <math> i ∈ [N] </math> with <math> \widetilde{z}_i \notin Z </math> </li><br />
<li> &emsp; &emsp; Set <math> B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} </math> </li><br />
<li> &emsp; &emsp; Add <math> (B_0,\widetilde{L}), (B_1,\widetilde{L}) </math> to <math> T </math>. &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; // Create new branching nodes. </li><br />
<li> &emsp; '''end''' if </li><br />
<li> '''end''' while </li><br />
<li> ''Optionally sparsify final solution <math>L^*</math>'' </li><br />
<br />
</ol><br />
</div><br />
<br />
===Sparsification===<br />
<br />
When building the boosting model, one of the challenges is to balance model accuracy and generalization. A sparse model is generally easier to interpret. In order to control complexity, overfitting, and generalization of the model, typically some sparsity is applied, which is easier to interpret for some applications. The goal of sparsification is to minimize the following: \begin{equation}\sum_{i=1}^{N}z_i+\sum_{i=1}^{L}\alpha_iy_i\end{equation} There are two techniques commonly used. One is early stopping, another is regularization by adding a complexity term for the learners in the objective function. Previous approaches in the context of LP-based boosting have promoted sparsity by means of cutting planes. Sparsification can be handled in the approach introduced in this paper by solving a delayed integer program using additional cutting planes.<br />
<br />
== Results and Performance ==<br />
<br />
''All tests were run on identical Linux clusters with Intel Xeon Quad Core CPUs, with 3.50GHz, 10 MB cache, and 32 GB of main memory.''<br />
<br />
<br />
The following results reflect IPBoost's performance in hard instances. Note that by hard instances, we mean a binary classification problem with predefined labels. These examples are tailored to using the ±1 classification from learners. On every hard instance sample, IPBoost significantly outperforms both LPBoost and AdaBoost (although implementations depending on the libraries used have often caused results to differ slightly). For the considered instances the best value for the margin ρ was 0.05 for LPBoost and IPBoost; AdaBoost has no margin parameter. The accuracy reported is test accuracy recorded across various different walkthroughs of the algorithm, while <math>L </math> denotes the aggregate number of learners required to find the optimal learner, N is the number of points and <math> \gamma </math> refers to the noise level.<br />
<br />
[[File:ipboostres.png|center]]<br />
<br />
<br />
<br />
For the next table, the classification instances from LIBSVM data sets available at [https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/]. We report accuracies on the test set and train set, respectively. In each case, we report the averages of the accuracies over 10 runs with a different random seed and their standard deviations. We can see IPboost again outperforming LPBoost and AdaBoost significantly. Solving Integer Programming problems is no doubt more computationally expensive than traditional boosting methods like AdaBoost. The average run time of IPBoost (for ρ = 0.05) being 1367.78 seconds, as opposed to LPBoost's 164.35 seconds and AdaBoost's 3.59 seconds reflects exactly that. However, on the flip side, we gain much better stability in our results, as well as higher scores across the board for both training and test sets.<br />
<br />
[[file:svmlibres.png|center]]<br />
<br />
<br />
== Conclusion ==<br />
<br />
IP-boosting avoids the bad performance on well-known hard classes and improves upon LP-boosting and AdaBoost on the LIBSVM instances where even a few percent improvements are valuable. The major drawback is that the run time with the current implementation is much longer. Nevertheless, the algorithm can be improved in the future by approximating the intermediate LPs and deriving the tailored heuristics that generate decent primal solutions to save time.<br />
<br />
The approach is suited very well to an offline setting in which training may take time and where even a small improvement is beneficial or when convex boosters have egregious behaviour. It can also be served as a tool to investigate the general performance of methods like this.<br />
<br />
The IPBoost algorithm added extra complexity into basic boosting models to gain slight accuracy gain while greatly increased the time spent. It is 381 times slower compared to an AdaBoost model on a small dataset which makes the actual usage of this model doubtable. If we are supplied with a larger dataset with millions of records this model would take too long to complete. The base classifier choice was XGBoost which is too complicated for a base classifier, maybe try some weaker learners such as tree stumps to compare the result with other models. In addition, this model might not be accurate compared to the model-ensembling technique where each model utilizes a different algorithm.<br />
<br />
I think IP Boost can be helpful in cases where a small improvement in accuracy is worth the additional computational effort involved. For example, there are applications in finance where predicting the movement of stock prices slightly more accurately can help traders make millions more. In such cases, investing in expensive computing systems to implement such boosting techniques can perhaps be justified.<br />
<br />
Although the IPBoost can generate better results compared to the base boosting models such as AdaBoost. The time complexity of IPBoost is much higher than other models. Hence, it will be a problem to apply IPBoost to the real-world training data since those training data will be large and it's hard to deal with. Thus, we are left with the question that this IPBoost is only a little bit more accurate than the base boosting models but a lot more time complex. So is the model really worth it?<br />
<br />
== Critique ==<br />
<br />
The introduction section should change the sentence"3. Push weight of the data ... will have a higher weight on those data that misclassified." to "... will have a higher weight on those misclassified data.<br />
<br />
The column generation method is used when there are many variables compared to the number of constraints. If there are only a few variables compared to the number of limitations, using this method will not be very efficient.It would be better if the efficiency of different boosting models is tested.<br />
<br />
In practice, AdaBoost & LP Boost are quite robust to overfitting. In a way, if you use simple weak learns stumps (1-level decision trees), then the algorithms are much less prone to overfitting. However, the noise level in the data, particularly for AdaBoost, is prone to overfitting noisy datasets. To avoid this use, regularized models (AdaBoostReg, LPBoost). Further, when in higher dimensional spaces, AdaBoost can suffer since it's just a linear combination of classifiers. You can use k-fold cross-validation to set the stopping parameter to improve this.<br />
<br />
== References ==<br />
<br />
* Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.<br />
<br />
* Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg. pdf</div>Cgkdeverhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_AnsweringDense Passage Retrieval for Open-Domain Question Answering2020-11-14T04:09:39Z<p>P2torabi: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
Open-domain question answering is a task that finds answers to questions from a large collection of documents. Nowadays, open-domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear (high recall), and then in stage two, a reader would read the subset and locate the answer spans (high precision). Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end question answering (QA) accuracies. An interesting area of research that has expanded in the domain of QA is to the ability of a model to also assign a confidence score for its answer to the question.<br />
<br />
= Background =<br />
The following example clearly shows what problems open-domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader. <br />
<br />
=== Mathematical Formulation ===<br />
Let's assume that we have a collection of <math> D </math> documents <math>d_1, d_2, \cdots, d_D </math> where each <math>d</math> is split into text passages of equal lengths as the basic retrieval units. Then the total number of passages is <math> M </math> in the corpus <math>\mathcal{C} = \{p_1, p_2, \ldots, p_{M}\} </math>. Each passage <math> p_i </math> is composed of a sequence of tokens <math> w^{(i)}_1, w^{(i)}_2, \cdots, w^{(i)}_{|p_i|} </math>. The task is to find a span of tokens <math> w^{(i)}_s, w^{(i)}_{s+1}, \cdots, w^{(i)}_{e} </math> from one of the passages <math> p_i </math> to answer the given question <math>q_i</math> . The corpus can contain millions of documents just to cover a wide variety of domains (e.g. Wikipedia) or even billions of them (e.g. the Web).<br />
The open-domain QA systems define the retriever as a function <math> R:(q,\mathcal{C}) \rightarrow \mathcal{C_F} </math> that takes as input a question <math>q</math> and a corpus <math>\mathcal{C}</math> and then returns a much smaller filter set of texts <math>\mathcal{C_F} \subset \mathcal{C}</math> , where <math>|\mathcal{C_F}| = k \ll |\mathcal{C}|</math>. The retriever can be evaluated by '''top-k retrieval accuracy''' that represents the fraction of relevant answers contained in <math>\mathcal{C_F}</math>.<br />
<br />
= Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model that encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. The authors used the following similarity metric between the question and the passage:<br />
\begin{equation}<br />
sim(q,p) = E_Q(q)^TE_P(p)<br />
\end{equation}<br />
<br />
Of course, this is not the only possible metric; any similarity metric that is decomposable is sufficient. Many of these are some sort of transformation of the Euclidean distance. The authors experimented with other similarity metrics but found comparable results so they decided to use the simplest to focus on learning better encoders.<br />
<br />
== Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768. This can be thought of as an elegant way to take the last hidden layer's weights from BERT as a representation of the input.<br />
During the inference time, they used FAISS (Johnson et al., 2017) for the similarity search, which can be efficiently applied on billions of dense vectors<br />
<br />
== Training ==<br />
The goal in training the encoders is to create a vector space where the relevant pairs of question and passages (positive passages) have a smaller distance than the irrelevant ones (negative passages) which is a metric learning problem.<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log-likelihood of the positive passages.<br />
\begin{eqnarray}<br />
&& L(q_i, p^+_i, p^-_{i,1}, \cdots, p^-_{i,n}) \label{eq:training} \\<br />
&=& -\log \frac{ e^{\mathrm{sim}(q_i, p_i^+)} }{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^n{e^{\mathrm{sim}(q_i, p^-_{i,j})}}}. \nonumber<br />
\end{eqnarray}<br />
While positive passage selection is simple, where the passage containing the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query <math>q_i</math>, positive passage <math>p_i</math>) in a mini-batch, then the negative passages for query <math>q_i</math> are the passages <math>p_i</math> when <math>j</math> is not equal to <math>i</math>.<br />
<br />
= Experimental Setup =<br />
The authors pre-processed Wikipedia documents, i.e. they removed semi-structured data, such as tables, info- boxes, lists, as well as the disambiguation pages, and then split each document into passages of length 100 words. These passages form a candidate pool. The five QA datasets described below were used to build the train data. <br />
# '''Natural Questions (NQ)''' (Kwiatkowski et al., 2019): This was designed for the end-end question-answering tasks where the questions are Google search queries and the answers are annotated texts from Wikipedia.<br />
# '''TriviaQA''' (Joshi et al., 2017): It is a collection of trivia questions and answers scraped from the Web.<br />
# '''WebQuestions (WQ)''' (Berant et al., 2013): It consists of questions selected using Google Search API where the answers are entities in Freebase<br />
# '''CuratedTREC (TREC)''' (Baudisˇ and Sˇedivy`, 2015): It is intended for the open-domain QA from unstructured corpora. It consists of TREC QA tracks and questions from other Web sources.<br />
# '''SQuAD v1.1''' (Rajpurkar et al., 2016): It is a popular benchmark dataset for reading comprehension. The annotators were given a paragraph from Wikipedia and were supposed to write questions which answers could be found in the given text.<br />
To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px | center]]<br />
<br />
= Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== Main Results ==<br />
<br />
The table below compares the top-k (for k=20 or k=100) accuracies of different retrieval systems on various popular QA datasets. Top-k accuracy is the percentage of examples in the test set for which the correct outcome occurs in the k most likely outcomes as predicted by the network. As can be seen, DPR consistently outperforms BM25 on all datasets, with the exception of SQuAD. Additionally, DPR tends to perform particularly well when a smaller k value is chosen. The authors speculated that the lower performance of DPR on SQuAD was for two reasons inherent to the dataset itself. First, the SQuAD dataset has a high lexical overlap between passages and questions. Second, the dataset is biased because it is collected from a small number of Wikipedia articles - a point which has been argued by other researchers as well.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
== Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) sample efficiency, (2) in-batch negative training, (3) impact of gold passages, (4) similarity and loss, and (5) cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px | center ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have a strong impact on the model's performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called an in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are highly ranked by BM25 but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px | center]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't appreciably improve model performance.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on the Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss.<br />
<br />
== Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method that ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally, DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a single server. Conversely, building an inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px | center]]<br />
[[File: CaptureDense.PNG| 600px | center]]<br />
<br />
=== Reader for End-to-End ===<br />
<br />
The reader would score the subset of <math>k</math> passages that were retrieved using DPR. Next, the reader would extract an answer span from each passage and assign a spam score. The best span from the passage with the highest selection score is chosen as the final answer. To calculate these scores linear layers followed by a softmax are applied on top of the BERT output as follows:<br />
<br />
<math>\textbf{P}_{start,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (1)<br />
<br />
<math>\textbf{P}_{end,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (2)<br />
<br />
<math>\textbf{P}_{selected}= softmax(\hat{\textbf{P}}^T_i \textbf{w}_{selected}) \in \mathbb{R}^k </math> where <math> k </math> is the number of passages. (3)<br />
<br />
Here <math> \textbf{P}_i \in \mathbb{R}^{L \times h}</math> used in (1) and (2), <math>i</math> is one of the <math> k </math> passages and <math> h </math> is the hidden layer dimension of BERT's output. Additionally, <math> \hat{\textbf{P}} = [\textbf{P}^{[CLS]}_1,...,\textbf{P}^{[CLS]}_k] \in \mathbb{R}^{h \times k}</math>. Here <math> \textbf{w}_{start},\textbf{w}_{end},\textbf{w}_{selected} </math> are learnable vectors. The span score for <math> s </math>-th starting word to the <math> t</math>-th ending word of the <math> i </math>-th passage then can be calculated as <math> P_{start,i}(s) \times P_{end,i}(t) </math> where <math> s,t </math> are used to select the corresponding index of <math> \textbf{P}_{start,i}, \textbf{P}_{end,i} </math> respectively. The passage selection score for the <math> i</math>-th passage can be computed similarly, by indexing the <math>i</math>-th element of <math>\textbf{P}_{selected}</math>.<br />
<br />
During training, from the top 100 passages, one positive and m-1 negative passages. The hyperparameter m is set to 24 for all experiments. The marginal log-likelihood is maximized as the training objective. A batch size of 16 is used for large (NQ, TriviaQA, SQuAD) and 4 for small (TREC, WQ) datasets.<br />
<br />
<br />
<br />
<br />
<br />
= Source Code =<br />
<br />
<br />
The official code for this paper is freely available at <span class="plainlinks">[https://github.com/facebookresearch/DPR "Facebook research's official repository"]</span><br />
<br />
[https://github.com/huggingface/transformers "Hugging Face"] is also a popular and standard python package for NLP tasks.<br />
<br />
= Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of the pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships but are relatively weak at capturing exact keyword matches. This paper also covers a few training techniques that are required to successfully train a dense retriever. The dense retriever performs better than a powerful Lucene-BM25 system with a significant margin and creates advanced criteria for various open-domain Question-Answer tasks. Overall, learning dense representations for first stage retrieval can potentially improve QA system performance, and has received more attention.<br />
<br />
= Critiques =<br />
<br />
The bag of words method is an old out-dated retriever approach. The paper could have exploited results from other state-of-the-art algorithms (e.g Golden retriever) and compared the proposed method with them. Other modern methods such as ElasticSearch are also not considered for retrieval.<br />
<br />
Since SQuAD is usually the de facto standard for comparing question-answer tasks (being the largest and most commonly used dataset), the fact that DPR does not outperform on SQuAD should raise suspicion.<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>X46yanhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Superhuman_AI_for_Multiplayer_PokerSuperhuman AI for Multiplayer Poker2020-11-13T18:30:36Z<p>S32lim: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty<br />
<br />
== Introduction ==<br />
<br />
A super-intelligence is a hypothetical agent that possesses intelligence surpassing that of the brightest and most gifted human minds. In the past two decades, most of the superhuman AI that have been built were only able to beat human players in two-player zero-sum games, where the sum of the gains and losses always equals zero. They almost dominated most of the board games in these twenty years. The most popular AI in the board games are the chess AI deep blue and the go chess AI Alpha-go. The most common strategy that the AI uses to beat those games is to find the most optimal Nash equilibrium. A Nash equilibrium is a pair of strategies such that either single-player switching to any ''other'' choice of strategy (while the other player's strategy remains unchanged) will result in a lower payout for the switching player. Intuitively this is similar to a locally optimal strategy for the players but is (i) not guaranteed to exist and (ii) may not be the truly optimal strategy. An example of this is the Prisoner's dilemma, where two individuals each have the option to testify against the other or to remain silent. Although the optimal choice is to remain silent, the individuals have an incentive to act in their own self-interest which results in a less than optimal outcome.<br />
<br />
More specifically, in the game of poker, we only have AI models that can beat human players in two-player settings. This means that developing a superhuman AI in multiplayer poker is the remaining great milestone in this field, because there is no polynomial-time algorithm that can find a Nash equilibrium in two-player non-zero-sum games. The discovery of such an algorithm would have surprising and profound implications in computational complexity theory.<br />
<br />
In this paper, the AI which we call Pluribus is capable of defeating human professional poker players in Texas hold'em poker which is a six-player poker game and is the most commonly played format in the world. The algorithm that is used is not guaranteed to converge to a Nash algorithm outside of two-player zero-sum games. However, it uses a strong strategy that is capable of consistently defeating elite human professionals. This shows that despite not having strong theoretical guarantees on performance, they are capable of applying a wider class of superhuman strategies.<br />
<br />
Knowing the basics of Texas hold'em poker may help to understand how the AI can play. Texas hold'em is played using a standard deck of 52 cards. For each hand, players are each dealt two cards, which only they can see. Once players receive their two cards, and prior to turning any other cards, players (in clockwise order) have the option to call, raise, fold, or check. Call means to match the highest bet so far, raise means to increase the bet, fold means the player wishes to leave the current hand, and check means to continue without betting any further. A player can only check if no one has bet yet this round or if they have already matched the highest bet. Once this round of betting is over three cards, called the Flop, are turned over, and the betting happens in a clockwise manner again. This is followed by revealing another card, the Turn, and then proceeding with another round of betting. Now, the final card, the River, is turned. There is one more round of betting before the players reveal their cards and determine the winner. The player with the best five-card hand (out of the 7 cards) wins the hand and the money in the pot. The following is a list of the Texas hold'em hands from best to worst:<br />
* Royal Flush (A,K,Q,J,10 all of the same suit)<br />
* Straight Flush(5 in a row from the same suit)<br />
* 4 of a kind<br />
* Full house (3 of a kind + a pair)<br />
* Flush (5 cards of the same suit)<br />
* Straight (5 cards in a row)<br />
* 3 of a kind<br />
* 2 different pairs<br />
* A pair<br />
* Highest card<br />
<br />
== Nash Equilibrium in Multiplayer Games ==<br />
<br />
Many AI has reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. Nash equilibrium has been proven to exist in all finite games and numerous infinite games but the challenge is to find the equilibrium. It is the best possible strategy and is unbeatable in two-player zero-sum games since it guarantees to not lose in expectation regardless of what the opponent is doing.<br />
<br />
To have a deeper understanding of Nash Equilibria, we must first define some basic game theory concepts. The first one being a strategic game, in-game theory a strategic game consists of a set of players, for each player a set of actions and for each player preferences (or payoffs) over the set of action profiles (set of combination of actions). With these three elements, we can model a wide variety of situations. Now a Nash Equilibrium is an action profile, with the property that no player can do better by changing their action, given that all other players' actions remain the same. A common illustration of Nash equilibria is the Prisoner's Dilemma. We also have mixed strategies and mixed strategy Nash equilibria. A mixed strategy is when instead of a player choosing an action they apply a probability distribution to their set of actions and pick randomly. Note that with mixed strategies we must look at the expected payoff of the player given the other players' strategies. Therefore a mixed strategy Nash Equilibria involves at least one player playing with a mixed strategy where no player can increase their expected payoff by changing their action, given that all other players' actions remain the same. Then we can define a pure Nash Equilibria to where no one is playing a mixed strategy. We also must be aware that a single game can have multiple pure Nash equilibria and mixed Nash equilibria. Also, Nash Equilibria are purely theoretical and depend on players acting optimally and being rational, this is not always the case with humans and we can act very irrationally. Therefore empirically we will see that games can have very unexpected outcomes and you may be able to get a better payoff if you move away from a strictly theoretical strategy and take advantage of your opponent's irrational behavior. <br />
<br />
The insufficiency with current AI systems is that they only try to achieve Nash equilibriums instead of trying to actively detect and exploit weaknesses in opponents. At the Nash equilibrium, there is no incentive for any player to change their initial strategy, so it is a stable state of the system. For example, let's consider the game of Rock-Paper-Scissors, the Nash equilibrium is to randomly pick any option with equal probability. However, we can see that this means the best strategy that the opponent can have will result in a tie. Therefore, in this example, our player cannot win in expectation. Now let's try to combine the Nash equilibrium strategy and opponent exploitation. We can initially use the Nash equilibrium strategy and then change our strategy over time to exploit the observed weaknesses of our opponent. For example, we switch to always play Rock against our opponent who always plays Scissors. However, shifting away from the Nash equilibrium strategy opens up the possibility for our opponent to use our strategy against ourselves. For example, they notice we always play Rock and thus they will now always play Paper.<br />
<br />
Trying to approximate a Nash equilibrium is hard in theory, and in games with more than two players, it can only find a handful of possible strategies per player. Currently, existing techniques to find ways to exploit an opponent require way too many samples and are not competitive enough outside of small games. Finding a Nash equilibrium in three or more players is a great challenge. Even we can efficiently compute a Nash equilibrium in games with more than two players, it is still highly questionable if playing the Nash equilibrium strategy is a good choice. Additionally, if each player tries to find their own version of a Nash equilibrium, we could have infinitely many strategies and each player’s version of the equilibrium might not even be a Nash equilibrium.<br />
<br />
Consider the Lemonade Stand example from Figure 1 Below. We have 4 players and the goal for each player is to find a spot in the ring that is furthest away from every other player. This way, each lemonade stand can cover as much selling region as possible and generate maximum revenue. In the left circle, we have three different Nash equilibria distinguished by different colors which would benefit everyone. The right circle is an illustration of what would happen if each player decides to calculate their own Nash equilibrium.<br />
<br />
[[File:Lemonade_Example.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Lemonade Stand Example</div><br />
<br />
From the right circle in Figure 1, we can see that when each player tries to calculate their own Nash equilibria independently, the joint strategy can hardly lead to equally-spaced players along the ring, which is not a Nash equilibrium. This shows that attempting to find a Nash equilibrium is not the best strategy outside of two-player zero-sum games, and our goal should not be focused on finding a specific game-theoretic solution. Instead, we need to focus on observations and empirical results that consistently defeat human opponents.<br />
<br />
== Theoretical Analysis ==<br />
Pluribus uses forms of abstraction to make computations scalable. To simplify the complexity due to too many decision points, some actions are eliminated from consideration and similar decision points are grouped together and treated as identical. This process is called abstraction. Pluribus uses two kinds of abstraction: '''Action abstraction and information abstraction'''. Action abstraction reduces the number of different actions the AI needs to consider. For instance, it does not consider all bet sizes (the exact number of bets it considers varies between 1 and 14 depending on the situation). Information abstraction groups together decision points that reveal similar information. For instance, the player’s cards and revealed board cards. This is only used to reason about situations on future betting rounds, never the current betting round.<br />
<br />
Pluribus uses a built-in strategy - '''“Blueprint strategy”''', which it gradually improves by searching in real-time in situations it finds itself in during the course of the game. In the first betting round, pluribus uses the initial blueprint strategy when the number of decision points is small. The blueprint strategy is computed using Monte Carlo Counterfactual Regret Minimization (MCCFR) algorithm. CFR is commonly used in imperfect information games AI which is trained by repeatedly playing against copies of itself, without any data of human or prior AI play used as input. For ease of computation of CFR in this context, poker is represented as a game tree. A game tree is a tree structure where each node represents either a player’s decision, a chance event, or a terminal outcome and edges represent actions taken. <br />
<br />
[[File:Screen_Shot_2020-11-17_at_11.57.00_PM.png| 600px |center ]]<br />
<br />
<div align="center">Figure 1: Kuhn Poker (Simpler form of Poker) </div><br />
<br />
At the start of each iteration, MCCFR stimulates a hand of poker randomly (Cards held by a player at a given time) and designates one player as the traverser of the game tree. Once that is completed, the AI reviews the decision made by the traverser at a decision point in the game and investigates whether the decision was profitable. The AI compares its decision with other actions available to the traverser at that point and also with the future hypothetical decisions that would have been made following the other available actions. To evaluate a decision, the Counterfactual Regret factor is used. This is the difference between what the traverser would have expected to receive for choosing an action and actually received on the iteration. Thus regret is a numeric value, where a positive regret indicates you regret your decision, a negative regret indicates you are happy with your decision and zero regret indicates that you are indifferent.<br />
<br />
The value of counterfactual regret for a decision is adjusted over the iterations as more scenarios or decision points are encountered. This means at the end of each iteration, the traverser’s strategy is updated so actions with higher counterfactual regret are chosen with higher probability. CFR minimizes regret over many iterations until the average strategy overall iterations converge and the average strategy is the approximated Nash equilibrium. CFR guarantees in all finite games that all counterfactual regrets grow sublinearly in the number of iterations. Pluribus uses Linear CFR in early iterations to reduce the influence of initial bad iterations i.e it assigns a weight of T to regret contributions at iteration T. This causes the influence of the first iteration to decay at a rate of <math>\frac{1}{\sum_{t=1}^Tt} = \frac{2}{T(T+1)}</math>, compared to a rate of <math>\frac{1}{T}</math> in the original CFR algorithm. This leads to the strategy of improving more quickly in practice.<br />
<br />
An additional feature of Pluribus is that in the sub games, instead of assuming that all players play according to a single strategy, Pluribus considers that each player may choose between k different strategies specialized to each player when a decision point is reached. This results in the searcher choosing a more balanced strategy. For instance, if a player never bluffs while holding the best possible hand, then the opponents would learn that fact and always fold in that scenario. To fold in that scenario is a more balanced strategy than to bet.<br />
Therefore, the blueprint strategy is produced offline for the entire game and it is gradually improved while making real-time decisions during the game.<br />
<br />
== Experimental Results ==<br />
To test how well Pluribus functions, it was tested against human players in 2 formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have won more than $1M playing professionally and were provided with cash incentives to play their best. 10,000 hands of poker were played over 12 days with the 5H+1AI format. Players were anonymized with aliases that remained consistent throughout all their games. The aliases helped the players keep track of the tendencies and types of games played by each player over the 10,000 hands. <br />
<br />
The second format included one human player and 5 copies of Pluribus (1H+5AI). There were 2 more professional players who split another 10,000 hands of poker by playing 5000 hands each and followed the same aliasing process as the first format.<br />
The performance was measured using milli big blinds per game, mbb/game, (i.e. the initial amount of money the second player has to put in the pot) which is the standard measure in the AI field. Additionally, AIVAT was used as the variance reduction technique to control for luck in the games. Significance tests were run at a 95% significance level with one-tailed t-tests to check the profitability of Pluribus's performance.<br />
<br />
Applying AIVAT the following were the results:<br />
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"<br />
! scope="col" | Format !! scope="col" | Average mbb/game !! scope="col" | Standard Error in mbb/game !! scope="col" | P-value of being profitable <br />
|-<br />
! scope="row" | 5H+1AI <br />
| 48 || 25 || 0.028 <br />
|-<br />
! scope="row" | 1H+5AI <br />
| 32 || 15 || 0.014<br />
|}<br />
[[File:top.PNG| 950px | x450px |left]]<br />
<br />
<br />
<div align="center"><br />
<br />
"Figure 3. Performance of Pluribus in the 5 humans + 1 AI experiment. The dots show Pluribus's performance at the end of each day of play. (Top) The lines show the win rate (solid line) plus or minus the standard error (dashed lines). (Bottom) The lines show the cumulative number of mbbs won (solid line) plus or minus the standard error (dashed lines). The relatively steady performance of Pluribus over the course of the 10,000-hand experiment also suggests that the humans were unable to find exploitable weaknesses in the bot."</div> <br />
<br />
Optimal play in Pluribus looks different from well-known poker conventions: A standard convention of “limping” in poker (calling the 'big blind' rather than folding or raising) is confirmed to be not optimal by Pluribus since it initially experimented with it but eliminated this from its strategy over its games of self-play. On the other hand, another convention of “donk betting” (starting a round by betting when someone else ended the previous round with a call) that is dismissed by players was adopted by Pluribus much more often than played by humans and is proven to be profitable.<br />
<br />
== Discussion and Critiques ==<br />
<br />
Pluribus' Blueprint strategy and Abstraction methods effectively reduce the computational power required. Hence it was computed in 8 days and required less than 512 GB of memory, and costs about $144 to produce. This is in sharp contrast to all the other recent superhuman AI milestones for games. This is a great way the researchers have condensed down the problem to fit the current computational powers. <br />
<br />
Pluribus definitely shows that we can capture observational data and empirical results to construct a superhuman AI without requiring theoretical guarantees, this can be a baseline for future AI inventions and help in the research of AI. It would be interesting to use Pluribus's way of using a non-theoretical approach in more real-life problems such as autonomous driving or stock market trading.<br />
<br />
Extending this idea beyond two-player zero-sum games will have many applications in real life. For example, the emergence of strategies used by the superhuman AI that were seen as "non-optimal" to humans, poses interesting questions about applying such an AI to fields such as business, economics and financial markets.<br />
<br />
It should be noted that there's one interesting fact that even though a player knows all possible outcomes in a Nash Equilibrium game, he/she would still not be beneficial for changing strategies, which means that the player will still choose to keep his/her original strategy.<br />
<br />
The summary for Superhuman AI for Multiplayer Poker is very well written, with a detailed explanation of the concept, steps, and result and with a combination of visual images. However, it seems that the experiment of the study is not well designed. For example, sample selection is not strict and well defined, this could cause selection bias introduced into the result and thus making it not generalizable.<br />
<br />
Superhuman AI, while sounding superior, is actually not uncommon. There have been many endeavours on mastering poker such as the Recursive Belief-based Learning (ReBeL) by Facebook Research. They pursued a method of reinforcement learning on a partially observable Markov decision process which was inspired by the recent successes of AlphaZero. For Pluribus to demonstrate how effective it is compared to the state-of-the-art, it should run some experiments against ReBeL.<br />
<br />
This is a very interesting topic, and this summary is clear enough for readers to understand. I think this application not only can apply in poker, maybe thinking of more applications in other areas? There are many famous AI that really changing our life. For example, AlphaGo and AlphaStar, which are developed by Google DeepMind, defeated professional gamers. Discussing more this will be interesting.<br />
<br />
One of the biggest issues when applying AI to games against humans (when not all information is known, ie, opponents' cards) is the assumption is generally made that the human players are rational players which follow a certain set of "rules" based on the information that they know. This could be an issue with the fact that Pluribus has trained itself by playing itself instead of humans. While the results clearly show that Pluribus has found some kind of 'optimal' method to play, it would be interesting to see if it could actually maximize its profits by learning the trends of its human opponents over time (learning on the fly with information gained each hand while it's playing). In addition to that, the paper may discuss how human action could be changed in the game when they play with Superhuman AI. We can see that playing card games require various strategy and different people can have a different set of actions in the same game and in the same situation.<br />
<br />
One interesting software called Piosolver leverages a similar tree-based algorithm presented in the paper to recommend the move that is deemed game theory optimal (GTO). In the poker world, GTO is a play-style that is based on mathematics and is considered a "defensive" strategy. Following the rock, paper, scissors analogy from the paper, a GTO play-style is synonymous with choosing randomly between the three options, whereas an exploitative strategy involves reading a human player's tendencies and adjusting the strategy accordingly. Piosolver is used by many professional poker players to enhance their game and gain intuition on what the best move is in certain situations.<br />
<br />
Another way to train the proposed model can be a poker game with two or more AI players. That method was used by AlphaGo to train a better model. <br />
<br />
Games with various AI players would be an interesting topic, and through comparing different AI, their shortcomes could be observed and improved. More discussions on this would be of interests.<br />
<br />
Similar to Pluribis, another [https://science.sciencemag.org/content/356/6337/508 paper] discussed a different AI program, called DeepStack, which also has defeated professional poker players at a 2-player Texas hold'em variant. However, instead of finding the Nash Equilibria, DeepStack uses recursive reasoning, decomposition, and a form of intuition that is automatically learned from self-play.<br />
<br />
AI can calculate the exact odds of a specific card or set of cards dropped on the flop. The difficulty is that in poker, the player cannot see the opponent's hand. This kind of hidden information also brings other complications (such as fraud), which is more challenging for AI at the game. At the same time, using AI to train each other is also an interesting topic.<br />
<br />
In inspiration from the critique about fraud, could this AI be used to detect cheating, such as illicitly obtaining information from an opponents hand, in games between human players? For instance by comparing the decision of the humans versus that of the AI's given the same information. Or are the strategies between human players and this AI too different to make conclusions about suspicious humans plays?<br />
<br />
I think AI can possibly calculate the probability of winning each round based on the hand it gets, and modify that probability while replicating the experiments. This may help building the model more accuratly and efficiently.<br />
<br />
The summary is overall well-defined, but authors should explain more on the algorithm parts such that specify how this model perform in each step to achieve the optimal solution. Also, there is an issue such that we must have an assumption before applying this model. The assumption is human need to follow the "rules". What if this model is against with human which does not follow the "rules" then how this model will learn from it? This is a question to contemplate.<br />
<br />
This is an interesting topic discussing AI players in the game. It would be attractive if the summary can provide a brief comparison about the advantages and shortages of the AI players nowadays.<br />
<br />
It is interesting to see how AI has some strategy playing against people in Texas poker. However, this game also relies on how people behave in their body language and the way they talk. This paper focuses more on theoretical side but not 100% true in real life.<br />
<br />
There's still some aspects of strategies in poker that still aren't explicitly present in this type of AI system. However, this does appear to show something that is far more advanced that most contemporary AI in most online poker systems.<br />
<br />
This is a interesting topic. It would be interesting to know how this can be applied to applications other than games like poker.<br />
<br />
It would be better if the introduction part can be trimmed. It will be interesting to know how this superhuman AI deal with simple non-zero-sum games and how the algorithm would adjust.<br />
<br />
This is a very interesting topic, and as mentioned above, this topic can expand far beyond the topic of multiplayer poker. In the end, the paper mentions that some common strategies by human players are dismissed and some strategies that are considered weak are being adopted by the AI. It would be interesting to see how one can embed this into the process of human decision making in the future, so that we can re-evaluate our decisions with the assistance from trained AI agents.<br />
<br />
This is an interesting topic that can be applied to read outside of just playing game, such as in the field of ecnomics.<br />
<br />
== Conclusion ==<br />
<br />
As Pluribus’s strategy was not developed with any human data and was trained only by self-play, it is an unbiased and a different perspective on how optimal play can be attained.<br />
Developing a superhuman AI for multiplayer poker is a widely recognized milestone in this area and the major remaining milestone in computer poker.<br />
Pluribus’s success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.<br />
<br />
== References ==<br />
<br />
Noam Brown and Tuomas Sandholm (July 11, 2019). Superhuman AI for multiplayer poker. Science 365.<br />
<br />
Osborne, Martin J.; Rubinstein, Ariel (July 12, 1994). A Course in Game Theory. Cambridge, MA: MIT. p. 14.<br />
<br />
Justin Sermeno. (November 17, 2020). Vanilla Counterfactual Regret Minimization for Engineers. https://justinsermeno.com/posts/cfr/#:~:text=Counterfactual%20regret%20minimization%20%28CFR%29%20is%20an%20algorithm%20that,decision.%20It%20can%20be%20positive%2C%20negative%2C%20or%20zero<br />
<br />
Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33.</div>Hhalimhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equationsPhysics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-13T16:45:27Z<p>Jlavilez: Commented on related work</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularisation techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavour to analyese, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimisation of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee the robustness of convergence in neural network training. In essence, the accompanying PDE model can be used as a regularisation agent, constraining the space of acceptable solutions to help the optimisation converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the Spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimisation is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimisation, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularises the optimisation, allowing for the network to learn from a smaller number of data points than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full-time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regulariser. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimising the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimiser.<br />
<br />
==== L-BFGS optimiser [5] ====<br />
<br />
L-BFGS is an optimisation algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimise f(x) over unconstrained values of the real-vector x where f is a differentiable scalar function. L-BFGS stores only fewer vectors that represent the approximation to the inverse hessian implicitly in comparison to the original BFGS.<br />
<br />
=== Results ===<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_xx + u_yy) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_xx + v_yy)<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimise the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimisation can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and makes a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why the optimiser is not suffering from being trapped at local optima for the parameters of the differential operators ? Can weight initialisation and data normalisation be improved? Why these methods seem to be very robust to noise in data? How can uncertainty in predictions be interpreted which hints us to the concept of interpretable AI. The answers to these questions can be next steps for this research direction.<br />
<br />
In this paper, a Quasi-Newton optimiser has been used to update parameters. Although they are more powerful that second order optimisers, however, due to their computational load, they are not the common choice in today's deep learning packages. Considering this, do the first order optimisers handle updating the weights in such a problem? Or they may get stuck in local minima? There is no such experiment in the paper.<br />
<br />
==Related Work==<br />
<br />
Finding (weak) solutions to differential equations using neural networks which take the differential operator as a loss function is not a new idea; the novelty arises from combining classical numerical methods from DE theory with the previous work in solving these differential equations.<br />
<br />
A natural question to ask is what happens in higher dimensions with regards to the curse of dimensionality. It is known that this becomes a big issue with numerical schemes such as finite element methods. A recent paper tackles this question in relation to high-dimensional versions of the Black-Scholes, Hamilton-Jacobi-Bellman, and Allen-Cahn equations [6], which can be found here: https://arxiv.org/abs/1707.02568.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that uses existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs<br />
<br />
[5] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimisation." Mathematical programming 45.1-3 (1989): 503-528.<br />
<br />
[6] Han, J., Jentzen, A., & Weinan, E. (2018). Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505-8510.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_servicesAdacompress: Adaptive compression for online computer vision services2020-11-13T08:19:17Z<p>P2torabi: </p>
<hr />
<div><br />
== Presented by == <br />
Ahmed Hussein Salamah<br />
<br />
== Introduction == <br />
<br />
Big data and deep learning have been merged to create the great success of artificial intelligence which increases the burden on the network's speed, computational complexity, and storage in many applications. In recent literature studies, deep neural networks out-performed in image classification, one of the main tasks in the computer vision domain. Recently, they tend to use different image classification models on the cloud to share the computational power between the different users as mentioned in this paper (e.g., SenseTime, Baidu Vision and Google Vision, etc.). Most of the researchers in the literature work to improve the structure and increase the depth of DNNs to achieve better performance from the point of how the features are represented and crafted using Conventional Neural Networks (CNNs). Most well-known image classification datasets (e.g. ImageNet) are compressed using JPEG which is a commonly used compression technique. JPEG is optimized for Human Visual System (HVS) but not the machines (i.e. DNNs). To be aligned with HVS the authors reconfigure the JPEG while maintaining the same classification accuracy. <br />
<br />
'''Why is image compression important?'''<br />
<br />
Image compression is crucial in deep learning because we want the image data to take up less disk space and be loaded faster. Compared to the lossless compression PNG, which preserves the original image data, JPEG is a lossy form of compression meaning some information will be lost for the benefit of an improved compression ratio. Therefore, it is important to develop deep learning model-based image compression methods that reduce data size without jeopardizing classification accuracy. Some examples of this type of image compression include the LSTM-based approach proposed by Google [9], the transformation-based method from New York University [10], and the autoencoder-based approach by Twitter [11].<br />
<br />
== Methodology ==<br />
<br />
[[File: ada-fig2.PNG | 400px | center]]<br />
<div align="center">'''Figure 1:''' Comparing to the conventional solution, the authors [1] solution can update the compression strategy based on the backend model feedback </div><br />
<br />
One of the major parameters that can be changed in the JPEG pipeline is the quantization table, which is the main source of artifacts added in the image to make lossless compression as shown in [1, 4]. The authors are motivated to change the JPEG configuration to optimize the uploading rate to different cloud computer vision services without reconfiguration of the original model and dataset. This contrasts with the authors in [2, 3, 5] where they adjust the JPEG configuration by retraining the parameters or according to the structure of the model. The lack of undefined quantization level decreases the image rate and quality but the deep learning model can still recognize it as shown in [4]. The authors in [1] used Deep Reinforcement learning (DRL) in an online manner to choose the quantization level to upload an image to the cloud for the computer vision model and this is the only approach to design an adaptive JPEG based on an ''RL mechanism''.<br />
<br />
The approach is designed based on an interactive training environment that represents any computer vision cloud service. A deep Q neural network agent is used to evaluate and predict the performance of quantization level on an uploaded image. They feed the agent with a reward function that considers two optimization parameters: accuracy and image size. It works like an iterative agent interacting with the environment. The environment is exposed to different images with different virtual redundant information that needs an adaptive solution for each image to select the suitable compression level for the model. Thus, they design an explore-exploit mechanism to train the agent on different scenery which is designed in deep Q agent as an inference-estimate-retain mechanism to restart the training procedure for each image. The authors verify their approach by providing some analysis and insight using Grad-Cam [8] by showing some patterns of how a compression level is chosen for each image with its own corresponding quality factor. Each image shows a different response when shown to a deep learning model. In general, images are more sensitive to compression if they have large smooth areas, while those with complex textures are more robust to compression.<br />
<br />
'''What is a quantization table?'''<br />
<br />
Before getting to the quantization table, we first look at the basic architecture of JPEG's baseline system. This has 4 blocks, which are the FDCT (Fast Discrete Cosine Transformation), quantizer, statistical model, and entropy encoder. The FCDT block takes an input image separated into <math> n \times n </math> blocks and applies a discrete cosine transformation creating DCT terms. These DCT terms are values from a relatively large discrete set that will be then mapped through the process of quantization to a smaller discrete set. This is accomplished with a quantization table at the quantizer block, which is designed to preserve low-frequency information at the cost of the high-frequency information. This preference for low-frequency information is made because losing high-frequency information isn't as impactful to the image when perceived by a humans visual system [12].<br />
<br />
== Problem Formulation ==<br />
<br />
The authors formulate the problem by referring to the cloud deep learning service as <math> \vec{y}_i = M(x_i)</math> to predict results <math> \vec{y}_i </math> for an input image <math> x_i </math>, and for reference input <math> x \in X_{\rm ref} </math> the output is <math> \vec{y}_{\rm ref} = M(x_{\rm ref}) </math>. It is referred <math> \vec{y}_{\rm ref} </math> as the ground truth label and also <math> \vec{y}_c = M(x_c) </math> for compressed image <math> x_{c} </math> with quality factor <math> c </math>.<br />
<br />
<br />
\begin{align} \tag{1} \label{eq:accuracy}<br />
\mathcal{A} =& \sum_{k}\min_jd(l_j, g_k) \\ <br />
& l_j \in \vec{y}_c, \quad j=1,...,5 \nonumber \\<br />
& g_k \in \vec{y}_{\rm ref}, \quad k=1, ..., {\rm length}(\vec{y}_{\rm ref}) \nonumber \\<br />
& d(x, y) = 1 \ \text{if} \ x=y \ \text{else} \ 0 \nonumber<br />
\end{align}<br />
<br />
The authors divided the used datasets according to their contextual group <math> X </math> according to [6] and they compare their results using compression ratio <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>, where <math>s_{c}</math> is the compressed size and <math>s_{\rm ref}</math> is the original size, and accuracy metric <math> \mathcal{A}_c </math> which is calculated based on the hamming distance of Top-5 of the output of softmax probabilities for both original and compressed images as shown in Eq. \eqref{eq:accuracy}. In the RL designing stage, continuous numerical vectors are represented as the input features to the DRL agent which is a Deep Q Network (DQN). The challenges of using this approach are: <br />
(1) The state space of RL is too large to cover, so the neural network is typically constructed with more convolutional and fully-connected layers. The resulting DRL agent converges slowely and the training time is prohibitive.<br />
(2) The DRL always starts with a random initial state, but it needs to find a higher reward before starting the training of the DQN. However, the sparse reward feedback resulting from a random initialization makes learning difficult.<br />
The authors solve this problem by using a pre-trained compact model called MobileNetV2 as a feature extractor <math> \mathcal{E} </math> because it is lightweight and performs well in image classification. It is fixed during training the Q Network <math> \phi </math>. The last convolution layer of <math> \mathcal{E} </math> is connected as an input to the Q Network <math>\phi </math>, so by optimizing the parameters of the Q network <math> \phi </math>, the RL agent's policy is updated.<br />
<br />
==Reinforcement learning framework==<br />
<br />
This paper [1] described the reinforcement learning problem as <math> \{\mathcal{X}, M\} </math> to be ''emulator environment'', where <math> \mathcal{X} </math> is defining the contextual information created as an input from the user <math> x </math> and <math> M </math> is the backend cloud model. Each RL frame must be defined by ''action and state'', the action is known by 10 discrete quality levels ranging from 5 to 95 by step size of 10 and the state is feature extractor's output <math> \mathcal{E}(J(\mathcal{X}, c)) </math>, where <math> J(\cdot) </math> is the JPEG output at specific quantization level <math> c </math>. They found the optimal quantization level at time <math> t </math> is <math> c_t = {\rm argmax}_cQ(\phi(\mathcal{E}(f_t)), c; \theta) </math>, where <math> Q(\phi(\mathcal{E}(f_t)), c; \theta) </math> is action-value function, <math> \theta </math> indicates the parameters of Q network <math> \phi </math>. In the training stage of RL, the goal is to minimize a loss function <math> L_i(\theta_i) = \mathbb{E}_{s, c \sim \rho (\cdot)}\Big[\big(y_i - Q(s, c; \theta_i)\big)^2 \Big] </math> that changes at each iteration <math> i </math> where <math> s = \mathcal{E}(f_t) </math> and <math>f_t</math> is the output of the JPEG, and <math> y_i = \mathbb{E}_{s' \sim \{\mathcal{X}, M\}} \big[ r + \gamma \max_{c'} Q(s', c'; \theta_{i-1}) \mid s, c \big] </math> is the target that has a probability distribution <math> \rho(s, c) </math> over sequences <math> s </math> and quality level <math> c </math> at iteration <math> i </math>, and <math> r </math> is the feedback reward. <br />
<br />
The framework get more accurate estimation from a selected action when the distance of the target and the action-value function's output <math> Q(\cdot)</math> is minimized. As a results, no feedback signal can tell that an episode has finished a condition value <math>T</math> that satisfies <math> t \geq T_{\rm start} </math> to guarantee to store enough transitions in the memory buffer <math> D </math> to train on. To create this transitions for the RL agent, random trials are collected to observe environment reaction. After fetching some trials from the environment with their corresponding rewards, this randomness is decreased as the agent is trained to minimize the loss function <math> L </math> as shown in the Algorithm below. Thus, it optimizes its actions on a minibatch from <math> \mathcal{D} </math> to be based on historical optimal experience to train the compression level predictor <math> \phi </math>. When this trained predictor <math> \phi </math> is deployed, the RL agent will drive the compression engine with the adaptive quality factor <math> c </math> corresponding to the input image <math> x_{i} </math>. <br />
<br />
The interaction between the agent and environment <math> \{\mathcal{X}, M\} </math> is evaluated using the reward function, which is formulated, by selecting an appropriate action of quality factor <math> c </math>, to be directly proportional to the accuracy metric <math> \mathcal{A}_c </math>, and inversely proportional to the compression rate <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>. As a result, the reward function is given by <math> R(\Delta s, \mathcal{A}) = \alpha \mathcal{A} - \Delta s + \beta</math>, where <math> \alpha </math> and <math> \beta </math> to form a linear combination.<br />
<br />
[[File:Alg2.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Algroithim :''' Training RL agent <math> \phi </math> in environment <math> \{\mathcal{X}, M\} </math> </div><br />
<br />
== Inference-Estimate-Retrain Mechanism ==<br />
The system diagram, AdaCompress, is shown in figure 3 in contrast to the existing modules. When the AdaCompress is deployed, the input images scenery context <math> \mathcal{X} </math> may change, in this case the RL agent’s compression selection strategy may cause the overall accuracy to decrease. So, in order to solve this issue, the estimator will be invoked with probability <math>p_{\rm est} </math>. This will be done by generating a random value <math> \xi \in (0,1) </math> and the estimator will be invoked if <math>\xi \leq p_{\rm est}</math>. Then AdaCompress will upload both the original image and the compressed image to fetch their labels. The accuracy will then be calculated and the transition, which also includes the accuracy in this step, will be stored in the memory buffer. Comparing recent the n steps' average accuracy with earliest average accuracy, the estimator will then invoke the RL training kernel to retrain if the recent average accuracy is much lower than the initial average accuracy.<br />
<br />
[[File: diagfig.png|500px|center]]<br />
<br />
The authors solved the change in the scenery at the inference phase that might cause learning to diverge by introducing '''running-estimate-retain mechanism'''. They introduced estimator with probability <math> p_{\rm est} </math> that changes in an adaptive way and it is compared a generated random value <math> \xi \in (0,1) </math>. As shown in Figure 2, Adacompression is switching between three states in an adaptive way as will be shown in the following sections.<br />
<br />
[[File:fig3.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 2:''' State Switching Policy </div><br />
<br />
=== Inference State ===<br />
The '''inference state''' is running most of the time at which the deployed RL agent is trained and used to predict the compression level <math> c </math> to be uploaded to the cloud with minimum uploading traffic load. The agent will eventually switch to the estimator stage with probability <math> p_{\rm est} </math> so it will be robust to any change in the scenery to have a stable accuracy. The <math> p_{\rm est} </math> is fixed at the inference stage but changes in an adaptive way as a function of accuracy gradient in the next stage. In '''estimator state''', there will be a trade off between the objective of reducing upload traffic and the risk of changing the scenery, an accuracy-aware dynamic <math> p'_{\rm est} </math> is designed to calculate the average accuracy <math> \mathcal{A}_n </math> after running for defined <math> N </math> steps according to Eq. \ref{eqn:accuracy_n}.<br />
\begin{align} \tag{2} \label{eqn:accuracy_n}<br />
\bar{\mathcal{A}_n} &=<br />
\begin{cases}<br />
\frac{1}{n}\sum_{i=N-n}^{N} \mathcal{A}_i & \text{ if } N \geq n \\ <br />
\frac{1}{n}\sum_{i=1}^{n} \mathcal{A}_i & \text{ if } N < n <br />
\end{cases}<br />
\end{align}<br />
===Estimator State===<br />
The '''estimator state''' is executed when <math> \xi \leq p_{\rm est} </math> is satisfied , where the uploaded traffic is increased as the both the reference image <math> x_{ref} </math> and compressed image <math> x_{i} </math> are uploaded to the cloud to calculate <math> \mathcal{A}_i </math> based on <math> \vec{y}_{\rm ref} </math> and <math> \vec{y}_i </math>. It will be stored in the memory buffer <math> \mathcal{D} </math> as a transition <math> (\phi_i, c_i, r_i, \mathcal{A}_i) </math> of trial <math>i</math>. The estimator will not be anymore suitable for the latest <math>n</math> step when the average accuracy <math> \bar{\mathcal{A}}_n </math> is lower than the earliest <math>n</math> steps of the average <math> \mathcal{A}_0 </math> in the memory buffer <math> \mathcal{D} </math>. Consequently, <math> p_{\rm est} </math> should be changed to higher value to make the estimate stage frequently happened.It is obviously should be a function in the gradient of the average accuracy <math> \bar{\mathcal{A}}_n </math> in such a way to fell the buffer memory <math> \mathcal{D} </math> with some transitions to retrain the agent at a lower average accuracy <math> \bar{\mathcal{A}}_n </math>. The authors formulate <math> p'_{\rm est} = p_{\rm est} + \omega \nabla \bar{\mathcal{A}} </math> and <math> \omega </math> is a scaling factor. Initially the estimated probability <math> p_0 </math> will be a function of <math> p_{\rm est} </math> in the general form of <math>p_{\rm est} = p_0 + \omega \sum_{i=0}^{N} \nabla \bar{\mathcal{A}_i} </math>. <br />
<br />
===Retrain State===<br />
In '''retrain state''', the RL agent is trained to adapt on the change of the input scenery on the stored transitions in the buffer memory <math> \mathcal{D} </math>. The retain stage is finished at the recent <math> n </math> steps when the average reward <math> \bar{r}_n </math> is higher than a defined <math> r_{th}</math> by the user. Afterward, a new retraining stage should be prepared by saving new next transitions after flushing the old buffer memory <math> \mathcal{D}</math>. The authors supported their compression choice for different cloud application environments by providing some insights by introducing a visualization algorithm [8] to some images with their corresponding quality factor <math> c </math>. The visualization shows that the agent chooses a certain quantization level <math> c </math> based on the visual textures in the image at the different regions. For instant, a low-quality factor is selected for the rough central region so there is a smooth area surrounded it but for the surrounding smooth region, the agent chooses a relatively higher quality rather than the central region.<br />
<br />
<br />
<br />
==Insight of RL agent’s behavior==<br />
In the inference state, the RL agent predicts a proper compression level based on the features of the input image. In the next subsection, we will see that this compression level varies for different image sets and backend cloud services. Also, by taking a look at the attention maps for some of the images, we will figure out why the agent has chosen this compression level.<br />
===Compression level choice variation===<br />
In Figure 5, for Face++ and Amazon Rekognition, the agent’s choices are mostly around compression level = 15, but for Baidu Vision, the agent’s choices are distributed more evenly. Therefore, the backend strategy really affects the choice for the optimal compression level.<br />
<br />
[[File:comp-level1.PNG|500px|center|fig: running-retrain]]<br />
In figure 6, we will see how the agent's behaviour in selecting the optimal compression level changes for different datasets. The two datasets, ImageNet and DNIM present different contextual sceneries. The images mostly taken at daytime were randomly selected from ImageNet and the images mostly taken at night time were selected from DNIM. The figure 6 shows that for DNiM images, the agent's choices are mostly concentrated in relatively high compression levels, whereas for the ImageNet dataset, the agent's choices are distributed more evenly. <br />
<br />
[[File:comp-level2.PNG|500px|center|fig: running-retrain]]<br />
<br />
<br />
<br />
<br />
<br />
<br />
== Results ==<br />
The authors reported in Figure 3, 3 different cloud services compared to the benchmark images. It is shown that more than half of the upload size while roughly preserving the top-5 accuracy calculated by using A with an average of 7% proving the efficiency of the design. In Figure 4, it shows the ''' inference-estimate-retain ''' mechanism as the x-axis indicates steps, while <math> \Delta </math> mark on <math>x</math>-axis is reveal as a change in the scenery. In Figure 4, the estimating probability <math> p_{\rm est} </math> and the accuracy are inversely proportion as the accuracy drops below the initial value the <math> p_{\rm est} </math> increase adaptive as it considers the accuracy metric <math> \mathcal{A}_c </math> each action <math> c </math> making the average accuracy to decrease in the next estimations. At the red vertical line, the scenery started to change, and <math>Q</math> Network start to retrain to adapt the agent on the current scenery. At retrain stage, the output result is always use from the reference image's prediction label <math> \vec{y}_{\rm ref} </math>. <br />
Also, they plotted the scaled uploading data size of the proposed algorithm, and the overhead data size for the benchmark is shown in the inference stage. After the average accuracy became stable and high, the transmission is reduced by decreasing the <math> p_{\rm est} </math> value. As a result, <math> p_{\rm est} </math> and <math> \mathcal{A} </math> will be always equal to 1. During this stage, the uploaded file is more than the conventional benchmark. In the inference stage, the uploaded size is halved as shown in both Figures 3, 4.<br />
[[File:upload overhead.png|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 3:''' Difference in overhead of size during training and inference phase </div><br />
<br />
[[File:ada-fig9.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 4:''' Different cloud services compared relative to average size and accuracy </div><br />
<br />
<br />
[[File:ada-fig10.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 5:''' Scenery change response from AdaCompress Algorithm </div><br />
<br />
[[File:CaptureADA.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 6:''' Latency between image upload and inference result feedback </div><br />
<br />
<br />
<br />
==Conclusion==<br />
<br />
Most of the research focused on modifying the deep learning model instead of dealing with the currently available approaches. The authors succeed in defining the compression level for each uploaded image to decrease the size and maintain the top-5 accuracy in a robust manner even the scenery is changed. <br />
In my opinion, Eq. \eqref{eq:accuracy} is not defined well as I found it does not really affect the reward function. Also, they did not use the whole training set from ImageNet which raises the question of what is the higher file size that they considered from in the mention current set. In addition, if they considered the whole data set, should we expect the same performance for the mechanism or better. I believe it would be better in both accuracy and compression.<br />
<br />
== Critiques == <br />
<br />
The authors used a pre-trained model as a feature extractor to select a Quality Factor (QF) for the JPEG. I think what would be missing that they did not report the distribution of each of their span of QFs as it is important to understand which one is expected to contribute more to the whole datasets used. The authors did not run their approach on a complete database like ImageNet, they only included a part of two different datasets. I know they might have limitations in the available datasets to test like CIFARs, as they are not totally comparable from the resolution perspective for the real online computer vision services work with higher resolutions. <br />
In the next section, I have done one experiment using Inception-V3 to understand if it is possible to get better accuracy. I found that it is possible by using the inception model as a pre-trained model to choose a lower QF, but as well known that the mobile models are shallower than the inception models which makes it less complex to run on edge devices. I think it is possible to achieve at least the same accuracy or even more if we replaced the mobile model with the inception as shown in the section.<br />
<br />
=== Extra Analysis ===<br />
In the following figure, I took a single image from ImageNet with the ground truth of ''' Sea Sneak ''' and encoded it with a QF of 20. I run the inference on Inception V3 that is benchmarked by TensorFlow. The Human Visual System (HVS) can not recognize it. Therefore, we should expect that the trained model will not get the right ground truth as the same model is aligned with the HVS as they both trained on the same perception. In table 1, showing that the compressed image will be more recognizable by the machine than the human to be within the top-5 accuracy, where we expected that the machine will not be to recognize it. This means that the machine has a different perception than the Human visual system.<br />
<br />
<br />
<br/><br />
[[File:adacomp_sea_snake.jpg|500px|center]]<br />
<br/><br />
<div align="center">'''Figure 5:''' Sea Snake Image from ImageNet compressed with QF = 20 </div><br />
<br />
<br/><br />
[[File:adacomp table 3.PNG|500px|center]]<br />
<br/><br />
<div align="center">'''Table 1:''' Sea Snake Image prediction probability using the original image and the compressed one</div><br />
<br />
The paper succeeds in reducing both the average upload size and overall latency. This has many practical applications in edge computing. The extra analysis section is very interesting. It seems that the model successfully recognizes aspects of the image that humans cannot interpret. It would be interesting to look at the actual activations of the neural network to see if there are some intuitive features that the model learns.<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AhmedHussKhalifa/AdaCompress<br />
<br />
== References ==<br />
<br />
[1] Hongshan Li, Yu Guo, Zhi Wang, Shutao Xia, and Wenwu Zhu, “Adacompress: Adaptive compression for online computer vision services,” in Proceedings of the 27th ACM International Conference on Multimedia, New York, NY, USA, 2019, MM ’19, pp. 2440–2448, ACM.<br />
<br />
[2] Zihao Liu, Tao Liu, Wujie Wen, Lei Jiang, Jie Xu, Yanzhi Wang, and Gang Quan, “DeepN-JPEG: A deep neural network favorable JPEG-based image compression<br />
framework,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 18.<br />
<br />
[3] Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski, “Faster neural networks straight from jpeg,” in Advances in Neural Information Processing Systems, 2018, pp. 3933–3944.<br />
<br />
[4] Kresimir Delac, Mislav Grgic, and Sonja Grgic, “Effects of jpeg and jpeg2000 compression on face recognition,” in Pattern Recognition and Image Analysis, Sameer Singh, Maneesha Singh, Chid Apte, and Petra Perner, Eds., Berlin, Heidelberg, 2005, pp. 136–145, Springer Berlin Heidelberg.<br />
<br />
[5] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool, “Towards image understanding from deep compression without decoding,” 2018.<br />
<br />
[6] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy, “Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, New York, NY, USA, 2016, MobiSys ’16, pp. 123–136, ACM.<br />
<br />
[7] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018.<br />
<br />
[8] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra, “Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization,” CoRR, vol. abs/1610.02391, 2016.<br />
<br />
[9] George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, Rahul Sukthankarm, Variable Rate Image Compression with Recurrent Neural Networks, ICLR 2016, arXiv:1511.06085<br />
<br />
[10] Johannes Ballé, Valero Laparra, Eero P. Simoncelli, End-to-end Optimized Image Compression, ICLR 2017, arXiv:1611.01704<br />
<br />
[11] Lucas Theis, Wenzhe Shi, Andrew Cunningham, Ferenc Huszár, Lossy Image Compression with Compressive Autoencoders, ICLR 2017, arXiv:1703.00395<br />
<br />
[12] Kauffmann L, Ramanoël S and Peyrin C (2014) The neural bases of spatial frequency processing during scene perception. Front. Integr. Neurosci. 8:37. doi: 10.3389/fnint.2014.00037</div>Ahamsalahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=what_game_are_we_playingwhat game are we playing2020-11-12T00:15:19Z<p>X277lu: /* Introduction */</p>
<hr />
<div>== Authors == <br />
Yuxin Wang, Evan Peters, Yifan Mou, Sangeeth Kalaichanthiran <br />
<br />
== Introduction ==<br />
Recently, there have been intensive studies of methods using AI to search for the optimal solution of large-scale, zero-sum and extensive form problems. However, most of these works operate under the assumption that the parameters of the game are known, while this scenario is ideal. In real-world problems, most of the parameters are unknown prior to the game. This paper proposes a framework to find an optimal solution using a primal-dual Newton Method and then using back-propagation to analytically compute the gradients of all the relevant game parameters.<br />
<br />
The approach to solving this problem is to consider ''quantal response equilibrium'' (QRE), which is a generalization of Nash equilibrium (NE) where the agents can make suboptimal decisions. It is shown that the solution to the QRE is a differentiable function of the payoff matrix. Consequently, back-propagation can be used to analytically solve for the payoff matrix (or other game parameters). This strategy has many future application areas as it allows for game-solving (both extensive and normal form) to be integrated as a module in a deep neural network.<br />
<br />
An example of architecture is presented below:<br />
<br />
[[File:Framework.png ]]<br />
<br />
Payoff matrix <math> P </math> is parameterized by a domain-dependent low dimensional vector <math> \phi </math>, where <math> \phi </math> depends on a differentiable function <math> M_1(x) </math>. Furthermore, <math> P </math> is applied to QRE to get the equilibrium strategies <math> (u^∗, v^∗) </math>. Lastly, the loss function is calculated after applying any differentiable <math> M_2(u^∗, v^∗) </math>.<br />
<br />
The effectiveness of this model is demonstrated using the games “Rock, Paper, Scissors”, one-card poker, and a security defense game. This could represent a substantial step forward in understanding how game-theoretic methods can be applied to uncertain settings by the ability to learn solely from observed actions and the relevant parameters.<br />
<br />
== Learning and Quantal Response in Normal Form Games ==<br />
<br />
The game-solving module provides all elements required in differentiable learning, which maps contextual features to payoff matrices, and computes equilibrium strategies under a set of contextual features. This paper will learn zero-sum games and start with normal form games since they have game solver and learning approach capturing much of intuition and basic methodology.<br />
<br />
=== Zero-Sum Normal Form Games ===<br />
<br />
A normal form game is defined as a game with a finite number of players, <math>N</math>, where each player has a non-empty and finite action space, <math>A_i</math>. The outcome of the game is defined by the action profile <math>a = (a_1,...,a_n)</math>, where <math>a_i</math> is the action taken by the <math>i^{th}</math> player. In addition, each player has an associated utility function that is given by <math>u_i: A_1\times...\times A_n \rightarrow\mathbb{R}</math>. An example of a normal form game is the prisoner's dilemma. Zero-sum games refer to a game in which one player gains at the detriment of the other player. Mathematically, <math>\forall a\in A_1\times A_2, u_2(a) + u_2(a) = 0</math> (Larson, "Game-theoretic methods for computer science Normal Form Games"). <br />
<br />
In two-player zero-sum games there is a '''payoff matrix''' <math>P</math> that describes the rewards for two players employing specific strategies u and v respectively. The optimal strategy mixture may be found with a classic min-max formulation:<br />
$$\min_u \max_v \ u^T P v \\ subject \ to \ 1^T u =1, u \ge 0 \\ 1^T v =1, v \ge 0. \ $$<br />
<br />
Here, we consider the case where <math>P</math> is not known a priori. <math>P</math> could be a single fixed but unknown payoff<br />
matrix, or depend on some external context <math>x</math>. The solution <math> (u^*, v_0) </math> to this optimization and the solution <math> (u_0,v^*) </math> to the corresponding problem with inverse player order form the Nash equilibrium <math>(u^*,v^*) </math>. At this equilibrium, the players do not have anything to gain by changing their strategy, so this point is a stable state of the system. When the payoff matrix P is not known, we observe samples of actions <math> a^{(i)}, i =1,...,N </math> from one or both players, which depends on some external content <math> x </math>, sampled from the equilibrium strategies <math>(u^*,v^*) </math>, to recover the true underlying payoff matrix P or a function form P(x) depending on the current context.<br />
<br />
=== Quantal Response Equilibria ===<br />
<br />
However, NE is poorly suited because NEs are overly strict, discontinuous with respect to P, and may not be unique. To address these issues, model the players' actions with the '''quantal response equilibria''' (QRE), where noise is added to the payoff matric. Specifically, consider the ''logit'' equilibrium for zero-sum games that obeys the fixed point:<br />
$$<br />
u^* _i = \frac {exp(-Pv)_i}{\sum_{q \in [n]} exp (-Pv)_q}, \ v^* _j= \frac {exp(P^T u)_j}{\sum_{q \in [m]} exp (P^T u)_q} .\qquad \ (1)<br />
$$<br />
For a fixed opponent strategy, the logit equilibrium corresponding to a strategy is strictly convex, and thus the regularized best response is unique.<br />
<br />
=== End-to-End Learning ===<br />
<br />
Then to integrate a zero-sum solver, [1] introduced a method to solve the QRE and to differentiate through its solution.<br />
<br />
'''QRE solver''':<br />
To find the fixed point in (1), it is equivalent to solve the regularized min-max game:<br />
$$<br />
\min_{u \in \mathbb{R}^n} \max_{v \in \mathbb{R}^m} \ u^T P v -H(v) + H(u) \\<br />
\text{subject to } 1^T u =1, \ 1^T v =1, <br />
$$<br />
where H(y) is the Gibbs entropy <math> \sum_i y_i log y_i</math>.<br />
Entropy regularization guarantees the non-negative condition and makes the equilibrium continuous with respect to P, which means players are encouraged to play more randomly, and all actions have a non-zero probability. Moreover, this problem has a unique saddle point corresponding to <math> (u^*, v^*) </math>.<br />
<br />
Using a primal-dual Newton Method to solve the QRE for two-player zero-sum games, the KKT conditions for the problem are:<br />
$$ <br />
Pv + \log(u) + 1 +\mu 1 = 0 \\<br />
P^T v -\log(v) -1 +\nu 1 = 0 \\<br />
1^T u = 1, \ 1^T v = 1, <br />
$$<br />
where <math> (\mu, \nu) </math> are Lagrange multipliers for the equality constraints on u, v respectively. Then applying Newton's method gives the the update rule:<br />
$$<br />
Q \begin{bmatrix} \Delta u \\ \Delta v \\ \Delta \mu \\ \Delta \nu \\ \end{bmatrix} = - \begin{bmatrix} P v + \log u + 1 + \mu 1 \\ P^T u - \log v - 1 + \nu 1 \\ 1^T u - 1 \\ 1^T v - 1 \\ \end{bmatrix}, \qquad (2)<br />
$$<br />
where Q is the Hessian of the Lagrangian, given by <br />
$$ <br />
Q = \begin{bmatrix} <br />
diag(\frac{1}{u}) & P & 1 & 0 \\ <br />
P^T & -diag(\frac{1}{v}) & 0 & 1\\<br />
1^T & 0 & 0 & 0 \\<br />
0 & 1^T & 0 & 0 \\<br />
\end{bmatrix}. <br />
$$<br />
<br />
'''Differentiating Through QRE Solutions''':<br />
The QRE solver provides a method to compute the necessary Jacobian-vector products. Specifically, we compute the gradient of the loss given the solution <math> (u^*,v^*) </math> to the QRE, and some loss function <math> L(u^*,v^*) </math>: <br />
<br />
1. Take differentials of the KKT conditions: <br />
<math><br />
Q \begin{bmatrix} <br />
du & dv & d\mu & d\nu \\ <br />
\end{bmatrix} ^T = \begin{bmatrix} <br />
-dPv & -dP^Tu & 0 & 0 \\ <br />
\end{bmatrix}^T. \ <br />
</math><br />
<br />
2. For small changes du, dv, <br />
<math><br />
dL = \begin{bmatrix} <br />
v^TdP^T & u^TdP & 0 & 0 \\ <br />
\end{bmatrix} Q^{-1} \begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
3. Apply this to P, and take limits as dP is small:<br />
<math><br />
\nabla_P L = y_u v^T + u y_v^T, \qquad (3)<br />
</math> where <br />
<math><br />
\begin{bmatrix} <br />
y_u & y_v & y_{\mu} & y_{\nu}\\ <br />
\end{bmatrix}=Q^{-1}\begin{bmatrix} <br />
-\nabla_u L & -\nabla_v L & 0 & 0 \\ <br />
\end{bmatrix}^T.<br />
</math><br />
<br />
Hence, the forward pass is given by using the expression in (2) to solve for the logit equilibrium given P, and the backward pass is given by using <math> \nabla_u L </math> and <math> \nabla_v L </math> to obtain <math> \nabla_P L </math> using (3). There does not always exist a unique P which generates <math> u^*, v^* </math> under the logit QRE, and we cannot expect to recover P when under-constrained.<br />
<br />
== Learning Extensive form games ==<br />
<br />
The normal form representation for games where players have many choices quickly becomes intractable. For example, consider a chess game: One the first turn, player 1 has 20 possible moves and then player 2 has 20 possible responses. If in the following number of turns each player is estimated to have ~30 possible moves and if a typical game is 40 moves per player, the total number of strategies is roughly <math>10^{120} </math> per player (this is known as the Shannon number for game-tree complexity of chess) and so the payoff matrix for a typical game of chess must therefore have <math> O(10^{240}) </math> entries.<br />
<br />
Instead, it is much more useful to represent the game graphically as an "''' Extensive form game'''" (EFG). We'll also need to consider types of games where there is '''imperfect information''' - players do not necessarily have access to the full state of the game. An example of this is one-card poker: (1) Each player draws a single card from a 13-card deck (ignore suits) (2) Player 1 decides whether to bet/hold (3) Player 2 decides whether to call/raise (4) Player 1 must either call/fold if Player 2 raised. From this description, player 1 has <math> 2^{13} </math> possible first moves (all combinations of (card, raise/hold)) and has <math> 2^{13} </math> possible second moves (whenever player 1 gets a second move) for a total of <math> 2^{26} </math> possible strategies. In addition, Player 1 never knows what cards player 2 has and vice versa. So instead of representing the game with a huge payoff matrix, we can instead represent it as a simple decision tree (for a ''single'' drawn the card of player 1):<br />
<br />
<br />
<center> [[File:1cardpoker.PNG]] </center><br />
<br />
where player 1 is represented by "1", a node that has two branches corresponding to the allowed moves of player 1. However there must also be a notion of information available to either player: While this tree might correspond to say, player 1 holding a "9", it contains no information on what card player 2 is holding (and is much simpler because of this). This leads to the definition of an '''information set''': the set of all nodes belonging to a single player for which the other player cannot distinguish which node has been reached. The information set may therefore be treated as a node itself, for which actions stemming from the node must be chosen in ignorance of what the other player did immediately before arriving at the node. In the poker example, the full game tree consists of a much more complex version of the tree shown above (containing repetitions of the given tree for every possible combination of cards dealt) and the and an example of the information set for player 1 is the set of all of the nodes owned by player 2 that immediately follow player 1's decision to hold. In other words, if player 1 holds there are 13 possible nodes describing the responses of player 2 (raise/hold for player 2 having card = ace, 1, ... King), and all 13 of these nodes are indistinguishable to player 1, and so form the information set for player 1.<br />
<br />
The following is a review of important concepts for extensive form games first formalized in [2]. Let <math> \mathcal{I}_i </math> be the set of all information sets for player i, and for each <math> t \in \mathcal{I}_i </math> let <math> \sigma_t </math> be the actions taken by player i to arrive at <math> t </math> and <math> C_t </math> be the actions that player i can take from <math> u </math>. Then the set of all possible sequences that can be taken by player i is given by<br />
<br />
$$<br />
S_i = \{\emptyset \} \cup \{ \sigma_t c | u\in \mathcal{I}_i, c \in C_t \}<br />
$$<br />
<br />
So for the one-card poker we would have <math>S_1 = \{\emptyset, \text{raise}, \text{hold}, \text{hold-call}, \text{hold-fold\} }</math>. From the possible sequences follows two important concepts:<br />
<ol><br />
<li>The EFG '''payoff matrix''' <math> P </math> of size <math>|S_1| \times |S_2| </math> (this is all possible actions available to either player), is populated with rewards from each leaf of the tree (or "zero" for each <math> (s_1, s_2) </math> that is an invalid pair), and the expected payoff for realization plans <math> (u, v) </math> is given by <math> u^T P v </math> </li><br />
<li> A '''realization plan''' <math> u \in \mathbb{R}^{|S_1|} </math> for player 1 (<math> v \in \mathbb{R}^{|S_2|} </math> for player 2 ) will describe probabilities for players to carry out each possible sequence, and each realization plan must be constrained by (i) compatibility of sequences (e.g. "raise" is not compatible with "hold-call") and (ii) information sets available to the player. These constraints are linear:<br />
<br />
$$<br />
Eu = e \\<br />
Fv = f<br />
$$<br />
<br />
where <math> e = f = (1, 0, ..., 0)^T </math> and <math> E, F</math> contain entries in <math> {-1, 0, 1} </math> describing compatibility and information sets. </li><br />
<br />
</ol> <br />
<br />
<br />
The paper's main contribution is to develop a minimax problem for extensive form games:<br />
<br />
<br />
$$<br />
\min_u \max_v u^T P v + \sum_{t\in \mathcal{I}_1} \sum_{c \in C_t} u_c \log \frac{u_c}{u_{p_t}} - \sum_{t\in \mathcal{I}_2} \sum_{c \in C_t} v_c \log \frac{v_c}{v_{p_t}}<br />
$$<br />
<br />
where <math> p_t </math> is the action immediately preceding information set <math> t </math>. Intuitively, each sum resembles a cross-entropy over the distribution of probabilities in the realization plan comparing each probability to proceed from the information set to the probability to arrive at that information set. Importantly, these entropies are strictly convex or concave (for player 1 and player 2 respectively) [3] so that the min-max problem will have a unique solution and ''the objective function is continuous and continuously differentiable'' - this means there is a way to optimize the function. As noted in Theorem 1 of [1], the solution to this problem is equivalently a solution for the QRE of the game in reduced normal form.<br />
<br />
Minimax can also be seen from an algorithmic perspective. Referring to the above figure containing a tree, it contains a sequence of states and action which alternates between two or more competing players. The above formulation of the min-max problem essentially measures how well a decision rule is from the perspective of a single player. To describe it in terms of the tree, if it is player 1's turn, then it is a mutual recursion of player 1 choosing to maximize its payoff and player 2 choosing to minimize player 1's payoff.<br />
<br />
Having decided on a cost function, the method of Lagrange multipliers may be used to construct the Lagrangian that encodes the known constraints (<math> Eu = e \,, Fv = f </math>, and <math> u, v \geq 0</math>), and then optimize the Lagrangian using Newton's method (identically to the previous section). Accounting for the constraints, the Lagrangian becomes <br />
<br />
<br />
$$<br />
\mathcal{L} = g(u, v) + \sum_i \mu_i(Eu - e)_i + \sum_i \nu_i (Fv - f)_i<br />
$$<br />
<br />
where <math>g</math> is the argument from the minimax statement above and <math>u, v \geq 0</math> become KKT conditions. The general update rule for Newton's method may be written in terms of the derivatives of <math> \mathcal{L} </math> with respect to primal variables <math>u, v </math> and dual variables <math> \mu, \nu</math>, yielding:<br />
<br />
$$<br />
\nabla_{u,v,\mu,\nu}^2 \mathcal{L} \cdot (\Delta u, \Delta v, \Delta \mu, \Delta \nu)^T= - \nabla_{u,v,\mu,\nu} \mathcal{L}<br />
$$<br />
where <math>\nabla_{u,v,\mu,\nu}^2 \mathcal{L} </math> is the Hessian of the Lagrangian and <math>\nabla_{u,v,\mu,\nu} \mathcal{L} </math> is simply a column vector of the KKT stationarity conditions. Combined with the previous section, this completes the goal of the paper: To construct a differentiable problem for learning normal form and extensive form games.<br />
<br />
== Experiments ==<br />
<br />
The authors demonstrated learning on extensive form games in the presence of ''side information'', with ''partial observations'' using three experiments. In all cases, the goal was to maximize the likelihood of realizing an observed sequence from the player, assuming they act in accordance with the QRE. The authors found that the best way to implement the module was to use a medium to large batch size, RMSProp, or Adam optimizers with a learning rate between <math>\left[0.0001,0.01\right].</math> <br />
<br />
=== Rock, Paper, Scissors ===<br />
<br />
Rock, Paper, Scissors is a 2-player zero-sum game. For this game, the best strategy to reach a Nash equilibrium and a Quantal response equilibrium is to uniformly play each hand with equal odds.<br />
The first experiment was to learn a non-symmetric variant of Rock, Paper, Scissors with ''incomplete information'' with the following payoff matrix:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix of modified Rock-Paper-Scissors''<br />
! <br />
! ''Rock''<br />
! ''Paper''<br />
! ''Scissors''<br />
|-<br />
! ''Rock''<br />
| '''''0'''''<br />
| <math>-b_1</math><br />
| <math>b_2</math><br />
|-<br />
! ''Paper''<br />
| <math>b_1</math><br />
| '''''0'''''<br />
| <math>-b_3</math><br />
|-<br />
! ''Scissors''<br />
| <math>-b_2</math><br />
| <math>b_3</math><br />
| '''''0'''''<br />
|}<br />
<br />
where each of the <math> b </math> ’s are linear function of some features <math> x \in \mathbb{R}^{2} </math> (i.e., <math> b_y = x^Tw_y </math>, <math> y \in </math> {<math>1,2,3</math>} , where <math> w_y </math> are to be learned by the algorithm). Using many trials of random rewards the technique produced the following results for optimal strategies[1]: <br />
<br />
[[File:RPS Results.png|500px ]]<br />
<br />
From the graphs above, we can tell the following: 1) both parameters learned and predicted strategies improve with a larger dataset; 2) with a reasonably sized dataset, >1000 here, convergence is stable and is fairly quick.<br />
<br />
=== One-Card Poker ===<br />
<br />
Next they investigated extensive form games using the one-Card Poker (with ''imperfect information'') introduced in the previous section. In the experimental setup, they used a deck stacked non-uniformly (meaning repeat cards were allowed). Their goal was to learn this distribution of cards from observations of many rounds of the play. Different from the distribution of cards dealt, the method built in the paper is more suited to learn the player’s perceived or believed distribution of cards. It may even be a function of contextual features such as demographics of players. Three experiments were run with <math> n=4 </math>. Each experiment comprised 5 runs of training, with same weights but different training sets. Let <math> d \in \mathbb{R}^{n}, d \ge 0, \sum_{i} d_i = 1 </math> be the weights of the cards. The probability that the players are dealt cards <math> (i,j) </math> is <math> \frac{d_i d_j}{1-d_i} </math>. This distribution is asymmetric between players. Matrix <math> P, E, F </math> for the case <math> n=4 </math> are presented in [1]. With training for 2500 epochs, the mean squared error of learned parameters (card weights, <math> u, v </math> ) are averaged over all runs of and are presented as following [1]: <br />
<br />
<br />
[[File:One-card_Poker_Results.png|500px ]]<br />
<br />
=== Security Resource Allocation Game ===<br />
<br />
<br />
From Security Resource Allocation Game, they demonstrated the ability to learn from ''imperfect observations''. The defender possesses <math> k </math> indistinguishable and indivisible defensive resources which he splits among <math> n </math> targets, { <math> T_1, ……, T_n </math>}. The attacker chooses one target. If the attack succeeds, the attacker gets <math> R_i </math> reward and defender gets <math> -R_i </math>, otherwise zero payoff for both. If there are n defenders guarding <math> T_i </math>, probability of successful attack on <math> T_i </math> is <math> \frac{1}{2^n} </math>. The expected payoff matrix when <math> n = 2, k = 3 </math>, where the attackers are the row players is:<br />
<br />
{| class="wikitable" style="float:center; margin-left:1em; text-align:center;"<br />
|+ align="bottom"|''Payoff matrix when <math> n = 2, k = 3 </math>''<br />
! {#<math>D_1</math>,#<math>D_2</math>}<br />
! {0, 3}<br />
! {1, 2}<br />
! {2, 1}<br />
! {3, 0}<br />
|-<br />
! <math>T_1</math><br />
| <math>-R_1</math><br />
| <math>-\frac{1}{2}R_1</math><br />
| <math>-\frac{1}{4}R_1</math><br />
| <math>-\frac{1}{8}R_1</math><br />
|-<br />
! <math>T_2</math><br />
| <math>-\frac{1}{8}R_2</math><br />
| <math>-\frac{1}{4}R_2</math><br />
| <math>-\frac{1}{2}R_2</math><br />
| <math>-R_2</math><br />
|} <br />
<br />
<br />
For a multi-stage game the attacker can launch <math> t </math> attacks, one in each stage while defender can only stick with stage 1. The attacker may change target if the attack in stage 1 is failed. Three experiments are run with <math> n = 2, k = 5 </math> for games with single attack and double attack, i.e, <math> t = 1 </math> and <math> t = 2 </math>. The results of simulated experiments are shown below [1]:<br />
<br />
[[File:Security Game Results.png|500px ]]<br />
<br />
<br />
They learned <math> R_i </math> only based on observations of the defender’s actions and could still recover the game set by only observing the defender’s actions. Same as expectation, the larger dataset size improves the learned parameters. Two outliers are 1) Security Game, the green plot for when <math> t = 2 </math>; and 2) RPS, when comparing between training sizes of 2000 and 5000.<br />
<br />
== Conclusion ==<br />
Unsurprisingly, the results of this study show that in general, the quality of learned parameters improved as the number of observations increased. The network presented in this paper demonstrated improvement over the existing methodology. <br />
<br />
This paper presents an end-to-end framework for implementing a game solver, for both extensive and normal form, as a module in a deep neural network for zero-sum games. This method, unlike many previous works in this area, does not require the parameters of the game to be known to the agent prior to the start of the game. The two-part method analytically computes both the optimal solution and the parameters of the game. Future work involves taking advantage of the KKT matrix structure to increase computation speed and extensions to the area of learning general-sum games.<br />
<br />
== Critiques ==<br />
The proposed method appears to suffer from two flaws. Firstly, the assumption that players behave in accordance with the QRE severely limits the space of player strategies and is known to exhibit pathological behavior even in one-player settings. Second, the solvers are computationally inefficient and are unable to scale. For this setting, they used Newton's method and it is found that the second-order algorithms do not scale to large games as with Nash equilibrium [4]. <br />
<br />
In the one-card poker section, it might be better to write "Next, they investigated extensive form games ... It may even be a function of contextual features such as the demographics of players. ... 5 runs of training, with the same weights but different training sets."<br />
<br />
Zero-Sum Normal Form Games usually follow Gaussian distributions with two non-empty sets of strategies of player one and player two correspondingly, and also a payoff function defined on the set of possible realizations.<br />
<br />
This method of proposing that real players will follow a laid out scheme or playing strategy is almost always a drastic oversimplification of real-world scenarios and severely limits the applications. In order to better fit a model to the real players, there almost always has to be some sort of stochastic implementation which accounts for the presence of irrational users in the system. <br />
<br />
It might be a good idea to discuss the algorithm performance based on more complicated player reactions, for example, player 2 or NPC from the game might react based on player 1's action. If player 2 would react with a "clever" choice, we might be able to shrink the choice space, which might be helpful to accelerate the training time.<br />
<br />
The theory assumes rational players, which means that roughly speaking, the players make decisions based on increasing their respective payoffs (utility values, preferences,..). However, in real-life scenarios, this might not hold true. Thus, it would be a good idea to include considerations regarding this issue in this summary. Another area that is worth exploring is the need for things like “bluffing” in games with hidden information. It would be interesting to see whether the results generated in this paper are in support of "bluffing" or not.<br />
<br />
It is good that the authors of this paper provided analysis on different types of games, but it would be great if they could also provide some future insights on this method such as whether it can be applied to more complex games such as blackjack poker / it would work or not.<br />
<br />
The authors evaluated the performance by plotting the mean squared error (MSE) of the training data. Prediction error on testing data can also be provided to better evaluate the prediction power.<br />
<br />
== References ==<br />
<br />
[1] Ling, C. K., Fang, F., & Kolter, J. Z. (2018). What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.<br />
<br />
[2] B. von Stengel. Efficient computation of behavior strategies.Games and Economics Behavior,14(0050):220–246, 1996.<br />
<br />
[3] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.<br />
<br />
[4] Farina, G., Kroer, C., & Sandholm, T. (2019). Online Convex Optimization for Sequential Decision Processes and Extensive-Form Games. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 1917-1925. https://doi.org/10.1609/aaai.v33i01.33011917<br />
<br />
[5] Larson, K. Game-theoretic methods for computer science Normal Form Games. Lecture presented at CS 886 in University of Waterloo, Waterloo. Retrieved from https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiL24iwprrtAhWJFFkFHU52D0MQFjAQegQIIhAC&url=https://cs.uwaterloo.ca/~klarson/teaching/F06-886/slides/normalform.pdf&usg=AOvVaw0j9fw9oRkUGZOsY8ZCyv6i</div>E6petershttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATIONDREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-11T04:02:39Z<p>A2chanan: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning (RL) is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning. It refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behavior over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. Along with the latent space representation, an actor-critic model is used to learn the reaction and optimize the behavior of the agent. The proposed method is based on model-free RL with latent state representation that is learned via prediction. The term "model-free" in RL refers to not having an explicit model of the environment and its dynamics - there is still a model of the agent being learned. The authors have changed the belief representations to learn a critic, or value function, directly on latent state samples which help to enable scaling to more complex tasks.<br />
<br />
<br />
<br />
<br />
The main finding of the paper is that long-horizon behaviors can be learned by latent imagination. This avoids the short sightedness that comes with using finite imagination horizons. The authors have also managed to demonstrate empirical performance for visual control by evaluating the model on image inputs.<br />
<br />
[[File:Figure1 paper.png|100px|center]]<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly <math>Q_{\pi}</math> gives the value of the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>. <br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
[[File:rl_loop.png|350px|center]]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the amount of actions an agent is able to take while learning is limited, for example due to computational resource limitations, sensitivity of the environment, or physical resource constraints. Thus, it is difficult to sufficiently interact with the environment until an accurate representation of the world is learned. The proposed method in this paper aims to solve this problem by "imagining" the states and rewards that an action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world with the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. From a high-level perspective, Dreamer first learns latent dynamics from past experience. Then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px | center]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of :<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
The main three components of agent learning in imagination are dynamics learning, behaviour learning, and environment interaction. In the compact latent space of the world model, the behaviour is learned by predicting hypothetical trajectories. Throughout the agent's lifetime, Dreamer performs the following operations either in parallel or interleaved as shown in Figure 3 and Algorithm 1:<br />
<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behaviour Learning: In the latent space, the agent predicts state values and actions that maximize future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:ashraf98.png|frameless|700px|Dreamer algorithm|center]]<br />
<br />
Notice that three neural networks are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively. The action model tries to solve the imagination environment by predicting various actions. Meanwhile, the value model estimates the expected rewards that the action model will achieve. Hence, these two models are trained cooperatively whereby the action model tries to maximize the estimated value while the value model gives the estimate based on the action model's actions.<br />
<br />
=== The Markovianity Question ===<br />
<br />
The paper formulates visual control as a so-called Partially Observable Markov Decision Processs (POMDP) in discrete time. Since the goal is for an agent to maximize its sum of rewards in a Markovian setting, this puts the model squarely in the category of reinforcement learning. In this subsection we provide a lengthier discussion on this Markovian assumption.<br />
<br />
Note that the transition distribution provided in the representation and transition models are Markovian in the states <math>s_t</math> and <math>a_t</math>. This mimics the dynamics in a non-linear Kalman filter and hidden Markov models. These techniques are described in the papers by Rabiner and Juang [5] as well as Kalman [6]. The difference with these presentations is that the latent dynamics are conditioned on actions and attempts to predict rewards, which allows the agent to imagine, yet not execute, actions in the provided environment.<br />
<br />
This short memory assumption is useful from a computational perspective as it allows for the problem to be tractable. It is also realistic, as an intelligent agent does not need the entire history of their environment going back all the way to the Big Bang to understand a situation they have not encountered before. We commend the team at UofT and Google Brain for this insight, as it makes their analysis reasonable and easy to understand.<br />
<br />
<br />
== Related Works ==<br />
<br />
Previous Works that exploited latent dynamics can be grouped in 3 sections:<br />
<br />
* Visual Control with latent dynamics by derivative-free policy learning or online planning.<br />
* Augment model-free agents with multi-step predictions.<br />
* Use analytic gradients of Q-values.<br />
<br />
While the later approaches are often for low-dimensional tasks, Dreamer uses analytic gradients to efficiently learn long-horizon behaviours for visual control purely by latent imagination.<br />
<br />
== Results ==<br />
The experiments were performed on 20 different control tasks of Deepmind Control Suite [7]. In the following picture we can see the reward vs the environment steps for a few of the experiments. As we can see the Dreamer outperforms other baseline algorithms. Moreover, the convergence is a lot faster in the Dreamer algorithm. <br />
[[File:dreamer.paper19.png|center|frameless|500px|Rewards vs environment steps of Dreamer and other baseline algorithms]]<br />
<br />
<br />
The figure below summarizes Dreamer's performance compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyper parameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data-efficiency, computation time, and final performance. Overall, it achieves the most consistent performance among competing algorithms. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviours with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|center|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
The performance is of Dreamer is also evaluated against state of the art reinforcement learning agents, which is shown below<br />
[[File:CaptureDream.PNG|frameless|center|500px]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. For example, consider a reinforcement learning agent who learns how to perform rare surgeries without enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. <br />
<br />
As future work on representation learning, the ability to scale latent imagination to higher visual complexity environments can be investigated.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at https://github.com/google-research/dreamer. <br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
The model components in Appendix A have mentioned that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for distributions in latent space. The paper would have benefitted from pointing out how come they have come up with this architecture as their model. In other words, how the latent vector determines the performance of the agent.<br />
<br />
Another fact about Dreamer is that it learns long-horizon behaviours purely by latent imagination, unlike previous approaches. It is also applicable to tasks with discrete actions and early episode termination.<br />
<br />
Learning a policy from visual inputs is a quite interesting research approach in RL. This paper steps in this direction by improving existing model-based methods (the world models and PlaNet) using the actor-critic approach. However, their method was an incremental contribution as back-propagating gradients through values and dynamics has been studied in previous works.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviours by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.<br />
<br />
[5] Rabiner, Lawrence, and B. Juang. "An introduction to hidden Markov models." IEEE ASSP magazine 3.1 (1986): 4-16.<br />
<br />
[6] Kalman, Rudolph Emil. "A new approach to linear filtering and prediction problems." (1960): 35-45.<br />
<br />
[7] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.</div>Byyouhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learningorthogonal gradient descent for continual learning2020-11-11T02:11:17Z<p>P2torabi: /* Orthogonal Gradient Descent */</p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Presented By == <br />
Parsa Torabian<br />
<br />
== Introduction == <br />
Neural Networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN to look at only one label of CIFAR10 at a time results in poor performance for the initially trained labels (catastrophic forgetting). But that same CNN will perform really well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real-world applications of machine learning. For example, a medical imaging classifier might be able to classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases - like local/rare/or new diseases (like Covid-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject that can help to get acquainted with the subject ([4], [9], [10] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations. One of the downsides of this method is the growing size of the model with an increasing number of tasks. There are also regularization based methods, which constraints weight updates according to some important measure for previous tasks. Finally, there are the repetition based methods. These models attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning. This can be done by using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight to OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto an orthogonal basis for the space of previously optimal gradients. Such an orthogonal basis will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, they have also done experiments where they only keep track of the gradient with respect to the ground truth logit (ODG-GTL) and with the logits averaged (OGD-AVE). OGD-ALL keeps track of gradients of dimension N*C where N is the size of the previous task and C is the number of classes. OGD-AVE and OGD-GTL only store gradients of dimension N since the class logits are either averaged or ignored respectively. To further manage memory, the authors sample from all the gradients of the old task, and they find that 200 is sufficient - with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt, or more numerically stable equivalent (see Appendix at the end of this summary).<br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png|centre]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px|centre]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, ODG-GTL, OGD-ALL are compared to SGD, ECW [2], (a regularization method using Fischer information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground truth "cheat" model which has access to all data throughout training). The experiments were performed for the following three continual learning benchmarks: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task is a fixed permutation that gets applied to each MNIST digit. The below figure shows the performance comparison of different methods when applied on the permuted MNIST. The comparison is made based on accuracy across 3 different tasks. Training is done for 15 epochs (5 for each of the three permutations). The switch in permutations is indicated in the graph with verticle lines.<br />
<br />
[[File:PMNIST_perf.PNG|centre]]<br />
<br />
The following tables show classification performance for each task after sequentially training on all the tasks. Thus, if solved catastrophic forgetting has been solved, the accuracies should be constant across tasks. If not, then there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG|centre]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. The following figure shows the accuracies of different methods when trained on Rotated MNIST with different degrees. Each method is trained for 10 epochs (5 on standard MNIST and 5 on rotated MNIST) and predictions are made over the original MNIST. Each accuracy bar is a mean over 10 runs.<br />
<br />
[[File:RMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:RMNIST.PNG|centre]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. The following figure shows the accuracies of different methods when trained on Split MNIST.<br />
<br />
[[File:SMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:SMNIST.PNG|centre]]<br />
<br />
Also, the below table corresponds to the performance of Rotated MNIST and Permuted MNIST as a function of the number of gradients stored.<br />
<br />
[[File:ogd.png|centre]]<br />
<br />
Overall OGD performs much better than ECW, A-GEM, and SGD. The primary metric to look for is decreasing performance in the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe a decrease as naively doing SGD. OGD performs much better than ECW and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small, in order to respect the assumption that orthogonal gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future studies based on this technique. The authors are able to reduce the storage size in practice by computing gradients with respect to the average of the logits and storing only a subset of the gradients in each task. Finding additional ways to store more gradients or preauthorize the important directions can result in improved results. Secondly, all the proposed methods including this method fail when the tasks are dissimilar. Finding ways to maintain performance under task dissimilarity can be an interesting research direction. Thirdly, solving for learning rate sensitivity will make this method more appealing when a large learning rate is desired. Finally, another interesting future work is extending the current method to other types of optimizers such as Adam and Adagrad or even second or even quasi-Newton methods.<br />
<br />
One interesting way for increasing the learning rate can be considering the gradient magnitude of the parameters for data of the former task. If for some specific parameters, the gradient magnitude for data of task A is low then intuitively it means they have not captured a high amount of information from task A. Having this in mind, at least we can increase the learning rate for updating these weights so that we can use them for task B.<br />
<br />
A valuable resource for continual learning is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Critique == <br />
The authors proposed an interesting idea for mitigating catastrophic forgetting likely to happen in the online learning setting. Although Orthogonal Gradient Descent achieves state-of-the-art results in practice for continual learning, they have not provided a theoretical guarantee. [12] have derived the first generalization guarantees for the algorithm OGD for continual learning, for overparameterized neural networks. [12] also showed that OGD is only robust to catastrophic forgetting across a single task while for the arbitrary number of tasks they have proposed OGD+.<br />
<br />
== Appendix: Numerical Instability of Gram-Schmidt ==<br />
The normal Gram-Schmidt procedure is known to suffer from poor numerical accuracy, and there exists alternatives with better numerical properties. One such algorithm which can be utilized to improve numerical stability is the modified Gram-Schmidt Orthogonalisation. The issue with the simpler Gram-Schmidt algorithm can be seen in the following:<br />
<br />
Let <math>A</math> be a real square matrix; this matrix accepts a QR decomposition, namely <math>A=\hat{Q}\hat{R}</math>, where <math>Q</math> is orthogonal and <math>R</math> is upper triangular. The prove of existence of a QR decomposition can be obtained using the Gram-Schmidt algorithm. During the algorithm, columns of <math>\hat{Q}</math> are solved sequentially, where <math>\hat{\vec{q_j}}</math> is the <math>j^{th}</math> column of <math>\hat{Q}</math>, and <math>\hat{r_{ij}}</math> which is the <math>i^{th}</math> row and <math>j^{th}</math> column of <math>\hat{R}</math> are solved from left to right and top to bottom for only the elements <math>\hat{R}</math> to result in a upper triangular matrix. Consider when we are calculating the third column of <math>\hat{Q}</math> as follows: <math>\hat{\vec{q_{3}}}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} - (\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>. <math> \vec{z_3}=\vec{a_3} - (\hat{\vec{q_1}}\vec{a_3})\hat{\vec{q_1}} </math> should not have a component in direction <math> \hat{\vec{q_1}}</math>, however, due to numerical stability and catastrophic cancellation [11] this is not always true. The partial result <math>\vec{z_3}</math> ends up having a component in this direction, this leads to a loss in orthogonality in the columns of <math>\hat{Q}</math>. To remedy this problem, the modified Gram-Schmidt algorithm replaces <math>\vec{a_3}</math> with <math>\vec{z_3}</math> in <math>(\hat{\vec{q_2}}\vec{a_3})\hat{\vec{q_2}}</math>, this helps in ensuring the orthogonality of the columns of <math>\hat{Q}</math> to any loss of numerical significance since we will be orthogonalizing with the vector which already has the loss of significance.<br />
<br />
Note that this procedure can be trivially extended to complex square matrices, but in this case the matrix <math>Q</math> becomes unitary; i.e. <math>Q^* Q = QQ* = I</math>; this yields an easy extension of the orthogonal gradient descent algorithm for complex neural networks.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.<br />
<br />
[11] Wikipedia: https://en.wikipedia.org/wiki/Loss_of_significance<br />
<br />
[12] Bennani, Mehdi Abbana, and Masashi Sugiyama. "Generalisation guarantees for continual learning with orthogonal gradient descent." arXiv preprint arXiv:2006.11942 (2020).</div>P2torabihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Graph_Structure_of_Neural_NetworksGraph Structure of Neural Networks2020-11-10T03:31:53Z<p>M59jiang: /* Introduction */</p>
<hr />
<div>= Presented By =<br />
<br />
Xiaolan Xu, Robin Wen, Yue Weng, Beizhen Chang<br />
<br />
= Introduction =<br />
<br />
A deep neural network is composed of neurons organized into layers and the connections between them. Neural network architecture can be captured by its "computational graph," where neurons are represented as nodes that perform mathematical computations and directed edges that link neurons in different layers. This graphical representation demonstrates how the network transmits and transforms information through its input neurons through the hidden layers and ultimately to the output neurons. <br />
<br />
However, few researchers have focused on the relationship between neural networks and their predictive performance, which is the main focus of this paper. The aim is to help explain how the addition or deletion of layers, their links, and the number of nodes can impact the performance of the network.<br />
<br />
In Neural Network research, it is often important to build a relation between a neural network’s accuracy and its underlying graph structure. It would contribute to building more efficient and more accurate neural network architectures. A natural choice is to use a computational graph representation. Doing so, however, has many limitations, including: <br />
<br />
(1) Lack of generality: Computational graphs have to follow allowed graph properties, which limits the use of the rich tools developed for general graphs. <br />
<br />
(2) Disconnection with biology/neuroscience: Biological neural networks have a much more complicated and less standardized structure. For example, there might be information exchanges in brain networks. It isn't easy to represent these models with directed acyclic graphs. This disconnection between biology/neuroscience makes knowledge less transferable and interdisciplinary research more difficult.<br />
<br />
Thus, the authors developed a new way of representing a neural network as a graph, called a relational graph. The new representation's key insight is to focus on message exchange, rather than just on directed data flow. For example, for a fixed-width fully-connected layer, an input channel and output channel pair can be represented as a single node, while an edge in the relational graph can represent the message exchange between the two nodes. Under this formulation, using the appropriate message exchange definition, it can be shown that the relational graph can represent many types of neural network layers.<br />
<br />
WS-flex is a graph generator that allows systematically exploring the design space of neural networks. Neural networks are characterized by the clustering coefficient and average path length of their relational graphs under neuroscience insights. Based on the insights from neuroscience, the authors characterize neural networks by the clustering coefficient and average path lengths of their relational graphs.<br />
<br />
= Neural Network as Relational Graph =<br />
<br />
The author proposes the concept of relational graphs to study the graphical structure of neural network. Each relational graph is based on an undirected graph <math>G =(V; E)</math>, where <math>V =\{v_1,...,v_n\}</math> is the set of all the nodes, and <math>E \subseteq \{(v_i,v_j)|v_i,v_j\in V\}</math> is the set of all edges that connect nodes. Note that for the graph used here, all nodes have self edges, that is <math>(v_i,v_i)\in E</math>. Graphically, a self edge connects an edge to itself, which forms a loop.<br />
<br />
To build a relational graph that captures the message exchange between neurons in the network, we associate various mathematical quantities to the graph <math>G</math>. First, a feature quantity <math>x_v</math> is associated with each node. The quantity <math>x_v</math> might be a scalar, vector or tensor depending on what type of neural network it is (see the Table at the end of the section). Then a message function <math>f_{uv}(·)</math> is associated with every edge in the graph. A message function specifically takes a node’s feature as the input and then outputs a message. An aggregation function <math>{\rm AGG}_v(·)</math> then takes a set of messages (the outputs of the message function) and outputs the updated node feature. <br />
<br />
A relational graph is a graph <math>G</math> associated with several message exchange rounds, which transform the feature quantity <math>x_v</math> with the message function <math>f_{uv}(·)</math> and the aggregation function <math>{\rm AGG}_v(·)</math>. At each round of message exchange, each node sends messages to its neighbors and aggregates incoming messages from its neighbors. Each message is transformed at each edge through the message function, then they are aggregated at each node via the aggregation function. Suppose we have already conducted <math>r-1</math> rounds of message exchange, then the <math>r^{th}</math> round of message exchange for a node <math>v</math> can be described as<br />
<br />
<div style="text-align:center;"><math>\mathbf{x}_v^{(r+1)}= {\rm AGG}^{(r)}(\{f_v^{(r)}(\textbf{x}_u^{(r)}), \forall u\in N(v)\})</math></div> <br />
<br />
where <math>\mathbf{x}^{(r+1)}</math> is the feature of the <math>v</math> node in the relational graph after the <math>r^{th}</math> round of update. <math>u,v</math> are nodes in Graph <math>G</math>. <math>N(u)=\{u|(u,v)\in E\}</math> is the set of all the neighbor nodes of <math>u</math> in graph <math>G</math>.<br />
<br />
To further illustrate the above, we use the basic Multilayer Perceptron (MLP) as an example. An MLP consists of layers of neurons, where each neuron performs a weighted sum over scalar inputs and outputs, followed by some non-linearity. Suppose the <math>r^{th}</math> layer of an MLP takes <math>x^{(r)}</math> as input and <math>x^{(r+1)}</math> as output, then a neuron computes <br />
<br />
<div style="text-align:center;"><math>x_i^{(r+1)}= \sigma(\Sigma_jw_{ij}^{(r)}x_j^{(r)})</math>.</div> <br />
<br />
where <math>w_{ij}^{(r)}</math> is the trainable weight and <math>\sigma</math> is the non-linearity function. Let's first consider the special case where the input and output of all the layers <math>x^{(r)}</math>, <math>1 \leq r \leq R </math> have the same feature dimensions <math>d</math>. In this scenario, we can have <math>d</math> nodes in the Graph <math>G</math> with each node representing a neuron in MLP. Each layer of neural network will correspond with a round of message exchange, so there will be <math>R</math> rounds of message exchange in total. The aggregation function here will be the summation with non-linearity transform <math>\sigma(\Sigma)</math>, while the message function is simply the scalar multipication with weight. A fully-connected, fixed-width MLP layer can then be expressed with a complete relational graph, where each node <math>x_v</math> connects to all the other nodes in <math>G</math>, that is neighborhood set <math>N(v) = V</math> for each node <math>v</math>. The figure below shows the correspondence between the complete relation graph with a 5-layer 4-dimension fully-connected MLP.<br />
<br />
<div style="text-align:center;">[[File:fully_connnected_MLP.png]]</div><br />
<br />
In fact, a fixed-width fully-connected MLP is only a special case under a much more general model family, where the message function, aggregation function, and most importantly, the relation graph structure can vary. The different relational graph will represent the different topological structure and information exchange pattern of the network, which is the property that the paper wants to examine. The plot below shows two examples of non-fully connected fixed-width MLP and their corresponding relational graphs. <br />
<br />
<div style="text-align:center;">[[File:otherMLP.png]]</div><br />
<br />
We can generalize the above definitions for fixed-width MLP to Variable-width MLP, Convolutional Neural Network (CNN), and other modern network architecture like Resnet by allowing the node feature quantity <math>\textbf{x}_j^{(r)}</math> to be a vector or tensor respectively. In this case, each node in the relational graph will represent multiple neurons in the network, and the number of neurons contained in each node at each round of message exchange does not need to be the same, which gives us a flexible representation of different neural network architecture. The message function will then change from the simple scalar multiplication to either matrix/tensor multiplication or convolution. And the weight matrix will change to the convolutional filter. The representation of these more complicated networks are described in details in the paper, and the correspondence between different networks and their relational graph properties is summarized in the table below. <br />
<br />
<div style="text-align:center;">[[File:relational_specification.png]]</div><br />
<br />
Overall, relational graphs provide a general representation for neural networks. With proper definitions of node features and message exchange, relational graphs can represent diverse neural architectures, thereby allowing us to study the performance of different graph structures.<br />
<br />
= Exploring and Generating Relational Graphs=<br />
<br />
We will deal with the design and how to explore the space of relational graphs in this section. There are three parts we need to consider:<br />
<br />
(1) '''Graph measures''' that characterize graph structural properties:<br />
<br />
We will use one global graph measure, average path length, and one local graph measure, clustering coefficient.<br />
To explain clearly, average path length measures the average shortest path distance between any pair of nodes. The clustering coefficient measures the proportion of edges between the nodes within a given node’s neighborhood, divided by the number of edges that could possibly exist between them, averaged over all nodes.<br />
<br />
(2) '''Graph generators''' that can generate the diverse graph:<br />
<br />
With selected graph measures, we use a graph generator to generate diverse graphs to cover a large span of graph measures. To figure out the limitation of the graph generator and find out the best, we investigate some generators including ER, WS, BA, Harary, Ring, Complete graph and results are shown below:<br />
<br />
<div style="text-align:center;">[[File:3.2 graph generator.png]]</div><br />
<br />
Thus, from the picture, we could obtain the WS-flex graph generator that can generate graphs with a wide coverage of graph measures. Notably, WS-flex graphs almost encompass all the graphs generated by classic random generators mentioned above.<br />
<br />
(3) '''Computational Budget''' that we need to control so that the differences in performance of different neural networks are due to their diverse relational graph structures.<br />
<br />
It is important to ensure that all networks have approximately the same complexities so that the differences in performance are due to their relational graph structures when comparing neutral work by their diverse graph.<br />
<br />
We use floating point operations (FLOPS) which measures the number of multiply-adds as the metric. We first compute the FLOPS of our baseline network instantiations (i.e. complete relational graph) and use them as the reference complexity in each experiment. From the description in section 2, a relational graph structure can be instantiated as a neural network with variable width. Therefore, we can adjust the width of a neural network to match the reference complexity without changing the relational graph structures.<br />
<br />
= Experimental Setup =<br />
The author studied the performance of 3942 sampled relational graphs (generated by WS-flex from the last section) of 64 nodes with two experiments: <br />
<br />
(1) CIFAR-10 dataset: 10 classes, 50K training images, and 10K validation images<br />
<br />
Relational Graph: all 3942 sampled relational graphs of 64 nodes<br />
<br />
Studied Network: 5-layer MLP with 512 hidden units<br />
<br />
The model was trained for 200 epochs with batch size 128, using cosine learning rate schedule with an initial learning rate of 0.1<br />
<br />
<br />
(2) ImageNet classification: 1K image classes, 1.28M training images and 50K validation images<br />
<br />
Relational Graph: Due to high computational cost, 52 graphs are uniformly sampled from the 3942 available graphs.<br />
<br />
Studied Network: <br />
*ResNet-34, which only consists of basic blocks of 3×3 convolutions (He et al., 2016)<br />
<br />
*ResNet-34-sep, a variant where we replace all 3×3 dense convolutions in ResNet-34 with 3×3 separable convolutions (Chollet, 2017)<br />
<br />
*ResNet-50, which consists of bottleneck blocks (He et al., 2016) of 1×1, 3×3, 1×1 convolutions<br />
<br />
*EfficientNet-B0 architecture (Tan & Le, 2019)<br />
<br />
*8-layer CNN with 3×3 convolution<br />
<br />
= Results and Discussions =<br />
<br />
The paper summarizes the result of the experiment among multiple different relational graphs through sampling and analyzing and list six important observations during the experiments, These are:<br />
<br />
* There always exists a graph structure that has higher predictive accuracy under Top-1 error compared to the complete graph<br />
<br />
* There is a sweet spot such that the graph structure near the sweet spot usually outperforms the base graph<br />
<br />
* The predictive accuracy under top-1 error can be represented by a smooth function of Average Path Length <math> (L) </math> and Clustering Coefficient <math> (C) </math><br />
<br />
* The Experiments are consistent across multiple datasets and multiple graph structures with similar Average Path Length and Clustering Coefficient.<br />
<br />
* The best graph structure can be identified easily.<br />
<br />
* There are similarities between the best artificial neurons and biological neurons.<br />
<br />
<br />
<br />
[[File:Result2_441_2020Group16.png |1000px]]<br />
<br />
$$\text{Figure - Results from Experiments}$$<br />
<br />
<br />
'''Neural networks performance depends on its structure'''<br />
<br />
During the experiment, Top-1 errors for all sampled relational graph among multiple tasks and graph structures are recorded. The parameters of the models are average path length and clustering coefficient. Heat maps were created to illustrate the difference in predictive performance among possible average path length and clustering coefficient. In '''Figure - Results from Experiments (a)(c)(f)''', The darker area represents a smaller top-1 error which indicates the model performs better than the light area.<br />
<br />
Compared to the complete graph which has parameter <math> L = 1 </math> and <math> C = 1 </math>, the best performing relational graph can outperform the complete graph baseline by 1.4% top-1 error for MLP on CIFAR-10, and 0.5% to 1.2% for models on ImageNet. Hence it is an indicator that the predictive performance of the neural networks highly depends on the graph structure, or equivalently that the completed graph does not always have the best performance.<br />
<br />
<br />
'''Sweet spot where performance is significantly improved'''<br />
<br />
It had been recognized that training noises often results in inconsistent predictive results. In the paper, the 3942 graphs in the sample had been grouped into 52 bins, each bin had been colored based on the average performance of graphs that fall into the bin. By taking the average, the training noises had been significantly reduced. Based on the heat map '''Figure - Results from Experiments (f)''', the well-performing graphs tend to cluster into a special spot that the paper called “sweet spot” shown in the red rectangle, the rectangle is approximately included clustering coefficient in the range <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math>.<br />
<br />
<br />
'''Relationship between neural network’s performance and parameters'''<br />
<br />
When we visualize the heat map, we can see that there is no significant jump of performance that occurred as a small change of clustering coefficient and average path length ('''Figure - Results from Experiments (a)(c)(f)'''). In addition, if one of the variables is fixed in a small range, it is observed that a second-degree polynomial can be used to fit and approximate the data sufficiently well ('''Figure - Results from Experiments (b)(d)'''). Therefore, both the clustering coefficient and average path length are highly related to neural network performance by a convex U-shape. <br />
<br />
<br />
'''Consistency among many different tasks and datasets'''<br />
<br />
They observe that relational graphs with certain graph measures may consistently perform well regardless of how they are instantiated. The paper presents consistency uses two perspectives, one is qualitative consistency and another one is quantitative consistency.<br />
<br />
(1) '''Qualitative Consistency'''<br />
It is observed that the results are consistent from different points of view. Among multiple architecture dataset, it is observed that the clustering coefficient within <math>[0.1,0.7]</math> and average path length within <math>[1.5,3]</math> consistently outperform the baseline complete graph. <br />
<br />
(2) '''Quantitative Consistency'''<br />
Among different dataset with the network that has similar clustering coefficient and average path length, the results are correlated, The paper mentioned that ResNet-34 is much more complex than 5-layer MLP but a fixed set relational graph would perform similarly in both settings, with Pearson correlation of <math>0.658</math>, the p-value for the Null hypothesis is less than <math>10^{-8}</math>.<br />
<br />
<br />
'''Top architectures can be identified efficiently'''<br />
<br />
The computation cost of finding top architectures can be significantly reduced without training the entire data set for a large value of epoch or a relatively large sample. To achieve the top architectures, the number of graphs and training epochs need to be identified. For the number of graphs, a heatmap is a great tool to demonstrate the result. In the 5-layer MLP on CIFAR-10 example, taking a sample of the data around 52 graphs would have a correlation of 0.9, which indicates that fewer samples are needed for a similar analysis in practice. When determining the number of epochs, correlation can help to show the result. In ResNet34 on ImageNet example, the correlation between the variables is already high enough for future computation within 3 epochs. This means that good relational graphs perform well even at the<br />
initial training epochs.<br />
<br />
<br />
'''Well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks'''<br />
<br />
The way we define relational graphs and average length in the graph is similar to the way information is exchanged in network science. The biological neural network also has a similar relational graph representation and graph measure with the best-performing relational graph.<br />
<br />
While there is some organizational similarity between a computational neural network and a biological neural network, we should refrain from saying that both these networks share many similarities or are essentially the same with just different substrates. The biological neurons are still quite poorly understood and it may take a while before their mechanisms are better understood.<br />
<br />
= Conclusions=<br />
Our works provided a new viewpoint about combining the fields of graph neural networks (GNNs) and general architecture design by establishing graph structures. The authors believe that the model helps to capture the diverse neural network architectures under a unified framework. In particular, GNNs are instances of general neural architectures where graph structures are seen as input rather than part of the architecture, and message functions are shared across all edges of the graph. We have proved that other scientific disciplines, such as network science, neuroscience, etc., provide graphics techniques and methods to help understand and design deep neural networks. This could provide effective help and inspiration for the study of neural networks.<br />
<br />
= Critique =<br />
<br />
1. The experiment is only measuring on a single data set which might not be representative enough. As we can see in the whole paper, the "sweet spot" we talked about might be a special feature for the given data set only which is the CIFAR-10 data set. If we change the data set to another imaging data set like CK+, whether we are going to get a similar result is not shown by the paper. Hence, the result that is being concluded from the paper might not be representative enough. <br />
<br />
2. When we are fitting the model in practice, we will fit the model with more than one epoch. The order of the model fitting should be randomized since we should create more random jumps to avoid staying inside a local minimum. With the same order within each epoch, the data might be grouped by different classes or levels, the model might result in a better performance with certain classes and worse performance with other classes. In this particular example, without randomization of the training data, the conclusion might not be precise enough.<br />
<br />
3. This study shows empirical justification for choosing well-performing models from graphs differing only by average path length and clustering coefficient. An equally important question is whether there is a theoretical justification for why these graph properties may (or may not) contribute to the performance of a general classifier - for example, is there a combination of these properties that is sufficient to recover the universality theorem for MLP's?<br />
<br />
4. It might be worth looking into how to identify the "sweet spot" for different datasets.<br />
<br />
5. What would be considered a "best graph structure " in the discussion and conclusion part? It seems that the intermediate result of getting an accurate result was by binning graphs into smaller bins, what should we do if the graphs can not be binned into significantly smaller bins in order to proceed with the methodologies mentioned in the paper. Both CIFAR - 10 and ImageNet seem too trivial considering the amount of variation and categories in the dataset. What would the generalizability be to other presentations of images?<br />
<br />
6. There is an interesting insight that the idea of the relational graph is similar to applying causal graphs in neuro networks, which is also closer to biology and neuroscience because human beings learning things based on causality. This new approach may lead to higher prediction accuracy but it needs more assumptions, such as correct relations and causalities.<br />
<br />
7. This is an interesting topic that uses the knowledge in graph theory to introduce this new structure of Neural Networks. Using more data sets to discuss this approach might be more interesting, such as the MNIST dataset. We think it is interesting to discuss whether this structure will provide a better performance compare to the "traditional" structure of NN in any type of Neural Networks.<br />
<br />
8. The key idea is to treat a neural network as a relational graph and investigate the messages exchange over the graphs. It will be interesting to discuss how much improvement a sweet spot can bring to the network.<br />
<br />
9. It is interesting to see how figure "b" in the plots showing the experiment results has a U-shape. This shows a correlation between message exchange efficiency and capability of learning distributed representations, giving an idea about the tradeoff between them, as indicated in the paper.<br />
<br />
10. From Results and Discussion part, we can find learning at a very abstract level of artificial neurons and biological neurons is similar. But the learning method is different. Artificial neural networks use gradient descent to minimize the loss function and reach the global minimum. Biological neural networks have another learning method.<br />
<br />
11. It would be interesting to see if the insights from this paper can be used to create new kinds of neural network architectures which were previously unknown, or to create a method of deciding which neural network to use for a given situation or data set. It may be interesting to compare the performance of neural networks by a property of their graph (like average path length as described in this paper) versus a property of the data (such as noisiness).<br />
<br />
12. It would be attractive if the paper dig more into the “sweet pot” and give us a brief introduction of it rather than just slightly mention it. It would be a little bit confused if this technical term shows up and without any previous mention or introduction<br />
<br />
13. The paper shows graphical results how well performing graphs cluster into ‘sweet spots’. And the ‘sweet spot’ leads to significant performance improvement of the graph. It would be interesting if the author could provide more justification for the experimental setup chosen and also explore if known graph theorems have any relation to this improved performance.<br />
<br />
14. GNN is really interesting, one application I can think of is it can be used for recommending system like Linkedin and Facebook. Since people who knows each other is really easy represented as graph.<br />
<br />
15. At the end of the paper, it could also talk about the applications of relational graph, the developed graph-based representation of neural networks. For example, structural scenarios that data has relational structure like atoms and molecules, and non-structural scenarios that data doesn’t have relational structure like images and texts.</div>Y27wenghttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classificationa fair comparison of graph neural networks for graph classification2020-11-09T21:32:26Z<p>Gsikri: /* Comparison with Published Results */</p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
<br />
Experimental reproducibility in machine learning has been known to be an issue for some time. Researchers attempting to reproduce the results of old algorithms have come up short, raising concerns that lack of reproducibility hurts the quality of the field. Lack of open source AI code has only exacerbated this, leading some to go so far as to say that "AI faces a reproducibility crisis" [1]. It has been argued that the ability to reproduce existing AI code, and making these codes and new ones open source is a key step in lowering the socio-economic barriers of entry into data science and computing. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce the results from such experiments to tackle the problem of ambiguity in experimental procedures and the impossibility of reproducing results. They also Standardized the experimental environment so that the results could be reproduced while using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models that take graph-structured data as input and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbors. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
They provide a convenient way for node level, edge level, and graph level prediction task. The intuition of GNN is that nodes are naturally defined by their neighbors and connections. A typical application of GNN is node classification. Essentially, every node in the graph is associated with a label, and we want to predict the label of the nodes without ground-truth.<br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural.<br />
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and as previously mentioned are comprised of two building blocks, vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can also have a direction associated with them lending the name '''directed graph''' or they can be an '''undirected graph''' if an edge is shared by two vertices and there is no sense of direction. Vertices and edges of a graph can also have weights to them or really any amount of features imaginable. <br />
<br />
Now going one level of abstraction higher graphs can be categorized by structural patterns, we will refer to these as the types of graphs and this will not be an exhaustive list. A '''Bipartite graph''' (a) is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> and there does not exist, <math> v_i,v_j \in V_k </math> where <math>k=1,2</math> s.t. <math>v_i</math> and <math>v_j </math> share an edge, however, <math>\exists v_i \in V_1, v_j \in V_2</math> where <math>v_i</math> and <math>v_j </math> share an edge. A '''Path graph''' (b) is a graph where, <math>|V| \geq 2</math> and all vertices are connected sequentially meaning each vertex except the first and last have 2 edges, one coming from the previous vertex and one going to the next vertex. A '''Cycle graph''' (c) is similar to a path graph except each node has 2 edges and are connected in a loop, meaning if you start at any vertex and follow an edge of each node going in one direction it will eventually lead back to the starting node. These are just three examples of graph types in reality there are many more and it can beneficial to be able to connect the structure of ones data to an appropriate graph type.<br />
<br />
<gallery mode="packed"><br />
Image:bipartite.png| (a) Bipartite Graph<br />
Image:path.gif| (b) Path Graph<br />
Image:cycle.png| (c) Cycle Graph<br />
</gallery><br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assessment<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross-Validation.<br />
As the model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection<br />
<br />
The goal of model selection, or hyperparameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set. It also important to acknowledge the selection bias when selecting a model as this makes the validation accuracy of a selected model from a pool of candidates models a biased test accuracy.<br />
<br />
==Overview of Reproducibility Issues==<br />
The paper explores five different GNN models exploring issues with their experimental setup and potential reproducibility. <br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criteria is met or not), (-) indicates lack of information (i.e. no details are provided about the criteria).<br />
<br />
===Issues with DGCNN (Deep Graph Convolutional Neural Network)===<br />
The authors of DGCNN use a faulty method of tuning the learning rate and epoch. They used only a single fold for tuning hyperparameters despite evaluating the model on a 10-fold CV. This potentially leads to suboptimal performance. They haven't released the code for the experiments. Lastly, they average the one-fold CV across 10 folds and then report the numbers. This also reduces variance.<br />
<br />
=== Issues with DiffPoll === <br />
It has not been clearly stated in the paper whether the results come from a test set or if they come from a validation set. Moreover, the standard deviation over the 10-fold CV has also not been reported. Due to no random seeds, different data splits are there while performing multi-fold splits (without stratification).<br />
<br />
=== Issue with ECC ===<br />
The results of the paper do not report the standard deviation obtained during the 10-fold Cross-Validation. Like in the case of GDCNN, the model selection procedure is not made clear due to pre-determined hyper-parameters. The code repository is not available as well.<br />
<br />
=== Issues with GIN === <br />
Instead of reporting the test accuracy, the authors have given the validation accuracy over the 10-fold CV. Therefore, the given results are not suitable for evaluating the model. Code repository is not available for selecting the model.<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
a subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
as regards the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets but for ENZYMES, they follow Ralaivola et al. (2005); Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization Of the Evaluation Framework </div><br />
In order to better understand the Model Selection and the Model Assessment sections in the above figure, one can also take a look at the pseudo codes below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they had to limit some of the models by:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
===Effect of Node Degree on Layering===<br />
[[File:Paper11_NodeDegree.png]]<br />
<br />
The above table displays the median number of selected layers in relation to the addition of node<br />
degrees as input features on all social datasets. 1 indicates that an uninformative feature is used as<br />
a node label.<br />
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|center|700px|Image: 700pixels]]<br />
<br />
<br />
In the above figure, we can see the comparison between the average values of test results obtained by the authors of the paper and those reported in the literature. In addition to that, the validation results across the 10 different model selections were plotted. The plots show how the test accuracies calculated in this paper are in most cases different from what reported in the literature, and the gap between the two estimates is usually consistent. Also, the average validation accuracies calculated by the authors are always higher or equal to the test results.<br />
<br />
== Source Codes ==<br />
The data and scripts to reproduce the experiments reported in the paper are available at https://github.com/diningphil/gnn-comparison .<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of some important 5 graph neural network models on 9 datasets. The reproducibility and replicability problems are very important topics for science in general and even more important for fast-growing fields like machine learning. The authors proposed a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used for future graph classification papers such that the comparison between proposed methods becomes clearer. The results of the paper are interesting as in some cases the baseline methods outperform other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, this was a good idea to discuss in more depth why baseline models are performing better? Or why the results across different datasets are not consistent? Should we choose the best GNN based on the type of data? If so, what are the guidelines?<br />
<br />
Also as well known in the literature of GNNs that they are designed to solve the non-Euclidean problems on graph-structured data. This is kinds of problems are hardly be handled by general deep learning techniques and there are different types of designed graphs that handle various mechanisms i.e. heat diffusion mechanisms. In my opinion, there would a better way to categorize existing GNN models into spatial and spectral domains and reveal connections among subcategories in each domain. With the increase of the GNNs models, further analysis must be handled to establish a strong link across the spatial and spectral domains to be more interpretable and transparent to the application.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.<br />
de.<br />
<br />
- Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5165–5175. Curran Associates, Inc., 2018.<br />
<br />
[1] Hutson, M. (2018). Artificial intelligence faces a reproducibility crisis. Science, 359(6377), 725–726.</div>J6bhatiahttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_GeneralizationMeta-Learning For Domain Generalization2020-11-09T19:49:02Z<p>Gsikri: /* Mountain Car */</p>
<hr />
<div>== Presented by ==<br />
Parsa Ashrafi Fashi<br />
<br />
== Introduction ==<br />
<br />
This paper proposes a novel meta-learning method for domain generalization. Domain Shift problem addresses the problem where a model trained on a data distribution cannot perform well when tested on another domain with a different distribution. Domain Generalization tries to tackle this problem by producing models that can perform well on unseen target domains. Several approaches have been adapted for the problem, such as training a model for each source domain, extracting a domain agnostic representation, and semantic feature learning. Meta-Learning and specifically Model-Agnostic Meta-Learning models, which have been widely adopted recently, are models capable of adapting or generalizing to new tasks and new environments that have never been encountered during training time. Meta-learning is also known as "learning to learn". It aims to enable intelligent agents to take the principles they learned in one domain and apply them to other domains. One concrete meta-learning task is to create a game bot that can quickly master a new game. Hereby defining tasks as domains, the paper tries to overcome the problem in a model-agnostic way.<br />
<br />
== Previous Work ==<br />
There were 3 common approaches to Domain Generalization. The simplest way is to train a model for each source domain and estimate which model performs better on a new unseen target domain [1]. A second approach is to presume that any domain is composed of a domain-agnostic and a domain-specific component. By factoring out the domain-specific and domain-agnostic components during training on source domains, the domain-agnostic component can be extracted and transferred as a model that is likely to work on a new source domain [2]. Finally, a domain-invariant feature representation is learned to minimize the gap between multiple source domains and it should provide a domain-independent representation that performs well on a new target domain [3][4][5].<br />
<br />
== Method ==<br />
Let <math> S </math> and T be source and target domains in the DG setting, respectively. We define a single model parametrized as <math> \theta </math> to solve the specified task. DG aims for training <math> \theta </math> on the source domains, such that it generalizes to the target domains. At each learning iteration we split the original S source domains <math> S </math> into S−V meta-train domains <math> \bar{S} </math> and V meta-test domains <math> \breve{S} </math> (virtual-test domain). This is to mimic real train-test domain-shifts so that over many iterations we can train a model to achieve good generalization in the final-test evaluated on target domains <math>T</math> . <br />
<br />
The paper explains the method based on two approaches; Supervised Learning and Reinforcement Learning.<br />
<br />
=== Supervised Learning ===<br />
<br />
First, <math> l(\hat{y},y) </math> is defined as a cross-entropy loss function. ( <math> l(\hat{y},y) = -\hat{y}log(y) </math>). The process is as follows.<br />
<br />
==== Meta-Train ====<br />
The model is updated on S-V domains <math> \bar{S} </math> and the loss function is defined as: <math> F(.) = \frac{1}{S-V} \sum\limits_{i=1}^{S-V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
In this step the model is optimized by gradient descent like follows: <math> \theta^{\prime} = \theta - \alpha \nabla_{\theta} </math><br />
<br />
==== Meta-Test ====<br />
<br />
In each mini-batch the model is also virtually evaluated on the V meta-test domains <math>\breve{S}</math>. This meta-test evaluation simulates testing on new domains with different statistics, in order to allow learning to generalize across domains. The loss for the adapted parameters calculated on the meta-test domains is as follows: <math> G(.) = \frac{1}{V} \sum\limits_{i=1}^{V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta^{\prime}}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
The loss on the meta-test domain is calculated using the updated parameters <math>\theta' </math> from meta-train. This means that for optimization with respect to <math>G </math> we will need the second derivative with respect to <math>\theta </math>. <br />
<br />
==== Final Objective Function ====<br />
<br />
Combining the two loss functions, the final objective function is as follows: <math> argmin_{\theta} \; F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta)) </math>, where <math>\beta</math> represents how much meta-test weighs. Algorithm 1 illustrates the supervised learning approach. <br />
<br />
[[File:ashraf1.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Supervised Learning Approach.</div><br />
<br />
This supervised learning methodology has been schematically in figure 1.<br />
<br />
[[File:supervisedd.png |center]]<br />
<br />
=== Reinforcement Learning ===<br />
<br />
In application to the reinforcement learning (RL) setting, we now assume an agent with a policy <math> \pi </math> that inputs states <math> s </math> and produces actions <math> a </math> in a sequential decision making task: <math>a_t = \pi_{\theta}(s_t)</math>. The agent operates in an environment and its goal is to maximize its discounted return, <math> R = \sum\limits_{t} \delta^t R_t(s_t, a_t) </math> where <math> R_t </math> is the reward obtained at timestep <math> t </math> under policy <math> \pi </math> and <math> \delta </math> is the discount factor. What we have in supervised learning as tasks map to reward functions here and domains map to solving the same task (reward function) in a different environments. Therefore, domain generalization achieves an agent that is able to perform well even at new environments without any initial learning.<br />
==== Meta-Train ==== <br />
In meta-training, the loss function <math> F(·) </math>now corresponds to the negative discounted return <math> -R </math> of policy <math> \pi_{\theta} </math>, averaged over all the meta-training environments in <math> \bar{S} </math>. That is, <br />
\begin{align}<br />
F = \frac{1}{|\bar{S}|} \sum_{s \in \bar{S}} -R_s<br />
\end{align}<br />
<br />
Then the optimal policy is obtained by minimizing <math> F </math>.<br />
<br />
==== Meta-Test ====<br />
The step is like a meta-test of supervised learning and loss is again negative of return function. For RL calculating this loss requires rolling out the meta-train updated policy <math> \theta' </math> in the meta-test domains to collect new trajectories and rewards. The reinforcement learning approach is also illustrated completely in algorithm 2.<br />
[[File:ashraf2.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Reinforcement Learning Approach.</div><br />
<br />
==== Alternative Variants of MLDG ====<br />
The authors propose different variants of MLDG objective function. For example the so-called MLDG-GC is one that normalizes the gradients upon update to compute the cosine similarity. It is given by:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) + \beta G(\theta) - \beta \alpha \frac{F'(\theta) \cdot G'(\theta)}{||F'(\theta)||_2 ||G'(\theta)||_2}.<br />
\end{equation}<br />
<br />
Another one stops the update of the parameters after the meta-train has converged. This intuition gives the following objective function called MLDG-GN:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) - \beta ||G'(\theta) - \alpha F'(\theta)||_2^2<br />
\end{equation}.<br />
<br />
== Experiments ==<br />
<br />
The Proposed method is exploited in 4 different experiment results (2 supervised and 2 reinforcement learning experiments). <br />
<br />
=== Illustrative Synthetic Experiment ===<br />
<br />
In this experiment, nine domains by sampling curved deviations are synthesized from a diagonal line classifier. We treat eight of these as sources for meta-learning and hold out the last for the final test. Fig. 1 shows the nine synthetic domains which are related in form but differ in the details of their decision boundary. The results show that MLDG performs near perfect and the baseline model without considering domains overfits in the bottom left corner. The baselines for this experiment, as can be seen in Fig. 1, were MLP-All, MLDG, MLDG-GC, and MLDG-GN.<br />
<br />
[[File:ashraf3.jpg |center|500px|Image: 500 pixels]]<br />
<br />
<div align="center">Figure 1: Synthetic experiment illustrating MLDG.</div><br />
<br />
=== Object Detection === <br />
The PACS multi-domain recognition benchmark is exploited to address the object detection task; a dataset designed for the cross-domain recognition problems. This dataset has 7 categories (‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’) and 4 domains of different stylistic depictions (‘Photo’, ‘Art painting’, ‘Cartoon’ and ‘Sketch’). The diverse depiction styles provide a significant domain gap. The Result of the Current approach compared to other approaches is presented in Table 1. The baseline models are D-MTAE[5],Deep-All (Vanilla AlexNet)[2], DSN[6]and AlexNet+TF[2]. On average, the proposed method outperforms other methods. <br />
<br />
[[File:ashraf4.jpg |center|800px]]<br />
<br />
<div align="center">Table 1: Cross-domain recognition accuracy (Multi-class accuracy) on the PACS dataset. Best performance in bold. </div><br />
<br />
=== Cartpole ===<br />
<br />
The objective is to balance a pole upright by moving a cart. The action space is discrete – left or right. The state has four elements: the position and velocity of the cart and the angular position and velocity of the pole. There are two sub-experiments designed. In the first one, the domain factor is varied by changing the pole length. They simulate 9 domains with pole lengths. In the second they vary multiple domain factors – pole length and cart mass. In both experiments, we randomly choose 6 source domains for training and hold out 3 domains for (true) testing. Since the game can last forever, if the pole does not fall, we cap the maximum steps to 200. The result of both experiments is presented in Tables 2 and 3. The baseline methods are RL-All (Trains a single policy by aggregating the reward from all six source domains) RL-Random-Source (trains on a single randomly selected source domain) and RL-undo-bias: Adaptation of the (linear) undo-bias model of [7]. The proposed MLDG outperforms the baselines.<br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 2: Cart-Pole RL. Domain generalisation performance across pole length. Average reward testing on 3 held out domains with random lengths. Upper bound: 200. </div><br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 3: Cart-Pole RL. Generalization performance across both pole length and cart mass. Return testing on 3 held out domains with random length and mass. Upper bound: 200. </div><br />
<br />
=== Mountain Car ===<br />
<br />
In this classic RL problem, a car is positioned between two mountains, and the agent needs to drive the car so that it can hit the peak of the right mountain. The difficulty of this problem is that the car engine is not strong enough to drive up the right mountain directly. The agent has to figure out a solution of driving up the left mountain to first generate momentum before driving up the right mountain. The state observation in this game consists of two elements: the position and velocity of the car. There are three available actions: drive left, do nothing, and drive right. Here the baselines are the same as Cartpole. The model doesn't outperform the RL-undo-bias but has a close return value. The results are shown in Table 4.<br />
<br />
[[File:ashraf7.jpg |center|800px]]<br />
<br />
<div align="center">Table 4: Domain generalisation performance for mountain car. Failure rate (↓) and reward (↑) on held-out testing domains with random mountain heights. </div><br />
<br />
[[File:CaptureMeta.PNG |center|500px|Image: 500 pixels]]<br />
<br />
== Conclusion ==<br />
<br />
This paper proposed a model-agnostic approach to domain generalization. Unlike prior model-based domain generalization approaches, it scales well with the number of domains and it can also be applied to different Neural Network models. Experimental evaluation shows state-of-the-art results on a recent challenging visual recognition benchmark and promising results on multiple classic RL problems.<br />
<br />
== Source Code ==<br />
<br />
Four different implementations of this paper are publicly available at [https://paperswithcode.com/paper/learning-to-generalize-meta-learning-for#code link MLDG]<br />
<br />
== Critiques ==<br />
<br />
I believe that the meta-learning-based approach (MLDG) extending MAML to the domain generalization problem might have some limitation problems. The objective function of MAML is more applicable for fast task adaptation even it can be shown from the presented tasks in the paper. Also, in the generalization, we do not have access to samples from a new domain, so the MAML-like objective might lead to sub-optimal, as it is highly abstracted from the feature representations. In addition to this, it is hard to scale MLDG to deep architectures like Resnet as it requires differentiating through k iterations of optimization updates, which will lead to some limitations, so I would believe it will be more effective in task networks as it is much shallower than the feature networks.<br />
<br />
<br />
Why meta-learning makes the domain generalization to be domain agnostic? <br />
<br />
In the case that we have four domains, do we randomly pick two domains for meta-train and one for meta-test? if affirmative, because we select two domains out of the three for the meta train, it is likely to have similar meta-train domains between episodes, right?<br />
<br />
The paper would have benefited from demonstrating the strength of the MLDG in terms of embedding space in lower dimensions (TSNE, UMAP) for PACS and other datasets. It is unclear how well the algorithm would have performed domain agnostically on these datasets.<br />
<br />
== References ==<br />
<br />
[1]: [Xu et al. 2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In ECCV.<br />
<br />
[2]: [Li et al. 2017] Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2017. Deeper, broader, and artier domain generalization. In ICCV.<br />
<br />
[3]: [Muandet, Balduzzi, and Scholkopf 2013] ¨ Muandet, K.; Balduzzi, D.; and Scholkopf, B. 2013. Domain generalization via invariant feature representation. In ICML.<br />
<br />
[4]: [Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.<br />
<br />
[5]: [Ghifary et al. 2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In ICCV.<br />
<br />
[6]: [Bousmalis et al. 2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.<br />
<br />
[7]: [Khosla et al. 2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In ECCV.</div>Pashrafihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning-For-Domain_GeneralizationMeta-Learning-For-Domain Generalization2020-11-09T19:47:18Z<p>Pashrafi: Created page with "== Presented by == Parsa Ashrafi Fashi == Introduction == Domain Shift problem addresses the problem where a model trained on a data distribution cannot perform well when te..."</p>
<hr />
<div>== Presented by ==<br />
Parsa Ashrafi Fashi<br />
<br />
== Introduction ==<br />
<br />
Domain Shift problem addresses the problem where a model trained on a data distribution cannot perform well when tested on another domain with different distribution. Domain Generalization tries to tackle this problem by producing models that can perform well on unseen target domains. Several approaches have been adapted for the problem, such as training a model for each source domain, extract a domain agnostic component domains and semantic feature learning. Meta-Learning and specifically Model-Agnostic Meta-Learning which have been widely adapted recently, are models capable of adapting or generalizing to new tasks and new environments that have never been encountered during training time. Here by defining tasks as domains, the paper tries to overcome the problem in a model-agnostic way.<br />
<br />
<br />
== Previous Work ==<br />
<br />
<br />
== Method ==<br />
<br />
<br />
== Experiments ==<br />
<br />
<br />
== Results ==<br />
<br />
<br />
== Conclusion ==<br />
<br />
<br />
== Critiques ==<br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>Pashrafihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_ClassificationExtreme Multi-label Text Classification2020-11-09T10:40:54Z<p>P2torabi: </p>
<hr />
<div>== Presented By ==<br />
Mohan Wu<br />
<br />
== Introduction ==<br />
In this paper, the authors are interested a field of problems called extreme classification. These problems involve training a classifier to give the most relevant tags for any given text. The difficulties arise from the fact that the label set is so large so that most models would fail. The authors propose a new model called APLC-XLNet which fine-tunes the generalized autoregressive pre-trained model (XLNet) by using Adaptive Probabilistic Label Clusters (APLC) to calculate cross-entropy loss. This method takes advantage of unbalanced label distributions by forming clusters to reduce training time. The authors experimented on five different datasets and showed that their method far outweighs the existing state-of-the-art models.<br />
<br />
== Motivation ==<br />
Extreme multi-label text classification (XMTC) has applications in many recent problems such as providing word representations of a large vocabulary [1], tagging Wikipedia articles with relevant labels [2], and giving product descriptions for search advertisements [3]. The authors are motivated by the shortcomings of traditional methods in the creation of XMTC. For example, one such method of classifying text is the bag-of-words (BOW) approach where a vector represents the frequency of a word in a corpus. However, BOW does not consider the location of the words so it cannot determine context and semantics. Motivated by the success of transfer learning in a wide range of natural language processing (NLP) problems [10], the authors propose to adapt XLNet [4] to the XMTC problem. The final challenge is that the labeling distribution can be very sparse for some labels. The authors solve this problem by combining the Probabilistic Label Tree [5] method and the Adaptive Softmax [6] to propose APLC.<br />
<br />
== Related Work ==<br />
Two approaches have been proposed to solve the XMTC problem: traditional BOW techniques, and modern deep learning models.<br />
<br />
=== BOW Approaches ===<br />
The traditional BOW approach contains three different approaches: one-vs-all approaches, embedding-based approaches, and tree-based approaches.<br />
<br />
Intuitively, researchers can apply the one-vs-all approach in which they can fit a classifier for each label and thus XMTC reduces to a binary classification problem. This technique has been shown to achieve high accuracy; however, due to a large number of labels, this approach is very computationally expensive. There have been some techniques to reduce the complexity by pruning weights (PDSparse) to induce sparsity, however this method is quite expensive. PPDSparse was introduced to parallelize PDSparse in a large-scaled distributed setting. Additionally, DiSMEC was introduced as another extension to PDSparse that allows for a distributed and parallel training mechanism which can be done efficiently by a novel negative sampling technique in addition to the pruning of spurious weights. Another approach is to simply apply a dimensional reduction technique on the label space. However, doing so has shown to have serious negative effects on the prediction accuracy due to loss of information in the compression phase. Finally, another approach is to use a tree to partition labels into groups based on similarity. This approach has shown to be quite fast but unfortunately, due to the problems with BOW methods, the accuracy is poor.<br />
<br />
In embedding-based approaches, the high-dimensional label space is projected into a low-dimensional one by exploiting label correlations and sparsity. Embedding methods may pay a heavy price in terms of prediction accuracy due to the loss of information in the compression phase.<br />
<br />
In tree-based methods, a hierarchical tree structure is learned to partition the instances or labels into different groups so that similar ones can be in the same group. The whole set, initially assigned to the root node, is partitioned into a fixed number of k subsets corresponding to k child nodes of the root node. The partition process is repeated until a stopping condition is satisfied on the subsets.<br />
<br />
=== Deep Learning Approaches ===<br />
Unlike BOW approaches, deep learning can learn dense representations of the corpus using context and semantics. One such example is X-BERT[7] which divides XTMC problems into 3 steps. First, it partitions the label set into clusters based on similarity. Next, it fits a BERT model on the label clusters for the given corpus. Finally, it trains simple linear classifiers to rank the labels in each cluster. Since there are elements of traditional techniques in X-BERT, namely, the clustering step and the linear classifier step, the authors propose to improve upon this approach.<br />
<br />
== APLC-XLNet ==<br />
APLC-XLNet consists of three parts: the pre-trained XLNet-Base as the base, the APLC output layer, and a fully connected hidden layer connecting the pooling layer of XLNet as the output layer, as it can be seen in figure 1. To recap, XLNet [8] is a generalized autoregressive pretraining language model based upon capturing bidirectional contexts through maximizing the expected likelihood over all permutations of the factorization.<br />
<br />
XLNet is an autoregressive pretraining language model that captures bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. It also adopts the Transformer XL as its base architecture. It is pertained on a large corpus of text data, then it is fine-tuned for downstream tasks, which is typical of previous deep learning approaches.<br />
<br />
[[File:Capture1111.JPG |center|600px]]<br />
<br />
<div align="center">Figure 1: Architecture of the proposed APLC-XLNet model. V denotes the label cluster in APLC.</div><br />
<br />
the order which can be expressed as follows:<br />
<br />
\begin{equation}<br />
\underset{\theta}{\max}{ E_{z \sim Z_T}[\sum_{t=1}^{T} \log {p_{\theta}(x_{z_{t}}|x_{z<t})}] }<br />
\end{equation}<br />
<br />
where <math> \theta </math> denotes the parameters of the model, <math> Z_T </math> is the set of all possible permutations of the sequence with length T,<br />
<math> z_t </math> is the <math> t </math>-th element and <math> z < t </math> deontes the preceding <math> t-1 </math> elements in a permutation <math> z</math>.<br />
<br />
One major challenge in XTMC problems is that most data fall into a small group of labels. To tackle this challenge, the authors propose partitioning the label set into one head cluster, <math> V_h </math>, and many tail clusters, <math> V_1 \cup \ldots \cup V_K </math>. The head cluster contains the most popular labels while the tail clusters contain the rest of the labels. The clusters are then inserted in a 2-level tree where the root node is the head cluster and the leaves are the tail clusters. Using this architecture improves computation times significantly since most of the time the data stops at the root node.<br />
<br />
[[File:Paper21_ClusterArchitecture.PNG |center|600px]]<br />
<br />
<div align="center">Figure 2: Architecture of the 3-cluster APLC. h denotes the hidden state. Vh denotes the head cluster. V1 and V2 denote the two tail clusters.</div><br />
<br />
The authors define the probability of each label as follows:<br />
<br />
\begin{equation}<br />
p(y_{ij} | x) = <br />
\begin{cases} <br />
p(y_{ij}|x) & \text{if } y_{ij} \in V_h \\<br />
p(V_t|x)p(y_{ij}|V_t,x) & \text{if } y_{ij} \in V_t<br />
\end{cases}<br />
\end{equation}<br />
<br />
where <math> x </math> is the feature of a given sample, <math> y_{ij} </math> is the j-th label in the i-th sample, and <math> V_t </math> is the t-th tail cluster. Let <math> Y_i </math> be the set of labels for the i-th sample, and define <math> L_i = |Y_i| </math>. The authors propose an intuitive objective loss function for multi-label classification:<br />
<br />
\begin{equation}<br />
J(\theta) = -\frac{1}{\sum_{i=1}^N L_i} \sum_{i=1}^N \sum_{j \in Y_i} (y_{ij} logp(y_{ij}) + (1 - y_{ij}) log(1- p(y_{ij}))<br />
\end{equation}<br />
where <math>N</math> is the number of samples, <math>p(y_{ij})</math> is defined above and <math> y_{ij} \in \{0, 1\} </math>.<br />
<br />
The number of parameters in this model is given by:<br />
\begin{equation}<br />
N_{par} = d(l_h + K) + \sum_{i=1}^K \frac{d}{q^i}(d+l_i)<br />
\end{equation}<br />
where <math> d </math> is the dimension of the hidden state of <math> V_h </math>, <math> q </math> is a decay variable, <math> l_h = |V_h| </math> and <math> l_i = |V_i| </math>. Furthermore, the computational cost can be expressed as follows:<br />
\begin{align}<br />
C &= C_h + \sum_{i=1}^K C_i \\<br />
&= O(N_b d(l_h + K)) + O(\sum_{i=1}^K p_i N_b \frac{d}{q^i}(l_i + d))<br />
\end{align}<br />
where <math> N_b </math> is the batch size. <br />
<br />
Note that in practical settings where <math>K\ll l_h</math> and <math>d\ll l_i</math> the computational cost can be rewritten as <math>O(N_b d(l_h + \sum_{i=1}^Kp_i\frac{l_i}{q_i}))</math>. Furthermore, the authors pointed out that, assuming all of <math>l_i</math> are fixed, the computational complexity can be reduced by configuring the model so that <math>p_i</math> is small, which corresponds to a low probability of a batch entering the tail cluster <math>i</math>, as the computational burden is dominated by the cost of entering tail clusters.<br />
<br />
=== Training APLC-XLNet ===<br />
Training APLC-XLNet essentially boils down to training its three parts. The authors suggest using discriminative fine-tuning method [9] to train the model entirely while assigning different learning rates to each part. Since XLNet is pretrained, the learning rate, <math> \eta_x </math>, should be small, while the output layer is specific to this type of problem so its learning rate, <math> \eta_a </math>, should be large. For the connecting hidden layer, the authors chose a learning rate, <math> \eta_h </math>, such that <math> \eta_x < \eta_h < \eta_a </math>. For each of the learning rates, the authors suggest a slanted triangular learning schedule[9] defined as:<br />
\begin{equation}<br />
\eta =<br />
\begin{cases} <br />
\eta_0 \frac{t}{t_w} & \text{if } t \leq t_w \\<br />
\eta_0 \frac{t_a - t}{t_a - t_w} & \text{if } t > t_w<br />
\end{cases}<br />
\end{equation}<br />
where <math> \eta_0 </math> is the starting learning rate, <math> t </math> is the current step, <math> t_w </math> is the chosen warm-up threshold and <math> t_a </math> is the total number of steps. The objective here is to motivate the model to converge quickly to the suitable space at the beginning and then refine the parameters. Learning rates are first increased linearly, and then decayed gradually according to the strategy.<br />
<br />
<br />
[[File:CaptureMulti.PNG|800px|center]]<br />
== Results ==<br />
The authors tested the APLC-XLNet model in several benchmark datasets against current state-of-the-art baselines, including two embedding based models, SLEEC and AnnexML, the 1-vs-all model DisMEC, three tree-based approaches, PfastreXML, Parabel and Bonsai, and two deep learning approaches, XML-CNN and AttentionXML. The evaluation metric, P@k is defined as:<br />
\begin{equation}<br />
P@k = \frac{1}{k} \sum_{i \in rank_k(\hat{y})} y_i<br />
\end{equation}<br />
where <math> rank_k(\hat{y}) </math> is the top k ranked probability in the prediction vector, <math> \hat{y} </math>.<br />
<br />
As shown in the table below, APLC-XLNet outperforms the two embedding based models, SLEEC and AnnexML, across all five datasets. It outperforms DisMEC on three datasets except for Wiki-500k and Amazon-670k. Among the three tree-based methods, Bonsai performs the best. And APLC-XLNet outperforms Bonsai on three datasets. At last, APLC-XLNet also outperforms the deep learning methods on four datasets. <br />
[[File:Paper.PNG|1000px|]]<br />
<br />
A SentencePiece tokenizer is used to preprocess raw text. The sentence is split up into small tokens as words. The following table (2) represents APLC's implementation details. <br />
[[File:Table2 paper.png|500px|center]]<br />
<br />
To help with tuning the model in the number of clusters and different partitions, the authors experimented on two different datasets: EURLex and Wiki10.<br />
<br />
[[File:XTMC2.PNG|1000px|]]<br />
<br />
The three different partitions in the second graph the authors used were (0.7, 0.2, 0.1), (0.33, 0.33, 0.34), and (0.1, 0.2, 0.7) where 3 clusters were fixed.<br />
<br />
== Source Code ==<br />
<br />
The Source Code for the paper can be found here: https://github.com/huiyegit/APLC_XLNet. It is based on the very popular and well-maintained NLP library Hugging Face (https://huggingface.co/) which contains collections of pre-training NLP models and techniques.<br />
<br />
== Conclusion ==.<br />
The authors have proposed a new deep learning approach to solve the XMTC problem based on XLNet, namely, APLC-XLNet. APLC-XLNet consists of three parts: the pretrained XLNet that takes the input of the text, a connecting hidden layer, and finally an APLC output layer to give the rankings of relevant labels. Their experiments show that APLC-XLNet has better results in several benchmark datasets over the current state-of-the-art models.<br />
<br />
== Critques ==<br />
The authors chose to use the same architecture for every dataset. The model does not achieve state-of-the-art performance on larger datasets. Perhaps, a more complex model in the second part of the model could help achieve better results. The authors also put a lot of effort into explaining the model complexity for APLC-XLNet but do not compare it to other state-of-the-art models. A table of model parameters and complexity for each model could be helpful in explaining why their techniques are efficient. <br />
<br />
In their implementation, the label set was partitioned into clusters in a heuristic manner by experimenting with different numbers of clusters and partition proportions. There might be a better model-based method to choose the partition structure.<br />
<br />
== References ==<br />
[1] Mikolov, T., Kombrink, S., Burget, L., Cernock ˇ y, J., and `<br />
Khudanpur, S. Extensions of recurrent neural network<br />
language model. In 2011 IEEE International Conference<br />
on Acoustics, Speech and Signal Processing (ICASSP),<br />
pp. 5528–5531. IEEE, 2011. <br />
<br />
[2] Dekel, O. and Shamir, O. Multiclass-multilabel classification with more classes than examples. In Proceedings<br />
of the Thirteenth International Conference on Artificial<br />
Intelligence and Statistics, pp. 137–144, 2010. <br />
<br />
[3] Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss<br />
functions for recommendation, tagging, ranking & other<br />
missing label applications. In Proceedings of the 22nd<br />
ACM SIGKDD International Conference on Knowledge<br />
Discovery and Data Mining, pp. 935–944. ACM, 2016. <br />
<br />
[4] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li,<br />
M., and Lin, J. End-to-end open-domain question answering with BERTserini. In NAACL-HLT (Demonstrations),<br />
2019a. <br />
<br />
[5] Jasinska, K., Dembczynski, K., Busa-Fekete, R.,<br />
Pfannschmidt, K., Klerx, T., and Hullermeier, E. Extreme f-measure maximization using sparse probability<br />
estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016. <br />
<br />
[6] Grave, E., Joulin, A., Cisse, M., J ´ egou, H., et al. Effi- ´<br />
cient softmax approximation for gpus. In Proceedings of<br />
the 34th International Conference on Machine LearningVolume 70, pp. 1302–1310. JMLR.org, 2017 <br />
<br />
[7] Wei-Cheng, C., Hsiang-Fu, Y., Kai, Z., Yiming, Y., and<br />
Inderjit, D. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from<br />
Transformers. In NeurIPS Science Meets Engineering of<br />
Deep Learning Workshop, 2019.<br />
<br />
[8] Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." <br />
Advances in neural information processing systems. 2019.<br />
<br />
[9] Howard, J. and Ruder, S. Universal language model finetuning for text classification. In Proceedings of the 56th<br />
Annual Meeting of the Association for Computational<br />
Linguistics (Volume 1: Long Papers), pp. 328–339, 2018<br />
<br />
[10] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1 (Long and Short Papers), pp. 4171–4186, 2019.</div>Mhwuhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_CertificatesBreaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-09T00:20:37Z<p>A227jain: removed unwanted symbol</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. Mathematically, such an adversarial example <math>x'</math> satisfies <math>distance(x,x')=\delta, f(x)\neq f(x')</math>, where <math>\delta</math> is some small number and <math>f(\cdot)</math> is the image label. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minimally augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been built that protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but these defenses are based on heuristics and tricks that are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an inputted image in which the classification remains constant within a bounded set of perturbations around the original inputted image. Certified defenses have thus far considered <math>l_\text{p}</math>-bounded attacks where after labelling an input, if there does not exists an image resulting in a different label that is within the <math>l_\text{p}</math> norm ball of radius <math>\epsilon</math>, centred at the original input, then a certificate is issued. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.<br />
<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The statement made by the certificate (i.e., that the input image is not an adversarial example in the chosen norm) is still technically correct, however in this case the adversary is hiding behind a certificate to avoid detection by a certifiable defense. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image) and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known Projected Gradient Descent (PGD) attack. PGD, a universal first-order adversary [4], is the only method to greatly improve NN model robustness among all the defenses appearing in ICLR2018 and CVPR2018 [5]. The fundamental idea of the PGD attack is the same where a bunch of adversarial images is created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image. For a recent review on adversarial attacks and more information of PGD attacks, see [1].<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero. <br />
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
[[File:geese_attack.png|center]]<br />
<div align="center">'''Figure 1:''' Shadow attack used to build an adversarial example to target smoothed ImageNet classifier which results in a large certified radii. Even the adversarial perturbation is natural and smooth looking, it's measurement using lp-metrics is quite large. </div><br />
<br />
The third and the fifth column correspond to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div><br />
<br />
<br />
The above table shows the robustness errors in the case of the CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
== Source Code == <br />
<br />
The source code of the paper can be found here https://github.com/AminJun/BreakingCertifiableDefenses .<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in the norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, whereas the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs, and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.<br />
<br />
[2] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
[3] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
[4] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.</div>Gsikrihttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Task_Understanding_from_Confusing_Multi-task_DataTask Understanding from Confusing Multi-task Data2020-11-08T20:55:49Z<p>M59jiang: /* Multi-label learning */</p>
<hr />
<div>'''Presented By'''<br />
<br />
Qianlin Song, William Loh, Junyue Bai, Phoebe Choi<br />
<br />
= Introduction =<br />
<br />
Narrow AI is an artificial intelligence that outperforms humans in a narrowly defined task. The application of Narrow AI is becoming more and more common. For example, Narrow AI can be used for spam filtering, music recommendation services, assisting doctors to make data-driven decisions, and even self-driving cars. One of the most famous integrated forms of Narrow AI is Apple's Siri. Siri has no self-awareness or genuine intelligence, and hence often has challenges performing tasks outside its range of abilities. However, the widespread use of Narrow AI in important infrastructure functions raises some concerns. Some people think that the characteristics of Narrow AI make it fragile, and when neural networks can be used to control important systems (such as power grids, financial transactions), alternatives may be more inclined to avoid risks. While these machines help companies improve efficiency and cut costs, the limitations of Narrow AI encouraged researchers to look into General AI. <br />
<br />
General AI is a machine that can apply its learning to different contexts, which closely resembles human intelligence. This paper attempts to generalize the multi-task learning system that learns from data from multiple classification tasks. For an isolated and very difficult task, the artificial intelligence may not learn it very well. For instance, a net with pixel dimension 1000*1000 is less likely to identify complicated objects in real-world situations on the time basis. However, if it could be learned simultaneously, it would be better as the tasks can share what they learned. It is easier for the learner to learn together instead of in isolation, for example, shapes, landmarks, textures, orientation and so on. This is called Multitask Learning. One application is image recognition. In figure 1, an image of an apple corresponds to 3 labels: “red”, “apple” and “sweet”. These labels correspond to 3 different classification tasks: color, fruit, and taste. <br />
<br />
[[File:CSLFigure1.PNG | 500px]]<br />
<br />
Currently, multi-task machines require researchers to construct a task definition. Otherwise, it will end up with different outputs with the same input values. R