http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Mdadbin&feedformat=atomstatwiki - User contributions [US]2024-03-29T15:37:31ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=48876Model Agnostic Learning of Semantic Features2020-12-02T14:09:08Z<p>Mdadbin: </p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning which focuses on storing knowledge from one domain (source domain) to solve a similar problem in another domain (target domain). In addition to regular transfer learning, one can use "transfer metric learning" in which through utilizing a similarity relationship between samples [1], [2] a more robust and discriminative data representation is formed. However, both of these kinds of techniques work insofar as the domain shift, between source and target domains, is negligible. Domain shift is defined as the deviation in the distribution of the source domain and the target domain and it would cause the DNN model to completely fail. The multi-domain learning is the solution when the assumption of "source domain and target domain comes from an almost same distribution" may not hold. There are two variants of MDL in the literature that can be confused, i.e. domain generalization, and domain adaptation; however in domain adaptation, we have access to the target domain data somehow, while that is not the case in domain generalization. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by the model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originated from model-agnostic meta-learning (MAML), episodic training has been vastly leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The method of MLDG [4] closely follows MAML in terms of back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation as the use of the task objective might be sub-optimal since it only uses class probabilities. Most of the works [3,7] in the literature lack notable guidance from the semantics of feature space, which contains crucial domain-independent ‘general knowledge’ that can be useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
Also known as "learning to learn", Model-agnostic Meta-Learning is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test sets, the updated weights are calculated using the algorithm below.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
<math> F_{\psi}: X \rightarrow Z</math> where <math> Z </math> is a feature space<br />
<math> T_{\theta}: X \rightarrow \mathbf {R}^{C}</math> where <math> C </math> is the number of classes in <math> Y </math><br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized with minimizing a cross-entropy loss namely <math> \mathbf{L}_{task} </math> formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y}) = - \sum_{c}1[y=C]log(\hat{y}_{c})</math><br />
</div><br />
<br />
Although the task loss is a decent predictor, nothing prevents the model from overfitting to the source domains and suffering from degradation on unseen test domains. This issue is considered in other loss terms.<br />
<br />
===Model-Agnostic Learning with Episodic Training===<br />
The key of their learning procedure is an episodic training scheme, originated from model-agnostic meta-learning, to expose the model optimization to distribution mismatch. In line with their goal of domain generalization, the model is trained on a sequence of simulated episodes with domain shift. Specifically, at each iteration, the available domains <math>D</math> are randomly split into sets of meta-train <math>D_{tr}</math> rand meta-test <math>D_{te}</math> domains. The model is trained to semantically perform well on held-out <math>D_{te}</math> after being optimized with one or more steps of gradient descent with <math>D_{tr}</math> domains. In our case, the feature extractor’s and task network’s parameters,ψ and θ, are first updated from the task-specific supervised loss <math>L</math> task(e.g. cross-entropy for classification), computed on meta-train:<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts. These relationships are invariant to changes in observation domains. Capturing and preserving such class relationships can help models generalize well on unseen data. To achieve this, a global layout of extracted features are imposed such that the relative locations of extracted features reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, the inter-class alignment across domains is disregarded. Hence, minimizing symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes has been used:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D{j}; \psi^{'}, \theta^{'}) = 1/C \sum_{c=1}^{C} 1/2[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})], </math><br />
</div><br />
The authors stated that symmetric divergences such as Jensen–Shannon (JS) showed no significant difference with KL over symmetry.<br />
<br />
=== Local cluster sampling ===<br />
<math> L_{global} </math> captures inter-class relationships, we also want to make semantic features close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, have been used to ensure that the semantic features, locally cluster according to only class labels, regardless of the domain. Contrastive loss takes two samples as input and makes samples of the same class closer while pushing away samples of different classes.<br />
[[File: contrastive.png | 400px]]<br />
<br />
Conversely, triplet loss takes three samples as input: one anchor, one positive, and one negative. Triplet loss tries to make relevant samples closer than irrelevant ones.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
<br />
== Model agnostic learning of semantic features ==<br />
These losses are used in an episodic training scheme showed in the below figure:<br />
[[File:algo2.PNG|600px|center]]<br />
<br />
The training architecture and three losses are also illustrated as below:<br />
<br />
[[File:Ashraf99.png|800px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method has been demonstrated using two common benchmark datasets for domain generalization, i.e. VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all of their experiments, the AlexNet with ImageNet pre-trained weights has been utilized. <br />
<br />
=== VLCS ===<br />
VLCS[12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16] <br />
leave-one-domain-out validation with randomly dividing each domain into 70% training and 30% test.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG[4], in the table below on this dataset, indicating that semantic properties would provide superior performance with respect to purely highly-abstracted task loss on meta-test. "DeepAll" in the table is the case in which there is no domain generalization. In DeepAll case the class labels have been used only, regardless of the domain each sample would lie in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. This dataset contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As you can see in the table below, MASF outperforms state of the art JiGen[18], MLDG[4], MetaReg[3], significantly. In addition, the best improvement has achieved (6.20%) when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it is different from other domains significantly.<br />
<br />
[[File:Figure2 image.png|600px|center]]<br />
<div align="center">t-SNE visualizations of extracted features.</div> <br />
<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, the authors have performed additional experiments using advanced deep residual architectures like ResNet-18 and ResNet-50. The below table shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of the MASF has been also demonstrated using a segmentation task of MRI images gathering from four different clinical centers denoted as (Set-A, Set-B, Set-C, and Set-D). The domain shift, in this case, would occur due to differences in hardware, acquisition protocols, and many other factors, hindering translating learning-based methods to real clinical practice. The authors attempted to segment the brain images into four classes: background, grey matter, white matter, and cerebrospinal fluid. Tasks such as these have enormous impact in clinical diagnosis and aiding in treatment. For example, designing a similar net to segment between healthy brain tissue and tumorous brain tissue could aid surgeons in brain tumour resection.<br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results showed the effectiveness of the MASF in comparison to not use domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique by taking the advantage of incorporating global and local constraints for learning semantic feature spaces presented which outperforms the state-of-the-art. The power and effectiveness of this method have been demonstrated using two domain generalization benchmarks, and a real clinical dataset (MRI image segmentation). The code is publicly available at [19]. As future work, it would be interesting to integrate the proposed loss functions with other methods as they are orthogonal to each other and evaluate the benefit of doing so. Also, investigating the usage of the current learning procedure in the context of generative models would be an interesting research direction.<br />
<br />
== Critiques ==<br />
<br />
The purpose of this paper is to help guide learning in semantic feature space by leveraging local similarity. The authors argument may contain essential domain-independent general knowledge for domain generalization to solve this issue. In addition to adopting constructive loss and triplet loss to encourage the clustering for solving this issue. Extracting robust semantic features regardless of domains can be learned by leveraging from the across-domain class similarity information, which is important information during learning. The learner would suffer from indistinct decision boundaries if it could not separate the samples from different source domains with separation on the domain invariant feature space and in-dependent class-specific cohesion. The major problem that will be revealed with large datasets is that these indistinct decision boundaries might still be sensitive to the unseen target domain.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=48875Dense Passage Retrieval for Open-Domain Question Answering2020-12-02T14:07:20Z<p>Mdadbin: </p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
Open-domain question answering is a task that finds question answers from a large collection of documents. Nowadays open-domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. <br />
<br />
= Background =<br />
The following example clearly shows what problems open-domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader. <br />
<br />
=== Mathematical Formulation ===<br />
Let's assume that we have a collection of <math> D </math> documents <math>d_1, d_2, \cdots, d_D </math> where each <math>d</math> is split into text passages of equal lengths as the basic retrieval units. Then the total number of passages is <math> M </math> in the corpus <math>\mathcal{C} = \{p_1, p_2, \ldots, p_{M}\} </math>. Each passage <math> p_i </math> is composed of a sequence of tokens <math> w^{(i)}_1, w^{(i)}_2, \cdots, w^{(i)}_{|p_i|} </math>. The task is to find a span of tokens <math> w^{(i)}_s, w^{(i)}_{s+1}, \cdots, w^{(i)}_{e} </math> from one of the passages <math> p_i </math> to answer the given question <math>q_i</math> . The corpus can contain millions of documents just to cover a wide variety of domains (e.g. Wikipedia) or even billions of them (e.g. the Web).<br />
The open-domain QA systems define the retriever as a function <math> R:(q,\mathcal{C}) \rightarrow \mathcal{C_F} </math> that takes as input a question <math>q</math> and a corpus <math>\mathcal{C}</math> and then returns a much smaller filter set of texts <math>\mathcal{C_F} \subset \mathcal{C}</math> , where <math>|\mathcal{C_F}| = k \ll |\mathcal{C}|</math>. The retriever can be evaluated by '''top-k retrieval accuracy''' that represents the fraction of relevant answers contained in <math>\mathcal{C_F}</math>.<br />
<br />
= Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model that encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. The authors used the following similarity metric between the question and the passage:<br />
\begin{equation}<br />
sim(q,p) = E_Q(q)^TE_P(p)<br />
\end{equation}<br />
<br />
Of course, this is not the only possible metric; any similarity metric that is decomposable is sufficient. Many of these are some sort of transformation of the Euclidean distance. The authors experimented with other similarity metrics but found comparable results so they decided to use the simplest to focus on learning better encoders.<br />
<br />
== Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
During the inference time, they used FAISS (Johnson et al., 2017) for the similarity search, which can be efficiently applied on billions of dense vectors<br />
<br />
== Training ==<br />
The goal in training the encoders is to create a vector space where the relevant pairs of question and passages (positive passages) have a smaller distance than the irrelevant ones (negative passages) which is a metric learning problem.<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log-likelihood of the positive passages.<br />
\begin{eqnarray}<br />
&& L(q_i, p^+_i, p^-_{i,1}, \cdots, p^-_{i,n}) \label{eq:training} \\<br />
&=& -\log \frac{ e^{\mathrm{sim}(q_i, p_i^+)} }{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^n{e^{\mathrm{sim}(q_i, p^-_{i,j})}}}. \nonumber<br />
\end{eqnarray}<br />
While positive passage selection is simple, where the passage containing the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query <math>q_i</math>, positive passage <math>p_i</math>) in a mini-batch, then the negative passages for query <math>q_i</math> are the passages <math>p_i</math> when <math>j</math> is not equal to <math>i</math>.<br />
<br />
= Experimental Setup =<br />
The authors pre-processed Wikipedia documents, i.e. they removed semi-structured data, such as tables, info- boxes, lists, as well as the disambiguation pages, and then splitted each document into passages of length 100 words. These passages form a candidate pool. The five QA datasets described below were used to build the train data. <br />
# '''Natural Questions (NQ)''' (Kwiatkowski et al., 2019): This was designed for the end-end question-answering tasks where the questions are Google search queries and the answers are annotated texts from the Wikipedia.<br />
# '''TriviaQA''' (Joshi et al., 2017): It is a collection of trivia questions and answers scraped from the Web.<br />
# '''WebQuestions (WQ)''' (Berant et al., 2013): It consists of questions selected using Google Search API where the answers are entities in Freebase<br />
# '''CuratedTREC (TREC)''' (Baudisˇ and Sˇedivy`, 2015): It is intended for the open-domain QA from unstructured corpora. It consists of TREC QA tracks and questions from other Web sources.<br />
# '''SQuAD v1.1''' (Rajpurkar et al., 2016): It is a popular benchmark dataset for reading comprehension. The annotators were given a paragraph from the Wikipedia and were supposed to write questions which answers could be found in the given text.<br />
To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px | center]]<br />
<br />
= Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== Main Results ==<br />
<br />
The table below compares the top-k (for k=20 or k=100) accuracies of different retrieval systems on various popular QA datasets. Top-k accuracy is the percentage of examples in the test set for which the correct outcome occurs in the k most likely outcomes as predicted by the network. As can be seen, DPR consistently outperforms BM25 on all datasets, with the exception of SQuAD. Additionally, DPR tends to perform particularly well when a smaller k value in chosen. The authors speculated that the lower performance of DPR on SQuAD was for two reasons inherent to the dataset itself. First, the SQuAD dataset has high lexical overlap between passages and questions. Second, the dataset is biased because it is collected from a small number of Wikipedia articles - a point which has been argued by other researchers as well.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
== Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px | center ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px | center]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss.<br />
<br />
== Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px | center]]<br />
<br />
=== Reader for End-to-End ===<br />
<br />
The reader would score the subset of <math>k</math> passages which were retrieved using DPR. Next, the reader would extract an answer span from each passage and assign a span score. The best span from the passage with the highest selection score is chosen as the final answer. To calculate these scores linear layers followed by a softmax are applied on top of the BERT output as follows:<br />
<br />
<math>\textbf{P}_{start,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (1)<br />
<br />
<math>\textbf{P}_{end,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (2)<br />
<br />
<math>\textbf{P}_{selected}= softmax(\hat{\textbf{P}}^T_i \textbf{w}_{selected}) \in \mathbb{R}^k </math> where <math> k </math> is the number of passages. (3)<br />
<br />
Here <math> \textbf{P}_i \in \mathbb{R}^{L \times h}</math> used in (1) and (2), <math>i</math> is one of the <math> k </math> passages and <math> h </math> is the hidden layer dimension of BERT's output. Additionally, <math> \hat{\textbf{P}} = [\textbf{P}^{[CLS]}_1,...,\textbf{P}^{[CLS]}_k] \in \mathbb{R}^{h \times k}</math>. Here <math> \textbf{w}_{start},\textbf{w}_{end},\textbf{w}_{selected} </math> are learnable vectors. The span score for <math> s </math>-th starting word to the <math> t</math>-th ending word of the <math> i </math>-th passage then can be calculated as <math> P_{start,i}(s) \times P_{end,i}(t) </math> where <math> s,t </math> are used to select the corresponding index of <math> \textbf{P}_{start,i}, \textbf{P}_{end,i} </math> respectively. The passage selection score for the <math> i</math>-th passage can be computed similarly, by indexing the <math>i</math>-th element of <math>\textbf{P}_{selected}</math>.<br />
<br />
During training, from the top 100 passages, one positive and m-1 negative passages. m=24 is the hyperparameter used in all experiments. The marginal log-likelihood is maximized as the training objective. Batch size of 16 is used for large (NQ, TriviaQA, SQuAD) and 4 for small (TREC, WQ) datasets.<br />
<br />
= Source Code =<br />
<br />
<br />
The official code for this paper is freely available at <span class="plainlinks">[https://github.com/facebookresearch/DPR "official repository"]</span><br />
<br />
There are also other repositories here: 1- <span class="plainlinks">[https://github.com/huggingface/transformers transformers]</span> 2- <span class="plainlinks">[https://github.com/huggingface/transformers haystack]</span><br />
<br />
= Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword match. This paper also covers a few training techniques that are required to successfully train a dense retriever. Overall, learning dense representations for first stage retrieval can potentially improve QA system performance, and has received more attention. <br />
= Critiques =<br />
<br />
The bag of words method is an old out-dated retriever approach. The paper could have exploited results from other state-of-the-art algorithms (e.g Golden retriever) and compared the proposed method with them.<br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=48874Functional regularisation for continual learning with gaussian processes2020-12-02T14:05:37Z<p>Mdadbin: </p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is a model forgets how to solve earlier tasks. This paper proposed a new framework to regularize Continual Learning (CL) so that it doesn't forget previously learned tasks. This method, referred to as functional regularization for Continual Learning, leverages the Gaussian process to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then utilized in optimization as a regularizer to prevent the model from completely deviating from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimization of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularization-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope to regularize the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularizing weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in an increased forgetting of earlier tasks with long sequences of tasks.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimized using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularization-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularizes the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinte-dimensional generalization of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimization in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterized by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrized by the kernel function.<br />
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers the the distribution of functions, and <math>\mathcal{N}</math> refers to the distribution of finite random variables.<br />
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observation taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in R^D</math>. However, in practice the following approxiation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\]<br />
Instead of having fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarize information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we are facing the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vector <math>X_i</math>. To see this, note that the <math>\Sigma_i</math> is a <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the Sparse Gaussian process with inducing points, which is introduced as follows.<br />
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrize <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrized by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimizing the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximizing the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_1, j)}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularize the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximized is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularization from previous tasks}}\]<br />
<br />
As a result, we regularize the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimization computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularization term.<br />
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the author proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>. That is, performing inference over the current task in the weight space. Due to the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian viariational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularization from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over their function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T</math>), where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random set <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix <math>K_{X_k}</math> can be constructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_K}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_K} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrized by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure where the final selected inducing points are spread out in different clusters of data. The round dots represents the data points and each color corresponds to a different label.<br />
<br />
[[File:inducing-points.jpg|500px]]<br />
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_kx_i,*}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_ix_i,*}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_ix_i,*}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasize that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete prediction.<br />
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in the practical setting. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density of a new task data and the prior GP density using symmetric KL divergence. We can measure the distance between the GP posterior density of a new task data and the prior GP density using symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math> - the smaller is <math>\ell_i</math> the more surprising is <math>x_i</math>. Before making any updates to the parameter, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimized using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes more and more important when the number of inducing points reduces.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularization-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way for remembering previous tasks by reducing the KL divergence of variational distribution: <math>q(\boldsymbol{u}_1)</math> and <math>p_\theta(u_1)</math>. The ideas in the paper are interesting and experiments support the effectiveness of this approach. After reading the summary, some points came to my mind as follows:<br />
<br />
The main problem with Gaussian Process is its test-time computational load where a Gaussian Process needs a data matrix and a kernel for each prediction. Although this seems to be natural as Gaussian Process is non-parametric and except for data, it has no source of knowledge, however, this comes with computational and memory costs which makes this difficult to employ them in practice. In this paper, the authors propose to employ a subset of training data namely "Inducing Points" to mitigate these challenges. They proposed to choose Inducing Points either at random or based on an optimization scheme where Inducing Points should approximate the kernel. Although in the paper authors formulate the problem of Inducing Points in their formulation setting, this is not a new approach in the field and previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely the main is difference is that in the current paper kernel matrix and in the mentioned paper dissimilarities are employed to find Exemplars or induced points.<br />
<br />
Moreover, one unanswered question is how to determine the number of examplers as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice covariance matrices are positive semi-definite in general while to the best of my knowledge Cholesky decomposition can be used for positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.<br />
<br />
Finally, the number of regularization terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> growth linearly with number of tasks, I am not sure how this algorithm works when number of tasks increases. Clearly, apart from computational cost, having many regularization terms can make optimization more difficult.<br />
<br />
The provided experiments seem interesting and quite enough and did a good job highlighting different facets of the paper but it would be better if these two additional results can be provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta&diff=48873Roberta2020-12-02T13:56:26Z<p>Mdadbin: </p>
<hr />
<div>= RoBERTa: A Robustly Optimized BERT Pretraining Approach =<br />
== Presented by ==<br />
Danial Maleki<br />
<br />
== Introduction ==<br />
Self-training methods in the NLP domain(Natural Language Processing) like ELMo[1], GPT[2], BERT[3], XLM[4], and XLNet[5] have shown significant improvements, but it remains challenging to determine which parts the methods have the most contribution. This paper proposed Roberta, which replicates BERT pretraining, and investigates the effects of hyperparameters tuning and training set size. In summary, the main contributions of this paper are (1) modifying some BERT design choices and training schemes. (2) a new set of datasets. These 2 modification categories improve performance on downstream tasks.<br />
<br />
== Background ==<br />
In this section, they tried to have an overview of BERT as they used this architecture. In short terms, BERT uses transformer architecture[6] with 2 training objectives; they use masks language modeling (MLM) and next sentence prediction (NSP) as their objectives. The MLM objective randomly selects some of the tokens in the input sequence and replaces them with the special token [MASK]. Then they try to predict these tokens based on the surrounding information. NSP employs a binary classification loss for the prediction of whether the two sentences are adjacent to each other or not. It is noteworthy that, while selecting positive next sentences is trivial, generating negative ones is often much more difficult. Originally, they used a random next sentence as a negative. They used Adam optimizer with some specific parameters to train the network. They used different types of datasets to train their networks. Finally, they did some experiments on the evaluation tasks such as GLUE[7], SQuAD, RACE, and showed their performance on those downstream tasks.<br />
<br />
== Training Procedure Analysis == <br />
In this section, they elaborate on which choices are important for successfully pretraining BERT. <br />
<br />
=== Static vs. Dynamic Masking ===<br />
First, they discussed static vs. dynamic masking. As mentioned in the previous section, the masked language modeling objective in BERT pre-training masks a few tokens from each sequence at random and then predicts them. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps. To extend this single static mask, the authors duplicated training data 10 times so that each sequence was masked in 10 different ways. The model was trained on those data for 40 epochs, with each sequence with the same mask being used 4 times.<br />
Unlike static masking, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model. The results show that dynamic masking has slightly better performance in comparison to the static one.[[File:mask_result.png|400px|center]]<br />
<br />
=== Input Representation and Next Sentence Prediction ===<br />
Next, they tried to investigate the necessity of the next sentence prediction objective. They tried different settings to show it would help with eliminating the NSP loss in pretraining.<br />
<br />
<br />
<b>(1) Segment-Pair + NSP</b>: Each input has a pair of segments (segments, not sentences) from either the original document or some different document chosen at random with a probability of 0.5, and then these are trained for a textual entailment or a Natural Language Inference (NLI) objective. The total combined length must be < 512 tokens (the maximum fixed sequence length for the BERT model). This is the input representation used in the BERT implementation.<br />
<br />
<br />
<b>(2) Sentence-Pair + NSP</b>: This is the same as the segment-pair representation but with pairs of sentences instead of segments. However, the total length of sequences here would be a lot less than 512. Hence, larger batch size is used so that the number of tokens processed per training step is similar to that in the segment-pair representation.<br />
<br />
<br />
<b>(3) Full-Sentences</b>: In this setting, they didn't use any kind of NSP loss. Input sequences consist of full sentences from one or more documents. If one document ends, then sentences from the next document are taken and separated using an extra separator token until the sequence's length is at most 512.<br />
<br />
<br />
<b>(4) Doc-Sentences</b>: Similar to the full sentences setting, they didn't use NSP loss in their loss again. This is the same as Full-Sentences, with the difference that the sequence doesn’t cross document boundaries, i.e. once the document is over, sentences from the next ones aren’t added to the sequence. Here, since the document lengths are varying, to solve this problem, they used some kind of padding to make all of the inputs the same length.<br />
<br />
In the following table, you can see each setting's performance on each downstream task as you can see the best result achieved in the DOC-SENTENCES setting with removing the NSP loss. However, the authors chose to use FULL-SENTENCES for convenience sake, since the DOC-SENTENCES results in variable batch sizes.<br />
[[File:NSP_loss.JPG|600px|center]]<br />
<br />
=== Large Batch Sizes ===<br />
The next thing they tried to investigate was the importance of the large batch size. They tried a different number of batch sizes and realized the 2k batch size has the best performance among the other ones. The below table shows their results for a different number of batch sizes. The result suggested the original BERT batch size was too small. The authors used 8k batch size in the remainder of their experiments. <br />
<br />
[[File:batch_size.JPG|400px|center]]<br />
<br />
=== Tokenization ===<br />
In Roberta, they use byte-level Byte-Pair Encoding (BPE) for tokenization compared to BERT, which uses character level BPE. BPE is a hybrid between character and word level modeling based on sub-word units. The authors also used a vocabulary size of 50k rather than 30k at the original BERT implementation did, this increases the total parameters by approximately 15M to 20M for the BERT based and BERT large respectively. This change actually results in slight degradation of end-task performance in some cases, however, the authors preferred having the ability to universally encode text without introducing the "unknown" token.<br />
<br />
== RoBERTa ==<br />
They claim that if they apply all the aforementioned modifications to the BERT and pre-trained the model on a larger dataset, they can achieve higher performance on downstream tasks. They used different types of datasets for their pre-training; you can see a list of them below.<br />
<br />
<b>(1) BookCorpus + English Wikipedia (16GB)</b>: This is the data on which BERT is trained.<br />
<br />
<b>(2) CC-News (76GB)</b>: The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.<br />
<br />
<b>(3) OpenWebText (38GB)</b>: Open Source recreation of the WebText dataset used to train OpenAI GPT.<br />
<br />
<b>(4) Stories (31GB)</b>: A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.<br />
<br />
In conjunction with the modifications of dynamic masking, full-sentences whiteout an NLP loss, large mini-batches, and a larger byte-level BPE.<br />
<br />
== Results ==<br />
[[File:dataset.JPG|600px|center]]<br />
<div align="center">'''Table 5:''' RoBERTa's developement set results during pretraining over more data (160GB from 16GB) and longer duration (100K to 300K to 500K steps). Results are accumulated from each row. </div><br />
<br />
RoBERTa has outperformed state-of-the-art algorithms in almost all GLUE tasks, including ensemble models. More than that, they compare the performance of the RoBERTa with other methods on the RACE and SQuAD evaluation and show their results in the below table.<br />
[[File:squad.JPG|400px|center]]<br />
[[File:race.JPG|400px|center]]<br />
<br />
Table 3 presents the results of the experiments. RoBERTa offers major improvements over BERT (Large). Three additional datasets are added with the original dataset with the original step numbers (100K). In total, 160GB is used for pretraining. Finally, RoBERTa is pretrained for far more steps. Steps were increased from 100K to 300K and then to 500K. Improvements were observed across all downstream tasks. With the 500K steps, XLNet(large) is also outperformed across most tasks.<br />
<br />
== Conclusion ==<br />
The results confirmed that employing large batches over more data along with longer training time improves the performance. In conclusion, they basically said the reasons why they make gains may be questionable, and if you decently pre-train BERT, you will achieve the same performances as RoBERTa.<br />
<br />
== The comparison at a glance ==<br />
<br />
[[File:comparison_roberta.png|500px|center|thumb|from [8]]]<br />
<br />
== Critique ==<br />
While the results are outstanding and appreciable (reasonably due to using more data and resources), the technical novelty contribution of the paper is marginally incremental as the architecture is largely unchanged from BERT.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/pytorch/fairseq/tree/master/examples/roberta RoBERTa]. The original repository for RoBERTa is in PyTorch. In case you are a TensorFlow user, you would be interested in [https://github.com/Yaozeng/roberta-tf PyTorch to TensorFlow] repository as well.<br />
<br />
==Refrences == <br />
[1] Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.In North American Association for Computational Linguistics (NAACL).<br />
<br />
[2] Alec Radford, Karthik Narasimhan, Time Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.<br />
<br />
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).<br />
<br />
[4] Guillaume Lample and Alexis Conneau. 2019. Cross lingual language model pretraining. arXiv preprint arXiv:1901.07291.<br />
<br />
[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.<br />
<br />
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.<br />
<br />
[7] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).<br />
<br />
[8] BERT, RoBERTa, DistilBERT, XLNet - which one to use?<br />
Suleiman Khan<br />
[https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8/ link]</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=48872BERTScore: Evaluating Text Generation with BERT2020-12-02T13:55:11Z<p>Mdadbin: </p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that will judge the quality of the generated text. Commonly used state of the art metrics either use n-gram approaches or word embeddings for calculating the similarity between the reference and the candidate sentence. BertScore, on the other hand, calculates the similarity using contextual embeddings. BertScore basically addresses two common pitfalls in n-gram-based metrics. Firstly, the n-gram models fail to robustly match paraphrases which leads to performance underestimation when semantically-correct phrases are penalized because of their difference from the surface form of the reference. On the other hand in BertScore, the similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Secondly, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes. In contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors of the paper have carried out various experiments in Machine Translation and Image Captioning to show why BertScore is more reliable and robust than the previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness invoked by a bag of words (BoW) representation of text due to the high dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings differ in that they will be deterministic as when given a word embedding model will always produce the same embedding, regardless of the surrounding words. However, contextual embeddings will create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further BLEU is generally calculated for multiple <math>n</math>-grams and averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
Exact- <math> P_n = \frac{\sum_{w \ in S^{n}_{ \hat{x} }} \mathbb{I}[w \in S^{n}_{x}]}{S^{n}_{\hat{x}}} </math> <br />
<br />
Exact- <math> R_n = \frac{\sum_{w \ in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{S^{n}_{x}} </math> <br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
<br />
Other categories include Edit-distance-based Metrics, Embedding-based metrics, and Learned Metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures the cause and effect relationship (A gives B in place of B gives A) that the BLEU score isn't detected.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = ⟨x1, . . . , xk⟩ </math> and candidate sentence <math> \hat{x} = ⟨\hat{x1}, . . . , \hat{xl}⟩. </math> <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in reference sentence and <math> \hat{x}_{j} </math> in candidate sentence. Prenormalized vectors are used, therefore the pairwise similarity is given by <math> x_{i}^T \hat{x_{i}}. </math><br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) can be used with the above equations of the BERTScore. This is optional and depending on the domain of the text and the available data it may or may not benefit the final results. Thus understanding more about Importance Weighing is an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values are between -1 and 1 but practically they are confined in a much smaller range. A value b computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
<br />
== Experiment & Results ==<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. The evaluation has been done on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments, namely, Segment-level human judgments and System-level human judgments. The former assigns a score to each reference candidate pair and the latter associates a single score for the whole system. Segment-level outputs for BERTScore are calculated as explained in the previous section on architecture and the System-level outputs are calculated by taking an average of BERTScore for every reference-candidate pair. Absolute Pearson Correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used for calculating metric quality, Williams test <sup> [1] </sup> for significance of <math> \lvert \rho \rvert </math> and Graham & Baldwin <sup> [2] </sup> methods for calculating the bootstrap resampling of <math> \tau </math>. The authors have also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems. This increases the volume of systems for System-level experiments. Further, the authors have also randomly selected 100 systems out of 10k hybrid systems for ranking them using automatic metrics. They have repeated this process multiple times and generated Hits@1, which contains the percentage of the metric ranking agreeing with human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for 12 submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics are the percentage of captions better or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately 5 reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken for calculating BERTScore is not significantly higher than BLEU. For example, with the same hardware, the Machine Translation test on BERTScore takes 15.6 secs compared to 5.4 secs for BLEU. The time range is essentially small and thus the difference is marginal.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP, and PAWS. The table below summarized the result. Most metrics have a good performance on QQP, but their performance drops significantly on PAWS. Conversely, BERTScore performs competitively on PAWS, which suggests BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric BERTScore is proposed which outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches. This is shown by the experiments carried on the datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings. The remaining architecture is straightforward in itself. There are some word embedding models that use complex metrics for calculating similarity. If we try to use those models along with contextual embeddings instead of word embeddings, they might result in more reliable performance than the BERTScore.<br />
<br />
<br />
The paper was quite interesting, but it is obvious that they lack technical novelty in their proposed approach. Their method is a natural application of BERT along with traditional cosine similarity measures and precision, recall, F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=48871Time-series Generative Adversarial Networks2020-12-02T13:53:03Z<p>Mdadbin: </p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling improving the temporal dynamics of data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but it is not very promising in a generative setup. The GAN approach when applied on time-series directly simply tries to learn <math>p(X|t)</math> using generator and discriminator setup but this fails to leverage the prior probabilities like in the case of the autoregressive models.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (ground truth conditioned) and open-loop inference (the previous guess conditioned), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this including Scheduled Sampling <sup>[1]</sup> where models are trained to output based on a combination of ground truth and previous outputs, training and an auxiliary discriminator that helps separate free-running and teacher-forced hidden states accelerating convergence<sup>[3][4]</sup>, and Actor-critic methods <sup>[5]</sup> that condition on target outputs estimating the next-token value that nudges the actor’s free-running predictions. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN <sup>[6]</sup> try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change frequently with time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, input to the model can be thought of as a tuple of <math>(S, X_{1:t})</math> that has a joint distribution say <math>p</math>. The objective of a generative model is of course to learn from training data, an approximation of the original distribution <math>p(S, X)</math> i.e. <math>\hat{p}(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition of <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math> as well. This gives the following two objective functions.<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how does information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of <math>S</math> and <math>X</math> features in the original space. Therefore, the embedding function has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions have been implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These implementation choices are of course subject to parametrization using any architecture. <br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only gets the noise vector Z as input but it also gets in autoregressive fashion, its previous output i.e. <math>h_s</math> and <math>h_{1:t}</math> as input as well. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients when computed are used to decreasing the likelihood at the generator and increasing it at the discriminator to provide the correct classification of the produced synthetic output. This is the second objective function in the unsupervised loss form.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely on only the binary feedback from GANs adversarial component i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions into the fold. To ensure that the two segments of TimeGAN interact with each other, the generator is alternatively fed embeddings of actual data instead of its own previous synthetical produced embedding. Maximizing the likelihood of this produces the third objective i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the Supervised loss and Recovery loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following as the optimization problem for these two components:<br />
Formula. <div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here lambda >= 0 is used to regularize (or balance) the two losses. <br />
The other components of TimeGAN i.e. generator and discriminator are trained to minimize the Supervised loss along with Unsupervised loss. This optimization problem is formulated as below:<br />
Formula. <div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta >= 0 </math> is used to regularize the two losses.<br />
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (healthy distribution of sample covering real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should have the same predictive purposes as real data). <br />
<br />
The following methods are used for benchmarking and evaluation.<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate gaussian data defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma 1 + (1-\sigma)I)</math>. Table 1 has the results of this experiment performed by different models. The results clearly show how TimeGAN outperforms other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper has experimented on different types of Time Series Data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to testify for its ability to generalize over time-series data. The paper uses datasets like Sines, Stocks, Energy, and Events with different methods to see their performance. Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for different methods across different datasets. And TimeGAN outperforms other methods in both scores indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce realistic time sequences with differential privacy guarantees.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48870CRITICAL ANALYSIS OF SELF-SUPERVISION2020-12-02T13:50:48Z<p>Mdadbin: /* Previous Work */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of the state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. They were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as the Jigsaw puzzle task[6], and the rotation estimation[3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way to find the rotation axis, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In the recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
* Generative models: Generative Adversarial Networks (GANs), learn to generate images in adversarial manner. They consist of a generator network which maps noise samples to image samples and a discriminator network whose task is to distinguish the fake images from the real ones. These two are trained together until the point where the fake images are indistinguishable. A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. <br />
* In RotNet method [3], images are rotated and the CNN learns to figure out the direction. Therefore, this task is a 4-way classification task. Most images are taken upright which could be considered as labeled images with label 90 degrees. The authors of RotNet argue that the concept of 'upright' is hard to understand and requires high-level knowledge about the image, so this task encourages the network to discover more complex information about the images. <br />
* DeepCluster [4] alternates between k-means clustering step, in which pseudo-labels are assigned to the data by k-means on the PCA-reduced features, and the learning step in which the model tries to learn to fit the representation to these labels(cluster IDs) under several image transformations. These transformations include random resized crops with <math> \beta = 0.08 </math> and <math> \gamma = \frac{3}{4}</math> and horizontal flips.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable. The result table indicated the classification accuracy of the linear classifier trained on the top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<div align="center">'''Table :''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conducted interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they only trained using only a single image with much data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of a million images, yet.<br />
<br />
== Critique == <br />
This is a well-written paper. However, as the main contribution of the paper is experimental, I expected a more in-depth analysis. For example, it is interesting to see how these results change if we change AlexNet with a more powerful CNN like EfficientNet?<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48869CRITICAL ANALYSIS OF SELF-SUPERVISION2020-12-02T13:37:27Z<p>Mdadbin: /* Previous Work */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of the state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. They were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as the Jigsaw puzzle task[6], and the rotation estimation[3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way to find the rotation axis, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In the recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
* A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. <br />
* In RotNet method [3], images are rotated and the CNN learns to figure out the direction. Therefore, this task is a 4-way classification task. Most images are taken upright which could be considered as labeled images with label 90 degrees. The authors of RotNet argue that the concept of 'upright' is hard to understand and requires high-level knowledge about the image, so this task encourages the network to discover more complex information about the images. <br />
* DeepCluster [4] alternates between k-means clustering step, in which pseudo-labels are assigned to the data by k-means on the PCA-reduced features, and the learning step in which the model tries to learn to fit the representation to these labels(cluster IDs) under several image transformations. These transformations include random resized crops and horizontal flips.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable. The result table indicated the classification accuracy of the linear classifier trained on the top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<div align="center">'''Table :''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conducted interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they only trained using only a single image with much data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of a million images, yet.<br />
<br />
== Critique == <br />
This is a well-written paper. However, as the main contribution of the paper is experimental, I expected a more in-depth analysis. For example, it is interesting to see how these results change if we change AlexNet with a more powerful CNN like EfficientNet?<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48795CRITICAL ANALYSIS OF SELF-SUPERVISION2020-12-02T03:18:44Z<p>Mdadbin: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of the state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. They were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as the Jigsaw puzzle task[6], and the rotation estimation[3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way to find the rotation axis, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In the recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable. The result table indicated the classification accuracy of the linear classifier trained on the top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<div align="center">'''Table :''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conducted interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they only trained using only a single image with much data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of a million images, yet.<br />
<br />
== Critique == <br />
This is a well-written paper. However, as the main contribution of the paper is experimental, I expected a more in-depth analysis. For example, it is interesting to see how these results change if we change AlexNet with a more powerful CNN like EfficientNet?<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48794CRITICAL ANALYSIS OF SELF-SUPERVISION2020-12-02T03:17:53Z<p>Mdadbin: /* Previous Work */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of the state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. They were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as the Jigsaw puzzle task[6], and the rotation estimation[3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way to find the rotation axis, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In the recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable. The result table indicated the classification accuracy of the linear classifier trained on the top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<div align="center">'''Table :''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conduct interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they only trained using only a single image with much data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of a million images, yet.<br />
<br />
== Critique == <br />
This is a well-written paper. However, as the main contribution of the paper is experimental, I expected a more in-depth analysis. For example, it is interesting to see how these results change if we change AlexNet with a more powerful CNN like EfficientNet?<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48793CRITICAL ANALYSIS OF SELF-SUPERVISION2020-12-02T03:16:59Z<p>Mdadbin: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of the state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. They were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as the Jigsaw puzzle task[6], and the rotation estimation[3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way to find the rotation axis, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable. The result table indicated the classification accuracy of the linear classifier trained on the top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<div align="center">'''Table :''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conduct interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they only trained using only a single image with much data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of a million images, yet.<br />
<br />
== Critique == <br />
This is a well-written paper. However, as the main contribution of the paper is experimental, I expected a more in-depth analysis. For example, it is interesting to see how these results change if we change AlexNet with a more powerful CNN like EfficientNet?<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=48792CRITICAL ANALYSIS OF SELF-SUPERVISION2020-12-02T03:16:22Z<p>Mdadbin: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluated the performance of the state-of-the-art self-supervised methods on learning weights of convolutional neural networks (CNNs) and on a per-layer basis. They were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. This paper also aims to figure out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate ground truth labels per se by pretext tasks such as the Jigsaw puzzle task[6], and the rotation estimation[3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained in a way that to find the rotation axis, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs as the opposite of the generator. After training BiGAN, the encoder has learned to generate a rich image representation. In RotNet method [3], images are rotated and the CNN learns to figure out the direction. DeepCluster [4] alternates k-means clustering to learn stable feature representations under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, they have compared the results of a million images in the ImageNet dataset with a million augmented images generated from only one single image. Various data augmentation methods including cropping, rotation, scaling, contrast changes, and adding noise, have been used to generate the mentioned artificial dataset from one image. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
With the intention of measuring the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how well representations at each level are linearly separable. The result table indicated the classification accuracy of the linear classifier trained on the top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<div align="center">'''Table :''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conduct interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they only trained using only a single image with much data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Accordingly, current unsupervised learning is only about augmentation, and we probably do not use the capacity of a million images, yet.<br />
<br />
== Critique == <br />
This is a well-written paper. However, as the main contribution of the paper is experimental, I expected a more in-depth analysis. For example, it is interesting to see how these results change if we change AlexNet with a more powerful CNN like EfficientNet?<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Kr ̈ahenb ̈uhl, and T. Darrell, “Adversarial feature learning,”arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,”arXiv preprintarXiv:1803.07728, 2018<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=47321DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-28T18:13:18Z<p>Mdadbin: /* Results */</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning, and it refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. The proposed method is based on model-free RL with latent state representation that is learned via prediction. The authors have changed the belief representations to learn a critic directly on latent state samples which help to enable scaling to more complex tasks.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly <math>Q_{\pi}</math> gives the value of the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>. <br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. In a high-level perspective, Dreamer first learns latent dynamics from past experience, then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px | center]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of :<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
The main three components of agent learning in imagination are dynamics learning, behavior learning, and environment interaction. In the compact latent space of the world model, the behavior is learned by predicting hypothetical trajectories. Throughout the agent's lifetime, Dreamer performs the following operations either in parallel or interleaved as shown in Figure 3 and Algorithm 1:<br />
<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:dreamer.png|frameless|800px|Dreamer algorithm|center]]<br />
<br />
Notice that there are three neural networks that are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.<br />
<br />
== Results ==<br />
In the following picture we can see the reward vs the environment steps. As we can see the Dreamer outperforms other baseline algorithms. Moreover, the convergence is a lot faster in the Dreamer algorithm. <br />
[[File:dreamer.paper19.png|frameless|500px|Rewards vs environment steps of Dreamer and other baseline algorithms]]<br />
<br />
<br />
The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyper parameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data-efficiency, computation time, and final performance and overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. Also, as a future work on representation learning, the ability to scale latent imagination to environments of higher visual complexity can be investigated.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at https://github.com/google-research/dreamer. <br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
In the model components in Appendix A, they have mentioned that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for distributions in latent space. The paper would have benefitted from pointing out how come they have come up with this architecture as their model. In other words, how the latent vector determines the performance of the agent.<br />
<br />
.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=47318DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-28T18:12:17Z<p>Mdadbin: /* Results */</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning, and it refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. The proposed method is based on model-free RL with latent state representation that is learned via prediction. The authors have changed the belief representations to learn a critic directly on latent state samples which help to enable scaling to more complex tasks.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly <math>Q_{\pi}</math> gives the value of the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>. <br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. In a high-level perspective, Dreamer first learns latent dynamics from past experience, then it learns actions and states from imagined trajectories to maximize future action rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px | center]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of :<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
The main three components of agent learning in imagination are dynamics learning, behavior learning, and environment interaction. In the compact latent space of the world model, the behavior is learned by predicting hypothetical trajectories. Throughout the agent's lifetime, Dreamer performs the following operations either in parallel or interleaved as shown in Figure 3 and Algorithm 1:<br />
<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:dreamer.png|frameless|800px|Dreamer algorithm|center]]<br />
<br />
Notice that there are three neural networks that are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.<br />
<br />
== Results ==<br />
<br />
The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyper parameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data-efficiency, computation time, and final performance and overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
In the following picture we can see the reward vs the environment steps. As we can see the Dreamer outperforms other baseline algorithms. Moreover, the convergence is a lot faster in the Dreamer algorithm. <br />
[[File:dreamer.paper19.png|frameless|500px|Rewards vs environment steps of Dreamer and other baseline algorithms]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. Also, as a future work on representation learning, the ability to scale latent imagination to environments of higher visual complexity can be investigated.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at https://github.com/google-research/dreamer. <br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
In the model components in Appendix A, they have mentioned that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for distributions in latent space. The paper would have benefitted from pointing out how come they have come up with this architecture as their model. In other words, how the latent vector determines the performance of the agent.<br />
<br />
.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:dreamer.paper19.png&diff=47314File:dreamer.paper19.png2020-11-28T18:03:34Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services&diff=47297Adacompress: Adaptive compression for online computer vision services2020-11-28T16:22:02Z<p>Mdadbin: </p>
<hr />
<div><br />
== Presented by == <br />
Ahmed Hussein Salamah<br />
<br />
== Introduction == <br />
<br />
Big data and deep learning have been merged to create the great success of artificial intelligence which increases the burden on the network's speed, computational complexity, and storage in many applications. The image Classification task is one of the most important computer vision tasks which has shown a high dependency on Deep Neural Networks to improve their performance in many applications. Recently, they tend to use different image classification models on the cloud just to share the computational power between the different users as mentioned in this paper (e.g., SenseTime, Baidu Vision and Google Vision, etc.). Most of the researchers in the literature work to improve the structure and increase the depth of DNNs to achieve better performance from the point of how the features are represented and crafted using Conventional Neural Networks (CNNs). As the most well-known image classification datasets (e.g. ImageNet) are compressed using JPEG and this compression technique is optimized for Human Visual System (HVS) but not the machines (i.e. DNNs), so to be aligned with HVS the authors have to reconfigure the JPEG while maintaining the same classification accuracy. <br />
<br />
'''Why is image compression important?'''<br />
<br />
Image compression is crucial in deep learning because we want the image data to take up less disk space and can be loaded faster. Compared to the lossless compression PNG, which preserves the original image data, JPEG is a lossy form of compression meaning some information will be lost though for the benefit of an improved compression ratio. Therefore, it is important to develop deep learning model-based image compression methods which reduce data size without jeopardizing classification accuracy. Some examples of this type of image compression includes the LSTM-based approach proposed by Google [9], the transformation-based method from New York University [10], the autoencoder-based approach by Twitter [11], etc.<br />
<br />
== Methodology ==<br />
<br />
[[File: ada-fig2.PNG | 400px | center]]<br />
<div align="center">'''Figure 1:''' Comparing to the conventional solution, the authors [1] solution can update the compression strategy based on the backend model feedback </div><br />
<br />
One of the major parameters that can be changed in the JPEG pipeline is the quantization table, which is the main source of artifacts added in the image to make it lossless compression as shown in [1, 4]. The authors got motivated to change the JPEG configuration to optimize the uploading rate of different cloud computer vision without considering pre-knowledge of the original model and dataset. In contrast to the authors in [2, 3, 5] where they adjust the JPEG configuration by retraining the parameters or according to the structure of the model. They considered the lack of undefined quantization level which decreases the image rate and quality but the deep learning model can still recognize it as shown in [4]. The authors in [1] used Deep Reinforcement learning (DRL) in an online manner to choose the quantization level to upload an image to the cloud for the computer vision model and this is the only approach to design an adaptive JPEG based on ''RL mechanism''.<br />
<br />
The approach is designed based on an interactive training environment which represents any computer vision cloud services, then they needed a tool to evaluate and predict the performance of quantization level on an uploaded image, so they used a deep Q neural network agent. They feed the agent with a reward function which considers two optimization parameters, accuracy and image size. It works as iterative behavior interacting with the environment. The environment is exposed to different images with different virtual redundant information that needs an adaptive solution for each image to select the suitable compression level for the model. Thus, they designed an explore-exploit mechanism to train the agent on different scenery which is designed in deep Q agent as an inference-estimate-retain mechanism to restart the training procedure for each image. The authors verify their approach by providing some analysis and insight using Grad-Cam [8] by showing some patterns of how a compression level is chosen for each image with its own corresponding quality factor. Each image shows a different response when shown to a deep learning model. In general, images are more sensitive to compression if they have large smooth areas, while those with complex textures are more robust to compression.<br />
<br />
'''What is a quantization table?'''<br />
<br />
Before getting to the quantization table first look at the basic architecture of JPEG's baseline system. This has 4 blocks, which are FDCT (Fast Discrete Cosine Transformation), quantizer, statistical model, and entropy encoder. The FCDT block takes an input image separated into <math> n \times n </math> blocks and applies a discrete cosine transformation creating DCT terms. These DCT terms are values from a relatively large discrete set that will be then mapped through the process of quantization to a smaller discrete set. This is accomplished with a quantization table at the quantizer block, which is designed to preserve low-frequency information at the cost of the high-frequency information. This preference for low frequency information is made because losing high frequency information isn't as impactful to the image when perceived by a humans visual system.<br />
<br />
== Problem Formulation ==<br />
<br />
The authors formulate the problem by referring to the cloud deep learning service as <math> \vec{y}_i = M(x_i)</math> to predict results list <math> \vec{y}_i </math> for an input image <math> x_i </math>, and for reference input <math> x \in X_{\rm ref} </math> the output is <math> \vec{y}_{\rm ref} = M(x_{\rm ref}) </math>. It is referred <math> \vec{y}_{\rm ref} </math> as the ground truth label and also <math> \vec{y}_c = M(x_c) </math> for compressed image <math> x_{c} </math> with quality factor <math> c </math>.<br />
<br />
<br />
\begin{align} \tag{1} \label{eq:accuracy}<br />
\mathcal{A} =& \sum_{k}\min_jd(l_j, g_k) \\ <br />
& l_j \in \vec{y}_c, \quad j=1,...,5 \nonumber \\<br />
& g_k \in \vec{y}_{\rm ref}, \quad k=1, ..., {\rm length}(\vec{y}_{\rm ref}) \nonumber \\<br />
& d(x, y) = 1 \ \text{if} \ x=y \ \text{else} \ 0 \nonumber<br />
\end{align}<br />
<br />
The authors divided the used datasets according to their contextual group <math> X </math> according to [6] and they compare their results using compression ratio <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>, where <math>s_{c}</math> is the compressed size and <math>s_{\rm ref}</math> is the original size, and accuracy metric <math> \mathcal{A}_c </math> which is calculated based on the hamming distance of Top-5 of the output of softmax probabilities for both original and compressed images as shown in Eq. \eqref{eq:accuracy}. In the RL designing stage, continuous numerical vectors are represented as the input features to the DRL agent which is Deep Q Network (DQN). The challenges of using this approach are: <br />
(1) The state space of RL is too large to cover, so the neural network is typically constructed with more layers and nodes, which makes the DRL agent hard to converge and the training time-consuming; <br />
(2) The DRL always starts with a random initial state, but it needs to find a higher reward before starting the training of the DQN. However, the sparse reward feedback resulted from a random initialization makes learning difficult.<br />
The authors solve this problem by using a pre-trained small model called MobileNetV2 as a feature extractor <math> \mathcal{E} </math> for its ability in lightweight and image classification, and it is fixed during training the Q Network <math> \phi </math>. The last convolution layer of <math> \mathcal{E} </math> is connected as an input to the Q Network <math>\phi </math>, so by optimizing the parameters of Q network <math> \phi </math>, the RL agent's policy is updated.<br />
<br />
==Reinforcement learning framework==<br />
<br />
This paper [1] described the reinforcement learning problem as <math> \{\mathcal{X}, M\} </math> to be ''emulator environment'', where <math> \mathcal{X} </math> is defining the contextual information created as an input from the user <math> x </math> and <math> M </math> is the backend cloud model. Each RL frame must be defined by ''action and state'', the action is known by 10 discrete quality levels ranging from 5 to 95 by step size of 10 and the state is feature extractor's output <math> \mathcal{E}(J(\mathcal{X}, c)) </math>, where <math> J(\cdot) </math> is the JPEG output at specific quantization level <math> c </math>. They found the optimal quantization level at time <math> t </math> is <math> c_t = {\rm argmax}_cQ(\phi(\mathcal{E}(f_t)), c; \theta) </math>, where <math> Q(\phi(\mathcal{E}(f_t)), c; \theta) </math> is action-value function, <math> \theta </math> indicates the parameters of Q network <math> \phi </math>. In the training stage of RL, the goal is to minimize a loss function <math> L_i(\theta_i) = \mathbb{E}_{s, c \sim \rho (\cdot)}\Big[\big(y_i - Q(s, c; \theta_i)\big)^2 \Big] </math> that changes at each iteration <math> i </math> where <math> s = \mathcal{E}(f_t) </math> and <math>f_t</math> is the output of the JPEG, and <math> y_i = \mathbb{E}_{s' \sim \{\mathcal{X}, M\}} \big[ r + \gamma \max_{c'} Q(s', c'; \theta_{i-1}) \mid s, c \big] </math> is the target that has a probability distribution <math> \rho(s, c) </math> over sequences <math> s </math> and quality level <math> c </math> at iteration <math> i </math>, and <math> r </math> is the feedback reward. <br />
<br />
The framework get more accurate estimation from a selected action when the distance of the target and the action-value function's output <math> Q(\cdot)</math> is minimized. As a results, no feedback signal can tell that an episode has finished a condition value <math>T</math> that satisfies <math> t \geq T_{\rm start} </math> to guarantee to store enough transitions in the memory buffer <math> D </math> to train on. To create this transitions for the RL agent, random trials are collected to observe environment reaction. After fetching some trials from the environment with their corresponding rewards, this randomness is decreased as the agent is trained to minimize the loss function <math> L </math> as shown in the Algorithm below. Thus, it optimizes its actions on a minibatch from <math> \mathcal{D} </math> to be based on historical optimal experience to train the compression level predictor <math> \phi </math>. When this trained predictor <math> \phi </math> is deployed, the RL agent will drive the compression engine with the adaptive quality factor <math> c </math> corresponding to the input image <math> x_{i} </math>. <br />
<br />
The interaction between the agent and environment <math> \{\mathcal{X}, M\} </math> is evaluated using the reward function, which is formulated, by selecting an appropriate action of quality factor <math> c </math>, to be directly proportional to the accuracy metric <math> \mathcal{A}_c </math>, and inversely proportional to the compression rate <math> \Delta s = \frac{s_c}{s_{\rm ref}} </math>. As a result, the reward function is given by <math> R(\Delta s, \mathcal{A}) = \alpha \mathcal{A} - \Delta s + \beta</math>, where <math> \alpha </math> and <math> \beta </math> to form a linear combination.<br />
<br />
[[File:Alg2.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Algroithim :''' Training RL agent <math> \phi </math> in environment <math> \{\mathcal{X}, M\} </math> </div><br />
<br />
== Inference-Estimate-Retrain Mechanism ==<br />
The system diagram, AdaCompress, is shown in figure 3 in contrast to the existing modules. When the AdaCompress is deployed, the input images scenery context <math> \mathcal{X} </math> may change, in this case the RL agent’s compression selection strategy may cause the overall accuracy to decrease. So, in order to solve this issue, the estimator will be invoked with probability <math>p_{\rm est} </math>. This will be done by generating a random value <math> \xi \in (0,1) </math> and the estimator will be invoked if <math>\xi \leq p_{\rm est}</math>. Then AdaCompress will upload both the original image and the compressed image to fetch their labels. The accuracy will then be calculated and the transition, which also includes the accuracy in this step, will be stored in the memory buffer. Comparing recent the n steps' average accuracy with earliest average accuracy, the estimator will then invoke the RL training kernel to retrain if the recent average accuracy is much lower than the initial average accuracy.<br />
<br />
[[File: diagfig.png|500px|center]]<br />
<br />
The authors solved the change in the scenery at the inference phase that might cause learning to diverge by introducing '''running-estimate-retain mechanism'''. They introduced estimator with probability <math> p_{\rm est} </math> that changes in an adaptive way and it is compared a generated random value <math> \xi \in (0,1) </math>. As shown in Figure 2, Adacompression is switching between three states in an adaptive way as will be shown in the following sections.<br />
<br />
[[File:fig3.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 2:''' State Switching Policy </div><br />
<br />
=== Inference State ===<br />
The '''inference state''' is running most of the time at which the deployed RL agent is trained and used to predict the compression level <math> c </math> to be uploaded to the cloud with minimum uploading traffic load. The agent will eventually switch to the estimator stage with probability <math> p_{\rm est} </math> so it will be robust to any change in the scenery to have a stable accuracy. The <math> p_{\rm est} </math> is fixed at the inference stage but changes in an adaptive way as a function of accuracy gradient in the next stage. In '''estimator state''', there will be a trade off between the objective of reducing upload traffic and the risk of changing the scenery, an accuracy-aware dynamic <math> p'_{\rm est} </math> is designed to calculate the average accuracy <math> \mathcal{A}_n </math> after running for defined <math> N </math> steps according to Eq. \ref{eqn:accuracy_n}.<br />
\begin{align} \tag{2} \label{eqn:accuracy_n}<br />
\bar{\mathcal{A}_n} &=<br />
\begin{cases}<br />
\frac{1}{n}\sum_{i=N-n}^{N} \mathcal{A}_i & \text{ if } N \geq n \\ <br />
\frac{1}{n}\sum_{i=1}^{n} \mathcal{A}_i & \text{ if } N < n <br />
\end{cases}<br />
\end{align}<br />
===Estimator State===<br />
The '''estimator state''' is executed when <math> \xi \leq p_{\rm est} </math> is satisfied , where the uploaded traffic is increased as the both the reference image <math> x_{ref} </math> and compressed image <math> x_{i} </math> are uploaded to the cloud to calculate <math> \mathcal{A}_i </math> based on <math> \vec{y}_{\rm ref} </math> and <math> \vec{y}_i </math>. It will be stored in the memory buffer <math> \mathcal{D} </math> as a transition <math> (\phi_i, c_i, r_i, \mathcal{A}_i) </math> of trial <math>i</math>. The estimator will not be anymore suitable for the latest <math>n</math> step when the average accuracy <math> \bar{\mathcal{A}}_n </math> is lower than the earliest <math>n</math> steps of the average <math> \mathcal{A}_0 </math> in the memory buffer <math> \mathcal{D} </math>. Consequently, <math> p_{\rm est} </math> should be changed to higher value to make the estimate stage frequently happened.It is obviously should be a function in the gradient of the average accuracy <math> \bar{\mathcal{A}}_n </math> in such a way to fell the buffer memory <math> \mathcal{D} </math> with some transitions to retrain the agent at a lower average accuracy <math> \bar{\mathcal{A}}_n </math>. The authors formulate <math> p'_{\rm est} = p_{\rm est} + \omega \nabla \bar{\mathcal{A}} </math> and <math> \omega </math> is a scaling factor. Initially the estimated probability <math> p_0 </math> will be a function of <math> p_{\rm est} </math> in the general form of <math>p_{\rm est} = p_0 + \omega \sum_{i=0}^{N} \nabla \bar{\mathcal{A}_i} </math>. <br />
<br />
===Retrain State===<br />
In '''retrain state''', the RL agent is trained to adapt on the change of the input scenery on the stored transitions in the buffer memory <math> \mathcal{D} </math>. The retain stage is finished at the recent <math> n </math> steps when the average reward <math> \bar{r}_n </math> is higher than a defined <math> r_{th}</math> by the user. Afterward, a new retraining stage should be prepared by saving new next transitions after flushing the old buffer memory <math> \mathcal{D}</math>. The authors supported their compression choice for different cloud application environments by providing some insights by introducing a visualization algorithm [8] to some images with their corresponding quality factor <math> c </math>. The visualization shows that the agent chooses a certain quantization level <math> c </math> based on the visual textures in the image at the different regions. For an instant, a low-quality factor is selected for the rough central region so there is a smooth area surrounded it but for the surrounding smooth region, the agent chooses a relatively higher quality rather than the central region.<br />
<br />
<br />
<br />
==Insight of RL agent’s behavior==<br />
In the inference state, the RL agent predicts a proper compression level based on the features of the input image. In the next subsection, we will see that this compression level varies for different image sets and backend cloud services. Also, by taking a look at the attention maps for some of the images, we will figure out why the agent has chosen this compression level.<br />
===Compression level choice variation===<br />
In Figure 5, for Face++ and Amazon Rekognition, the agent’s choices are mostly around compression level = 15, but for Baidu Vision, the agent’s choices are distributed more evenly. Therefore, the backend strategy really affects the choice for the optimal compression level.<br />
<br />
[[File:comp-level1.PNG|500px|center|fig: running-retrain]]<br />
In figure 6, we will see how the agent's behaviour in selecting the optimal compression level changes for different datasets. The two datasets, ImageNet and DNIM present different contextual sceneries. The images mostly taken at daytime were randomly selected from ImageNet and the images mostly taken at the night time were selected from DNIM. The figure 6 shows that for DNiM images, the agent's choices are mostly concentrated in relatively high compression levels, whereas for ImageNet dataset, the agent's choices are distributed more evenly. <br />
<br />
[[File:comp-level2.PNG|500px|center|fig: running-retrain]]<br />
<br />
<br />
<br />
<br />
<br />
<br />
== Results ==<br />
The authors reported in Figure 3, 3 different cloud services compared to the benchmark images. It is shown that more than the half of the upload size while roughly preserving the top-5 accuracy calculated by using A with an average of 7% proving the efficiency of the design. In Figure 4, it shows the ''' inference-estimate-retain ''' mechanism as the x-axis indicates steps, while <math> \Delta </math> mark on <math>x</math>-axis is reveal as a change in the scenery. In Figure 4, the estimating probability <math> p_{\rm est} </math> and the accuracy are inversely proportion as the accuracy drops below the initial value the <math> p_{\rm est} </math> increase adaptive as it considers the accuracy metric <math> \mathcal{A}_c </math> each action <math> c </math> making the average accuracy to decrease in the next estimations. At the red vertical line, the scenery started to change and <math>Q</math> Network start to retrain to adapt the the agent on the current scenery. At retrain stage, the output result is always use from the reference image's prediction label <math> \vec{y}_{\rm ref} </math>. <br />
Also, they plotted the scaled uploading data size of the proposed algorithm and the overhead data size for the benchmark is shown in the inference stage. After the average accuracy became stable and high, the transmission is reduced by decreasing the <math> p_{\rm est} </math> value. As a result, <math> p_{\rm est} </math> and <math> \mathcal{A} </math> will be always equal to 1. During this stage, the uploaded file is more than the conventional benchmark. In the inference stage, the uploaded size is halved as shown in both Figures 3, 4.<br />
<br />
[[File:ada-fig9.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 3:''' Different cloud services compared relative to average size and accuracy </div><br />
<br />
<br />
[[File:ada-fig10.PNG|500px|center|fig: running-retrain]]<br />
<div align="center">'''Figure 4:''' Scenery change response from AdaCompress Algorithm </div><br />
<br />
==Conclusion==<br />
<br />
Most of the research focused on modifying the deep learning model instead of dealing with the currently available approaches. The authors succeed in defining the compression level for each uploaded image to decrease the size and maintain the top-5 accuracy in a robust manner even the scenery is changed. <br />
In my opinion, Eq. \eqref{eq:accuracy} is not defined well as I found it does not really affect the reward function. Also, they did not use the whole validation set from ImageNet which raises the question of what is the higher file size that they considered from in the mention current set. In addition, if they considered the whole data set, should we expect the same performance for the mechanism.<br />
<br />
== Critiques == <br />
<br />
The authors used a pre-trained model as a feature extractor to select a Quality Factor (QF) for the JPEG. I think what would be missing that they did not report the distribution of each of their span of QFs as it is important to understand which one is expected to contribute more. In my video, I have done one experiment using Inception-V3 to understand if it is possible to get better accuracy. I found that it is possible by using the inception model as a pre-trained model to choose a lower QF, but as well known that the mobile models are shallower than the inception models which make it less complex to run on edge devices. I think it is possible to achieve at least the same accuracy or even more if we replaced the mobile model with the inception. Another point, the authors did not run their approach on a complete database like ImageNet, they only included a part of two different datasets. I know they might have limitations in the available datasets to test like CIFARs, as they are not totally comparable from the resolution perspective for the real online computer vision services work with higher resolutions.<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AhmedHussKhalifa/AdaCompress<br />
<br />
== References ==<br />
<br />
[1] Hongshan Li, Yu Guo, Zhi Wang, Shutao Xia, and Wenwu Zhu, “Adacompress: Adaptive compression for online computer vision services,” in Proceedings of the 27th ACM International Conference on Multimedia, New York, NY, USA, 2019, MM ’19, pp. 2440–2448, ACM.<br />
<br />
[2] Zihao Liu, Tao Liu, Wujie Wen, Lei Jiang, Jie Xu, Yanzhi Wang, and Gang Quan, “DeepN-JPEG: A deep neural network favorable JPEG-based image compression<br />
framework,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 18.<br />
<br />
[3] Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu, and Jason Yosinski, “Faster neural networks straight from jpeg,” in Advances in Neural Information Processing Systems, 2018, pp. 3933–3944.<br />
<br />
[4] Kresimir Delac, Mislav Grgic, and Sonja Grgic, “Effects of jpeg and jpeg2000 compression on face recognition,” in Pattern Recognition and Image Analysis, Sameer Singh, Maneesha Singh, Chid Apte, and Petra Perner, Eds., Berlin, Heidelberg, 2005, pp. 136–145, Springer Berlin Heidelberg.<br />
<br />
[5] Robert Torfason, Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool, “Towards image understanding from deep compression without decoding,” 2018.<br />
<br />
[6] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy, “Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, New York, NY, USA, 2016, MobiSys ’16, pp. 123–136, ACM.<br />
<br />
[7] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018.<br />
<br />
[8] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra, “Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization,” CoRR, vol. abs/1610.02391, 2016.<br />
<br />
[9] George Toderici, Sean M. O'Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, Rahul Sukthankarm, Variable Rate Image Compression with Recurrent Neural Networks, ICLR 2016, arXiv:1511.06085<br />
<br />
[10] Johannes Ballé, Valero Laparra, Eero P. Simoncelli, End-to-end Optimized Image Compression, ICLR 2017, arXiv:1611.01704<br />
<br />
[11] Lucas Theis, Wenzhe Shi, Andrew Cunningham, Ferenc Huszár, Lossy Image Compression with Compressive Autoencoders, ICLR 2017, arXiv:1703.00395</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:attMap.PNG&diff=47296File:attMap.PNG2020-11-28T16:11:29Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:comp-level2.PNG&diff=47295File:comp-level2.PNG2020-11-28T16:10:33Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:comp-level1.PNG&diff=47294File:comp-level1.PNG2020-11-28T16:10:02Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=47125Dense Passage Retrieval for Open-Domain Question Answering2020-11-28T01:17:44Z<p>Mdadbin: /* Experimental Setup */</p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. The code for this paper is freely available at https://github.com/facebookresearch/DPR.<br />
<br />
= Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader. <br />
<br />
=== Mathematical Formulation ===<br />
Let's assume that we have a collection of <math> D </math> documents <math>d_1, d_2, \cdots, d_D </math> where each <math>d</math> is split into text passages of equal lengths as the basic retrieval units. Then the total number of passages is <math> M </math> in the corpus <math>\mathcal{C} = \{p_1, p_2, \ldots, p_{M}\} </math>. Each passage <math> p_i </math> is composed of a sequence of tokens <math> w^{(i)}_1, w^{(i)}_2, \cdots, w^{(i)}_{|p_i|} </math>. The task is to find a span of tokens <math> w^{(i)}_s, w^{(i)}_{s+1}, \cdots, w^{(i)}_{e} </math> from one of the passages <math> p_i </math> to answer the given question <math>q_i</math> . The corpus can contain millions of documents just to cover a wide variety of domains (e.g. Wikipedia) or even billions of them (e.g. the Web).<br />
The open-domain QA systems define the retriever as a function <math> R:(q,\mathcal{C}) \rightarrow \mathcal{C_F} </math> that takes as input a question <math>q</math> and a corpus <math>\mathcal{C}</math> and then returns a much smaller filter set of texts <math>\mathcal{C_F} \subset \mathcal{C}</math> , where <math>|\mathcal{C_F}| = k \ll |\mathcal{C}|</math>. The retriever can be evaluated by '''top-k retrieval accuracy''' that represents the fraction of relevant answers contained in <math>\mathcal{C_F}</math>.<br />
<br />
= Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
During the inference time, they used FAISS (Johnson et al., 2017) for the similarity search, which can be efficiently applied on billions of dense vectors<br />
<br />
== Training ==<br />
The goal in training the encoders is to create a vector space where the relevant pairs of question and passages (positive passages) have smaller distance than the irrelevant ones (negative passages) which is a metric learning problem.<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages.<br />
\begin{eqnarray}<br />
&& L(q_i, p^+_i, p^-_{i,1}, \cdots, p^-_{i,n}) \label{eq:training} \\<br />
&=& -\log \frac{ e^{\mathrm{sim}(q_i, p_i^+)} }{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^n{e^{\mathrm{sim}(q_i, p^-_{i,j})}}}. \nonumber<br />
\end{eqnarray}<br />
While positive passage selection is simple, where the passage containing the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query <math>q_i</math>, positive passage <math>p_i</math>) in a mini-batch, then the negative passages for query <math>q_i</math> are the passages <math>p_i</math> when <math>j</math> is not equal to <math>i</math>.<br />
<br />
= Experimental Setup =<br />
The authors pre-processed Wikipedia documents, i.e. they removed semi-structured data, such as tables, info- boxes, lists, as well as the disambiguation pages and then splitted each document into passages of length 100 words. These passages form a candidate pool. The five QA datasets described below were used to build the train data. <br />
# '''Natural Questions (NQ)''' (Kwiatkowski et al., 2019): This was designed for the end-end question-answering tasks where the questions are Google search queries and the answers are annotated texts from the Wikipedia.<br />
# '''TriviaQA''' (Joshi et al., 2017): It is a collection of trivia questions and answers scraped from the Web.<br />
# '''WebQuestions (WQ)''' (Berant et al., 2013): It consists of questions selected using Google Search API where the answers are entities in Freebase<br />
# '''CuratedTREC (TREC)''' (Baudisˇ and Sˇedivy`, 2015): It is intended for the open-domain QA from unstructured corpora. It consists of TREC QA tracks and questions from other Web sources.<br />
# '''SQuAD v1.1''' (Rajpurkar et al., 2016): It is a popular benchmark dataset for reading comprehension. The annotators were given a paragraph from the Wikipedia and were supposed to write questions which answers could be found in the given text.<br />
To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px | center]]<br />
<br />
= Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== Main Results ==<br />
<br />
The table below compares the top-k (for k=20 or k=100) accuracies of different retrieval systems on various popular QA datasets. Top-k accuracy is the percentage of examples in the test set for which the correct outcome occurs in the k most likely outcomes as predicted by the network. As can be seen, DPR consistently outperforms BM25 on all datasets, with the exception of SQuAD. Additionally, DPR tends to perform particularly well when a smaller k value in chosen. The authors speculated that the lower performance of DPR on SQuAD was for two reasons inherent to the dataset itself. First, the SQuAD dataset has high lexical overlap between passages and questions. Second, the dataset is biased because it is collected from a small number of Wikipedia articles - a point which has been argued by other researchers as well.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
== Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px | center ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px | center]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss.<br />
<br />
== Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px | center]]<br />
<br />
=== Reader for End-to-End ===<br />
<br />
The reader used would score the subset of <math>k</math> passages which were retrieved using DPR. Next, the reader would extract an answer span from each passage and assign a span score. The best span from the passage with the highest selection score is chosen as the final answer. To calculate these scores linear layers followed by a softmax are applied on top of the BERT output as follows:<br />
<br />
<math>\textbf{P}_{start,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (1)<br />
<br />
<math>\textbf{P}_{end,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (2)<br />
<br />
<math>\textbf{P}_{selected}= softmax(\hat{\textbf{P}}^T_i \textbf{w}_{selected}) \in \mathbb{R}^k </math> where <math> k </math> is the number of passages. (3)<br />
<br />
Here <math> \textbf{P}_i \in \mathbb{R}^{L \times h}</math> used in (1) and (2), <math>i</math> is one of the <math> k </math> passages and <math> h </math> is the hidden layer dimension of BERT's output. Additionally, <math> \hat{\textbf{P}} = [\textbf{P}^{[CLS]}_1,...,\textbf{P}^{[CLS]}_k] \in \mathbb{R}^{h \times k}</math>. Here <math> \textbf{w}_{start},\textbf{w}_{end},\textbf{w}_{selected} </math> are learnable vectors. The span score for <math> s </math>-th starting word to the <math> t</math>-th ending word of the <math> i </math>-th passage then can be calculated as <math> P_{start,i}(s) \times P_{end,i}(t) </math> where <math> s,t </math> are used to select the corresponding index of <math> \textbf{P}_{start,i}, \textbf{P}_{end,i} </math> respectively. The passage selection score for the <math> i</math>-th passage can be computed similarly, by indexing the <math>i</math>-th element of <math>\textbf{P}_{selected}</math>.<br />
<br />
= Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword match. This paper also covers a few training techniques that are required to successfully train a dense retriever. Overall, learning dense representations for first stage retrieval can potentially improve QA system performance, and has received more attention. <br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering&diff=47124Dense Passage Retrieval for Open-Domain Question Answering2020-11-28T01:15:58Z<p>Mdadbin: /* Experimental Setup */</p>
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
Open domain question answering is a task that finds question answers from a large collection of documents. Nowadays open domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one (1) is usually done through bag-of-words models, which count overlapping words and their frequencies in documents. Each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two, a reader would read the subset and locate the answer spans. Stage two is usually done through neural models, like Bert. While stage two benefits a lot from the recent advancement of neural language models, stage one still relies on traditional term-based models. This paper tries to improve stage one by using dense retrieval methods that generate dense, latent semantic document embedding, and demonstrates that dense retrieval methods can not only outperform BM25, but also improve the end-to-end QA accuracies. The code for this paper is freely available at https://github.com/facebookresearch/DPR.<br />
<br />
= Background =<br />
The following example clearly shows what problems open domain QA systems tackle. Given a question: "What is Uranus?", a system should find the answer spans from a large corpus. The corpus size can be billions of documents. In stage one, a retriever would select a small set of potentially relevant documents, which then would be fed to a neural reader in stage two for the answer spans extraction. Only a filtered subset of documents is processed by a neural reader since neural reading comprehension is expensive. It's impractical to process billions of documents using a neural reader. <br />
<br />
=== Mathematical Formulation ===<br />
Let's assume that we have a collection of <math> D </math> documents <math>d_1, d_2, \cdots, d_D </math> where each <math>d</math> is split into text passages of equal lengths as the basic retrieval units. Then the total number of passages is <math> M </math> in the corpus <math>\mathcal{C} = \{p_1, p_2, \ldots, p_{M}\} </math>. Each passage <math> p_i </math> is composed of a sequence of tokens <math> w^{(i)}_1, w^{(i)}_2, \cdots, w^{(i)}_{|p_i|} </math>. The task is to find a span of tokens <math> w^{(i)}_s, w^{(i)}_{s+1}, \cdots, w^{(i)}_{e} </math> from one of the passages <math> p_i </math> to answer the given question <math>q_i</math> . The corpus can contain millions of documents just to cover a wide variety of domains (e.g. Wikipedia) or even billions of them (e.g. the Web).<br />
The open-domain QA systems define the retriever as a function <math> R:(q,\mathcal{C}) \rightarrow \mathcal{C_F} </math> that takes as input a question <math>q</math> and a corpus <math>\mathcal{C}</math> and then returns a much smaller filter set of texts <math>\mathcal{C_F} \subset \mathcal{C}</math> , where <math>|\mathcal{C_F}| = k \ll |\mathcal{C}|</math>. The retriever can be evaluated by '''top-k retrieval accuracy''' that represents the fraction of relevant answers contained in <math>\mathcal{C_F}</math>.<br />
<br />
= Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq, and a passage encoder Ep. They map each input sentence to a d dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as embedding vectors, so d = 768.<br />
During the inference time, they used FAISS (Johnson et al., 2017) for the similarity search, which can be efficiently applied on billions of dense vectors<br />
<br />
== Training ==<br />
The goal in training the encoders is to create a vector space where the relevant pairs of question and passages (positive passages) have smaller distance than the irrelevant ones (negative passages) which is a metric learning problem.<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages.<br />
\begin{eqnarray}<br />
&& L(q_i, p^+_i, p^-_{i,1}, \cdots, p^-_{i,n}) \label{eq:training} \\<br />
&=& -\log \frac{ e^{\mathrm{sim}(q_i, p_i^+)} }{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^n{e^{\mathrm{sim}(q_i, p^-_{i,j})}}}. \nonumber<br />
\end{eqnarray}<br />
While positive passage selection is simple, where the passage containing the answer is selected, negative passage selection is less explicit. The authors experimented with three types of negative passages: (1) Random passage from the corpus; (2) false positive passages returned by BM25; (3) Gold positive passages from the training set — i.e., a positive passage for one query is considered as a negative passage for another query. The authors got the best model by using gold positive passages from the same batch as negatives. This trick is called in-batch negatives. Assume there are B pairs of (query <math>q_i</math>, positive passage <math>p_i</math>) in a mini-batch, then the negative passages for query <math>q_i</math> are the passages <math>p_i</math> when <math>j</math> is not equal to <math>i</math>.<br />
<br />
= Experimental Setup =<br />
The authors pre-processed Wikipedia documents, i.e. they removed semi-structured data, such as tables, info- boxes, lists, as well as the disambiguation pages and then splitted each document into passages of length 100 words. These passages form a candidate pool. The five QA datasets described below were used to build the train data. <br />
# Natural Questions (NQ) (Kwiatkowski et al., 2019): This was designed for the end-end question-answering tasks where the questions are Google search queries and the answers are annotated texts from the Wikipedia.<br />
# TriviaQA (Joshi et al., 2017): It is a collection of trivia questions and answers scraped from the Web.<br />
# WebQuestions (WQ) (Berant et al., 2013): It consists of questions selected using Google Search API where the answers are entities in Freebase<br />
# CuratedTREC (TREC) (Baudisˇ and Sˇedivy`, 2015): It is intended for the open-domain QA from unstructured corpora. It consists of TREC QA tracks and questions from other Web sources.<br />
# SQuAD v1.1 (Rajpurkar et al., 2016): It is a popular benchmark dataset for reading comprehension. The annotators were given a paragraph from the Wikipedia and were supposed to write questions which answers could be found in the given text.<br />
To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px | center]]<br />
<br />
= Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== Main Results ==<br />
<br />
The table below compares the top-k (for k=20 or k=100) accuracies of different retrieval systems on various popular QA datasets. Top-k accuracy is the percentage of examples in the test set for which the correct outcome occurs in the k most likely outcomes as predicted by the network. As can be seen, DPR consistently outperforms BM25 on all datasets, with the exception of SQuAD. Additionally, DPR tends to perform particularly well when a smaller k value in chosen. The authors speculated that the lower performance of DPR on SQuAD was for two reasons inherent to the dataset itself. First, the SQuAD dataset has high lexical overlap between passages and questions. Second, the dataset is biased because it is collected from a small number of Wikipedia articles - a point which has been argued by other researchers as well.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
<br />
== Ablation Study on Model Training ==<br />
The authors further analyzed how different training options would influence the model performance. The five training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Impact of gold passages, (4) Similarity and loss, and (5) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px | center ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, which is the standard 1-to-N training setting, is to pair each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting, the choices of negative passages did not have strong impact on the model performance. The top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, which is called in-batch negative setting, is to use positive passages for other queries in the same batch as negatives. The middle block shows the in-batch negative training results. The performance is significantly improved compared to the first setting. The last scheme is to augment in-batch negative with addition hard negative passages that are high ranked by BM25, but do not contain the correct answer. The bottom block shows the result for this setting, and the authors found that adding one addition BM25 hard negatives works the best.<br />
<br />
[[File: training_scheme.png | 500px | center]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of the two vectors, and the loss function is defined as the negative log likelihood of the positive passages. The authors experimented with other similarity functions such as L2-norm and cosine distance, and other loss functions such as triplet loss. The results showed these options didn't improve the model performance much.<br />
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss.<br />
<br />
== Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21-million passages on 8 GPUs, and 8.5 hours to index 21-million passages on a singe server. Conversely, building inverted index for 21-million passages only takes about 30 minutes. However, after the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px | center]]<br />
<br />
=== Reader for End-to-End ===<br />
<br />
The reader used would score the subset of <math>k</math> passages which were retrieved using DPR. Next, the reader would extract an answer span from each passage and assign a span score. The best span from the passage with the highest selection score is chosen as the final answer. To calculate these scores linear layers followed by a softmax are applied on top of the BERT output as follows:<br />
<br />
<math>\textbf{P}_{start,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (1)<br />
<br />
<math>\textbf{P}_{end,i}= softmax(\textbf{P}_i \textbf{w}_{start}) \in \mathbb{R}^L </math> where <math> L </math> is the number of words in the passage. (2)<br />
<br />
<math>\textbf{P}_{selected}= softmax(\hat{\textbf{P}}^T_i \textbf{w}_{selected}) \in \mathbb{R}^k </math> where <math> k </math> is the number of passages. (3)<br />
<br />
Here <math> \textbf{P}_i \in \mathbb{R}^{L \times h}</math> used in (1) and (2), <math>i</math> is one of the <math> k </math> passages and <math> h </math> is the hidden layer dimension of BERT's output. Additionally, <math> \hat{\textbf{P}} = [\textbf{P}^{[CLS]}_1,...,\textbf{P}^{[CLS]}_k] \in \mathbb{R}^{h \times k}</math>. Here <math> \textbf{w}_{start},\textbf{w}_{end},\textbf{w}_{selected} </math> are learnable vectors. The span score for <math> s </math>-th starting word to the <math> t</math>-th ending word of the <math> i </math>-th passage then can be calculated as <math> P_{start,i}(s) \times P_{end,i}(t) </math> where <math> s,t </math> are used to select the corresponding index of <math> \textbf{P}_{start,i}, \textbf{P}_{end,i} </math> respectively. The passage selection score for the <math> i</math>-th passage can be computed similarly, by indexing the <math>i</math>-th element of <math>\textbf{P}_{selected}</math>.<br />
<br />
= Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method which can generally outperform traditional bag-of-words methods in open domain question answering tasks. Dense retrieval methods make use of pre-training language model Bert and achieved state-of-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword match. This paper also covers a few training techniques that are required to successfully train a dense retriever. Overall, learning dense representations for first stage retrieval can potentially improve QA system performance, and has received more attention. <br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44247When Does Self-Supervision Improve Few-Shot Learning?2020-11-14T23:29:03Z<p>Mdadbin: /* Results */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabelled data, while labelling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.<br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Suggested architecture.</div><br />
<br />
The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:<br />
<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
It is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>, in the above equations. <br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
<br />
Figure 6 shows the accuracy of the meta-learner with SSL on different domains as function of distance between the supervised domain Ds and the self-supervised domain Dss. Once again we see that the effectiveness of SSL decreases with the distance from the supervised domain across all datasets.<br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 6: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:paper9.PNG&diff=44246File:paper9.PNG2020-11-14T23:25:24Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44244When Does Self-Supervision Improve Few-Shot Learning?2020-11-14T23:24:23Z<p>Mdadbin: /* Results */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabelled data, while labelling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.<br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Suggested architecture.</div><br />
<br />
The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:<br />
<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
It is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>, in the above equations. <br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
[[File:paper9.PNG |center|800px]]<br />
<br />
<div align="center">Figure 6: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F&diff=44241When Does Self-Supervision Improve Few-Shot Learning?2020-11-14T23:21:06Z<p>Mdadbin: /* Results */</p>
<hr />
<div>== Presented by ==<br />
Arash Moayyedi<br />
<br />
== Introduction ==<br />
This paper seeks to solve the generalization issues in few-shot learning by applying self-supervised learning techniques on the base dataset. Few-shot learning refers to training a classifier on minimalist datasets, contrary to the normal practice of using massive data, in hope of successfully classifying previously unseen, but related classes. Additionally, self-supervised learning aims at teaching the agent the internal structures of the images by providing it with tasks such as predicting the degree of rotation in an image. This method helps with the mentioned generalization issue, where the agent cannot distinguish the difference between newly introduced objects.<br />
<br />
== Previous Work ==<br />
This work leverages few-shot learning, where we aim to learn general representations, so that when facing novel classes, the agent can differentiate between them with training on just a few samples. Many few-shot learning methods currently exist, among which is this paper which focuses on Prototypical Networks or ProtoNets[1] for short. There is also a section of this paper that compares this model with model-agnostic meta-learner (MAML)[2].<br />
<br />
The other machine learning technique that this paper is based on is self-supervised learning. In this technique we find a use for unlabelled data, while labelling and maintaining massive data is expensive. The image itself already contains structural information that can be utilized. There exist many SSL tasks, such as removing a part of the data in order for the agent to reconstruct the lost part. Other methods include tasks prediction rotations, relative patch location, etc.<br />
<br />
The work in this paper is also related to multi-task learning. In multi-task learning training proceeds on multiple tasks concurrently to improve each other. Training on multiple tasks is known to decline the performance on individual tasks[3] and this seems to work only for very specific combinations and architectures. This paper shows that the combination of self-supervised tasks and few-shot learning are mutually beneficial to each other and this has significant practical implications since self-supervised tasks do not require any annotations.<br />
<br />
== Method ==<br />
The authors of this paper suggest a framework, as seen in Fig. 1, that combines few-shot learning with self-supervised learning. The labeled training data consists of a set of base classes in pairs of images and labels, and its domain is denoted by <math>\mathcal{D}_s</math>. Similarly, the domain of the images used for the self-supervised tasks is shown by <math>\mathcal{D}_{ss}</math>. This paper also analyzes the effects of having <math>\mathcal{D}_s = \mathcal{D}_{ss}</math> versus <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math> on the accuracy of the final few-shot learning task.<br />
<br />
[[File:arash1.JPG |center|800px]]<br />
<br />
<div align="center">Figure 1: Suggested architecture.</div><br />
<br />
The input is connected to a feed-forward convolutional network <math>f(x)</math> and it is the shared backbone between the classifier <math>g</math> and the self-supervised target predictor <math>h</math>. The classification loss <math>\mathcal{L}_s</math> and the task prediction loss <math>\mathcal{L}_{ss}</math> are written as:<br />
<br />
<br />
<math> \mathcal{L}_s := \sum_{(x_i,y_i)\in \mathcal{D}_s} \ell(g \circ f(x_i), y_i) + \mathcal{R}(f,g), </math><br />
<br />
<math> \mathcal{L}_{ss} := \sum_{x_i\in \mathcal{D}_{ss}} \ell(h \circ f(\hat{x_i}), \hat{y_i}). </math><br />
<br />
It is common to use cross-entropy loss for the loss function, <math> \ell </math>, and <math> \ell_2 </math> norm for the regularization, <math> \mathcal{R} </math>, in the above equations. <br />
<br />
The final loss is <math>\mathcal{L} := \mathcal{L}_s + \mathcal{L}_{ss}</math>, and thus the self-supervised losses act as a data-dependent regularizer for representation learning. The gradient updates are therefore performed based on this combined loss. It should be noted that in case <math>\mathcal{D}_s \neq \mathcal{D}_{ss}</math>, a forward pass is done on a batch per each dataset, and the two losses are combined.<br />
<br />
== Experiments ==<br />
The authors of this paper have experimented on the following datasets: Caltech-UCSD birds, Stanford cars, FGVC aircraft, Stanford dogs, Oxford flowers, mini-ImageNet, and tiered-Imagenet. Each dataset is divided into three disjoint sets: base set for training the parameters, val set for validation, and the novel set for testing with a few examples per each class. Data augmentation has been used with all these datasets to improve the results.<br />
<br />
The authors used a meta-learning method based on prototypical networks where training and testing are done in stages called meta-training and meta-testing. These networks are similar to distance-based learners and metric-based learners that train on label similarity. Two tasks have been used for the self-supervised learning part, rotation and the Jigsaw puzzle[4]. In the rotation task the image is rotated by an angle <math>\theta \in \{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}</math>, which results in the input, and the target label is the index of the rotation in the list. In the Jigsaw puzzle task, the image is tiled into <math>3\times3</math> tiles and then these tiles are shuffled to produce the input image. The target is a number in range of 35 based on the hamming distance.<br />
<br />
== Results ==<br />
The results on 5-way 5-shot classification accuracy can be seen in Fig. 2. ProtoNet has been used as a baseline and is compared with the Jigsaw task, the rotation task, and both of them combined. The result is that the Jigsaw task always improves the result. However, the rotation task seems to not provide much improvement on the flowers and the aircraft datasets. The authors speculate that this might be because flowers are mostly symmetrical, making the task too hard, and that the planes are usually horizontal, making the task too simple.<br />
<br />
[[File:arash2.JPG |center|800px]]<br />
<br />
<div align="center">Figure 2: Benefits of SSL for few-shot learning tasks.</div><br />
<br />
In another attempt, it is also proven that the improvements self-supervised learning provides are much higher in more difficult few-shot learning problems. As it can be observed from Fig. 3, SSL is found to be more beneficial with greyscale or low-resolution images, which make the classification harder for natural and man-made objects, respectively.<br />
<br />
[[File:arash3.JPG |center|800px]]<br />
<br />
<div align="center">Figure 3: Benefits of SSL for harder few-shot learning tasks.</div><br />
<br />
Self-supervision has also been combined with two other meta-learners in this work, MAML and a standard feature extractor trained with cross-entropy loss (softmax). Fig. 4 summarizes these results, and even though there is an accuracy gain in all scenarios (except for two), the ProtoNet + Jigsaw combination seems to work best.<br />
<br />
[[File:arash4.JPG |center|800px]]<br />
<br />
<div align="center">Figure 4: Performance on few-shot learning using different meta-learners.</div><br />
<br />
In Fig. 5 you can see the effects of size and domain of SSL on 5-way 5-shot classification accuracy. First, only 20 percent of the data is used for meta-learning. Fig. 5(a) shows the changes in the accuracy based on increasing the percentage of the images, from the whole dataset, used for SSL. It is observed that increasing the size of the SSL dataset domain has a positive effect, with diminishing ends. Fig. 5(b) shows the effects of shifting the domain of the SSL dataset, by changing a percentage of the images with pictures from other datasets. This has a negative result and moreover, training with SSL on the 20 percent of the images used for meta-learning is often better than increasing the size, but shifting the domain. This is shown as crosses on the chart.<br />
<br />
[[File:arash5.JPG |center|800px]]<br />
<br />
<div align="center">Figure 5: (a) Effect of number of images on SSL. (b) Effect of domain shift on SSL.</div><br />
<br />
[[File:paper9.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: Effectiveness of SSL as a function of domain distance between Ds and Dss (shown on top).</div><br />
<br />
The improvements obtained here generalize to other meta-learners as well. For instance, 5-way 5-shot accuracies across five fine-grained datasets for softmax, MAML, and ProtoNet improve when combined with the jigsaw puzzle task.<br />
<br />
Results also show that Self-supervision alone is not enough. A ResNet18 trained with SSL alone achieved 32.9% (w/ jigsaw) and 33.7% (w/ rotation) 5-way 5-shot accuracy averaged across five fine-grained datasets. While this is better than a random initialization (29.5%), it is dramatically worse than one trained with a simple cross-entropy loss (85.5%) on the labels.<br />
<br />
== Conclusion ==<br />
The authors of this paper provide us with a great insight on the effects of using SSL as a regulizer for few-shot learning methods. It is proven that SSL is beneficial in almost every case, however, these improvements are much higher in more difficult tasks. It also showed that the dataset used for SSL should not necessarily be large. Increasing the size of the mentioned dataset can possibly help, but only if the added images are from the same or a similar domain.<br />
<br />
== Critiques ==<br />
The authors of this paper could have analyzed other SSL tasks in addition to the Jigsaw puzzle and the rotation task, e.g. number of objects and removed patch prediction. Additionally, while analyzing the effects of the data used for SSL, they did not experiment with adding data from other domains, while fully utilizing the base dataset. Moreover, comparing their work with previous works (Fig. 6), we can see they have used mini-ImageNet with a picture size of <math>244\times224</math> in contrast to other methods that have used a <math>84\times84</math> image size. This gives them a huge advantage, however, we still notice that other methods with smaller images have achieved higher accuracy.<br />
<br />
[[File:arash6.JPG |center|800px]]<br />
<br />
<div align="center">Figure 6: Comparison with prior works on mini_ImageNet.</div><br />
<br />
== References ==<br />
<br />
[1]: Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: NeurIPS (2017)<br />
<br />
[2]: Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML (2017)<br />
<br />
[3]: Kokkinos, I.: Ubernet: Training a universal convolutional neural network for low-, mid-, and<br />
high-level vision using diverse datasets and limited memory. In: CVPR (2017)<br />
<br />
[4]: Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: ECCV (2016)</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:paper9.png&diff=44238File:paper9.png2020-11-14T23:17:00Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=44037a fair comparison of graph neural networks for graph classification2020-11-14T10:15:22Z<p>Mdadbin: /* Comparison with Published Results */</p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
Experimental reproducibility and replicability are critical topics in machine learning.<br />
Authors have often raised concerns about their lack in scientific publications<br />
to improve the quality of the field. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce <br />
the results from such experiments to tackle the problem of ambiguity in experimental procedures <br />
and the impossibility of reproducing results. They also Standardized the experimental environment <br />
so that the results could be reproduced while using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models which take graph-structured data as input, and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbours. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural.<br />
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and as previously mentioned are comprised of two building blocks, vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can also have a direction associated with them lending the name '''directed graph''' or they can be an '''undirected graph''' if an edge is shared by two vertices and there is no sense of direction. Vertices and edges of a graph can also have weights to them or really any amount of features imaginable. <br />
<br />
Now going one level of abstraction higher graphs can be categorized by structural patterns, we will refer to these as the types of graphs and this will not be an exhaustive list. A '''Bipartite graph''' is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> and there does not exist, <math> v_i,v_j \in V_k </math> where <math>k=1,2</math> s.t. <math>v_i</math> and <math>v_j </math> share an edge, however, <math>\exists v_i \in V_1, v_j \in V_2</math> where <math>v_i</math> and <math>v_j </math> share an edge. A '''Path graph''' is a graph where, <math>|V| \geq 2</math> and all vertices are connected sequentially meaning each vertex except the first and last have 2 edges, one coming from the previous vertex and one going to the next vertex. A '''Cycle graph''' is similar to a path graph except each node has 2 edges and are connected in a loop, meaning if you start at any vertex and follow an edge of each node going in one direction it will eventually lead back to the starting node. These are just three examples of graph types in reality there are many more and it can beneficial to be able to connect the structure of ones data to an appropriate graph type.<br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assesment<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross Validation.<br />
As model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection<br />
<br />
The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set. It also important to acknowledge the selection bias when selecting a model as this makes the validation accuracy of a selected model from a pool of candidates models a biased test accuracy.<br />
<br />
==Reproducibility Issues==<br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criteria is met or not), (-) indicates lack of information (i.e. no details are provided about the criteria).<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
as regards the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets but for ENZYMES, they follow Ralaivola et al. (2005); Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization Of the Evaluation Framework </div><br />
In order to better understand the Model Selection and the Model Assessment sections in the above figure, one can also take a look at the pseudo codes below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they had to limit some of the models by:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|900px|Image: 900 pixels]]<br />
<br />
<br />
In the above figure we can see the comparison between the average values of test results obtained by the authors of the paper and those reported in literature. The plots show how the test accuracies calculated in this paper are in most cases different from what reported in the literature, and the gap between the two estimates is usually consistent.<br />
<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of some important 5 graph neural network models on 9 datasets. The reproducibility and replicability problems are very important topics for science in general and even more important for fast-growing fields like machine learning. The authors proposed a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used for future graph classification papers such that comparison between proposed methods become clearer. The results of the paper are interesting as in some cases the baseline methods outperform other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, this was a good idea to discuss in more depth why baseline models are performing better? Or why the results across different datasets are not consistent? Should we choose the best GNN based on the type of data? If so, what are the guidelines?<br />
<br />
As well known in the literature of GNNs that they are designed to solve the non-Euclidean problems on graph-structured data. This is kinds of problems are hardly be handled by general deep learning techniques and there are different types of designed graphs that handle various mechanisms i.e. heat diffusion mechanisms. In my opinion, there would a better way to categorize existing GNN models into spatial and spectral domains and reveal connections among subcategories in each domain. With the increase of the GNNs models, further analysis must be handled to establish a strong link across the spatial and spectral domains to be more interpretable and transparent to the application.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.<br />
de.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=44036a fair comparison of graph neural networks for graph classification2020-11-14T10:14:06Z<p>Mdadbin: </p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
Experimental reproducibility and replicability are critical topics in machine learning.<br />
Authors have often raised concerns about their lack in scientific publications<br />
to improve the quality of the field. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce <br />
the results from such experiments to tackle the problem of ambiguity in experimental procedures <br />
and the impossibility of reproducing results. They also Standardized the experimental environment <br />
so that the results could be reproduced while using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models which take graph-structured data as input, and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbours. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural.<br />
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and as previously mentioned are comprised of two building blocks, vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can also have a direction associated with them lending the name '''directed graph''' or they can be an '''undirected graph''' if an edge is shared by two vertices and there is no sense of direction. Vertices and edges of a graph can also have weights to them or really any amount of features imaginable. <br />
<br />
Now going one level of abstraction higher graphs can be categorized by structural patterns, we will refer to these as the types of graphs and this will not be an exhaustive list. A '''Bipartite graph''' is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> and there does not exist, <math> v_i,v_j \in V_k </math> where <math>k=1,2</math> s.t. <math>v_i</math> and <math>v_j </math> share an edge, however, <math>\exists v_i \in V_1, v_j \in V_2</math> where <math>v_i</math> and <math>v_j </math> share an edge. A '''Path graph''' is a graph where, <math>|V| \geq 2</math> and all vertices are connected sequentially meaning each vertex except the first and last have 2 edges, one coming from the previous vertex and one going to the next vertex. A '''Cycle graph''' is similar to a path graph except each node has 2 edges and are connected in a loop, meaning if you start at any vertex and follow an edge of each node going in one direction it will eventually lead back to the starting node. These are just three examples of graph types in reality there are many more and it can beneficial to be able to connect the structure of ones data to an appropriate graph type.<br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assesment<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross Validation.<br />
As model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection<br />
<br />
The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set. It also important to acknowledge the selection bias when selecting a model as this makes the validation accuracy of a selected model from a pool of candidates models a biased test accuracy.<br />
<br />
==Reproducibility Issues==<br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criteria is met or not), (-) indicates lack of information (i.e. no details are provided about the criteria).<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
as regards the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets but for ENZYMES, they follow Ralaivola et al. (2005); Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization Of the Evaluation Framework </div><br />
In order to better understand the Model Selection and the Model Assessment sections in the above figure, one can also take a look at the pseudo codes below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they had to limit some of the models by:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|900px|Image: 900 pixels]]<br />
In the above figure we can see the comparison between the average values of test results obtained by the authors of the paper and those reported in literature. The plots show how the test accuracies calculated in this paper are in most cases different from what reported in the literature, and the gap between the two estimates is usually consistent.<br />
<br />
<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of some important 5 graph neural network models on 9 datasets. The reproducibility and replicability problems are very important topics for science in general and even more important for fast-growing fields like machine learning. The authors proposed a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used for future graph classification papers such that comparison between proposed methods become clearer. The results of the paper are interesting as in some cases the baseline methods outperform other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, this was a good idea to discuss in more depth why baseline models are performing better? Or why the results across different datasets are not consistent? Should we choose the best GNN based on the type of data? If so, what are the guidelines?<br />
<br />
As well known in the literature of GNNs that they are designed to solve the non-Euclidean problems on graph-structured data. This is kinds of problems are hardly be handled by general deep learning techniques and there are different types of designed graphs that handle various mechanisms i.e. heat diffusion mechanisms. In my opinion, there would a better way to categorize existing GNN models into spatial and spectral domains and reveal connections among subcategories in each domain. With the increase of the GNNs models, further analysis must be handled to establish a strong link across the spatial and spectral domains to be more interpretable and transparent to the application.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.<br />
de.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:paper11.png&diff=44035File:paper11.png2020-11-14T10:07:23Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=44034a fair comparison of graph neural networks for graph classification2020-11-14T10:06:26Z<p>Mdadbin: </p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
Experimental reproducibility and replicability are critical topics in machine learning.<br />
Authors have often raised concerns about their lack in scientific publications<br />
to improve the quality of the field. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce <br />
the results from such experiments to tackle the problem of ambiguity in experimental procedures <br />
and the impossibility of reproducing results. They also Standardized the experimental environment <br />
so that the results could be reproduced while using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models which take graph-structured data as input, and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbours. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural.<br />
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and as previously mentioned are comprised of two building blocks, vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can also have a direction associated with them lending the name '''directed graph''' or they can be an '''undirected graph''' if an edge is shared by two vertices and there is no sense of direction. Vertices and edges of a graph can also have weights to them or really any amount of features imaginable. <br />
<br />
Now going one level of abstraction higher graphs can be categorized by structural patterns, we will refer to these as the types of graphs and this will not be an exhaustive list. A '''Bipartite graph''' is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> and there does not exist, <math> v_i,v_j \in V_k </math> where <math>k=1,2</math> s.t. <math>v_i</math> and <math>v_j </math> share an edge, however, <math>\exists v_i \in V_1, v_j \in V_2</math> where <math>v_i</math> and <math>v_j </math> share an edge. A '''Path graph''' is a graph where, <math>|V| \geq 2</math> and all vertices are connected sequentially meaning each vertex except the first and last have 2 edges, one coming from the previous vertex and one going to the next vertex. A '''Cycle graph''' is similar to a path graph except each node has 2 edges and are connected in a loop, meaning if you start at any vertex and follow an edge of each node going in one direction it will eventually lead back to the starting node. These are just three examples of graph types in reality there are many more and it can beneficial to be able to connect the structure of ones data to an appropriate graph type.<br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assesment<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross Validation.<br />
As model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection<br />
<br />
The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set. It also important to acknowledge the selection bias when selecting a model as this makes the validation accuracy of a selected model from a pool of candidates models a biased test accuracy.<br />
<br />
==Reproducibility Issues==<br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criteria is met or not), (-) indicates lack of information (i.e. no details are provided about the criteria).<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
as regards the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets but for ENZYMES, they follow Ralaivola et al. (2005); Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization Of the Evaluation Framework </div><br />
In order to better understand the Model Selection and the Model Assessment sections in the above figure, one can also take a look at the pseudo codes below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they had to limit some of the models by:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
===COMPARISON WITH PUBLISHED RESULTS===<br />
<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of some important 5 graph neural network models on 9 datasets. The reproducibility and replicability problems are very important topics for science in general and even more important for fast-growing fields like machine learning. The authors proposed a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used for future graph classification papers such that comparison between proposed methods become clearer. The results of the paper are interesting as in some cases the baseline methods outperform other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, this was a good idea to discuss in more depth why baseline models are performing better? Or why the results across different datasets are not consistent? Should we choose the best GNN based on the type of data? If so, what are the guidelines?<br />
<br />
As well known in the literature of GNNs that they are designed to solve the non-Euclidean problems on graph-structured data. This is kinds of problems are hardly be handled by general deep learning techniques and there are different types of designed graphs that handle various mechanisms i.e. heat diffusion mechanisms. In my opinion, there would a better way to categorize existing GNN models into spatial and spectral domains and reveal connections among subcategories in each domain. With the increase of the GNNs models, further analysis must be handled to establish a strong link across the spatial and spectral domains to be more interpretable and transparent to the application.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.<br />
de.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=43837Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-12T23:09:19Z<p>Mdadbin: </p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding small noise to an input image and as a result, the prediction of the model changed.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images as well along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been build that protects neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses based on heuristics and tricks are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an image, and at the same time guarantees that the input is not adversarially manipulated. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.<br />
<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image), and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images is created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image.<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps, the effect of <math> \lambda_s</math>, and the importance of <math> \lambda_s</math> on the each losses in the cost function, the authors have tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figure 4, 5, and 6, we can see that the <math>L(\delta)</math> (classification loss), <math>TV(\delta)</math> (Total Variation loss), <math>C(\delta)</math> (color regularizer) will converge to zero with 10 SGD steps. Note that since only 1-channel attack was used in this part of the experiment the <math>dissim(\delta)</math>was indeed zero. <br />
In figure 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to prove that their approach to attack a certified model was actually able to break those defenses. The datasets used for both of these experiments were CIFAR10 and ImageNet dataset.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks. The deep neural network model is trained on a randomly augmented batch of images. Perturbations are made to the original image such that they satisfy the previously defined conditions and spoof certificates are generated for an incorrect class by generating multiple adversarial images.<br />
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii means a stronger/more confident certificate) </div><br />
<br />
The third and the fifth column corresponds to the mean radius of the certified region of the original image and the mean radius of the spoof certificate of the perturbed images, respectively. It was observed that the mean radius of the certificate of adversarial images was greater than the mean radius of the original image certificate. This proves that the 'Shadow Attack' approach was successful in creating spoof certificates of greater radius and with the wrong label. This also proves that the approach used in the paper was successful in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
Crown IBP is an adversarial defense against <math>l_\text{inf}</math>-norm bounded attacks. The same approach was applied for the CROWN-IBP defense and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset, and CROWN-IBP models. Smaller is better.) </div><br />
<br />
<br />
The above table shows the robustness errors in the case of CROWN-IBP method and the attack images. It is seen that the errors in the case of the attack were less than the equivalent errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was successful in breaking the <math>l_\text{inf}</math>-norm certified defenses as well.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
It is noticeable in this paper that using the mathematical formulation of the defenses and certifications is considered a weak method, where as the constraint is imposed by <math> l_{p} </math> as assumed in equation \eqref{eq:op}. The top models can not achieve certifications beyond <math> \epsilon = 0.3 </math> disturbance in <math> l_{2} </math> norm, while disturbances <math> \epsilon = 4 </math> added to the target input are barely noticeable by human eyes, and <math> \epsilon = 100 </math> , when applied to the original image are still easily classified by humans as belonging to the same class. As discussed by many authors, the perception of multi-dimensional space by human eyes goes beyond what the <math> l_{p} </math> norm is capable of capturing and synthesizing. It is yet to be proposed more comprehensive metrics and algorithms capable of capturing the correlation between pixels of an image or input data which can better translate to optimization algorithms how humans distinguish features of an input image. Such a metric would allow the optimization algorithms to have better intuition on the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy,Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Ablation.png&diff=43836File:Ablation.png2020-11-12T23:05:35Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration&diff=43834The Curious Case of Degeneration2020-11-12T19:59:17Z<p>Mdadbin: </p>
<hr />
<div>== Presented by == <br />
Donya Hamzeian<br />
== Introduction == <br />
Text generation is the act of automatically generating natural language texts like summarization, neural machine translation, fake news generation and etc. Degeneration happens when the output text is incoherent or produces repetitive results. For example in the figure below, the GPT2 model tries to generate the continuation text given the context. On the left side, the beam-search was used as the decoding strategy which has obviously stuck in a repetitive loop. On the right side, however, you can see how the pure sampling decoding strategy has generated incoherent results. <br />
[[File: GPT2_example.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
As a quick recap, the beam search is a best-first search algorithm. At each step, it selects the K most-probable predictions, where K is the beam width parameter set by humans. If K is 1, the beam search algorithm becomes the greedy search algorithm, where only the best prediction is picked. In beam search, the system only explores K paths, which reduces the memory requirements. <br />
<br />
The authors argue that decoding strategies that are based on maximization like beam search lead to degeneration even with powerful models like GPT-2. Even though there are some utility functions that encourage diversity, they are not enough and the text generated by maximization, beam-search, or top-k sampling is too probable which indicates the lack of diversity (variance) compared to human-generated texts<br />
<br />
Some may raise this question that the problem with beam-search may be due to search error i.e. they are more probable phrases that beam search is unable to find, but the point is that natural language has lower per-token probability on average and people usually optimize against saying the obvious.<br />
<br />
The authors blame the long, unreliable tail in the probability distribution of tokens that the model samples from i.e. vocabularies with low probability frequently appear in the output text. So, top-k sampling with high values of k may produce texts closer to human texts, yet they have high variance in likelihood leading to incoherency issues. <br />
Therefore, instead of fixed k, it is good to dynamically increase or decrease the number of candidate tokens. Nucleus Sampling which is the contribution of this paper does this expansion and contraction of the candidate pool.<br />
<br />
<br />
===The problem with a fixed k===<br />
<br />
In the figure below, it can be seen why having a fixed k in the top-k sampling decoding strategy can lead to degenerative results, more specifically, incoherent and low diversity texts. For instance, in the left figure, the distribution of the next token is flat i.e. there are many tokens with nearly equal probability to be the next token. In this case, if we choose a small k, like 5, some tokens like "meant" and "want" may not appear in the generated text which makes it less diverse. On the other hand, in the right figure, the distribution of the next token is peaked, i.e there are very few words with very high probability. In this case, if we choose k to be large, like 10, we may end up choosing tokens like "going" and "n't" which makes the generated text incoherent. Therefore, it seems that having a fixed-k may lead to degeneration<br />
<br />
<br />
[[File: fixed-k.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
==Language Model Decoding==<br />
There are two types of generation tasks. <br />
<br />
1. Directed generation tasks: In these tasks, there are pairs of (input, output), where the model tries to generate the output text which is tightly scoped by the input text. Because of this constraint, these tasks suffer less from the degeneration. Summarization, neural machine translation, and input-to-text generation are some examples of these tasks.<br />
<br />
2. Open-ended generation tasks like conditional story generation or like the tasks in the above figure have high degrees of freedom. As a result, degeneration is more frequent in these tasks, and in fact, they are the focus of this paper.<br />
<br />
The goal of the open-ended tasks is to generate the next n continuation tokens given a context sequence with m tokens. That is to maximize the following probability. <br />
====Nucleus Sampling====<br />
This decoding strategy is indeed truncating the long tail of the probability distribution. In order to do that, first, we need to find the smallest vocabulary set <math>V^{(p)}</math> which satisfies <math>\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math>. Then set <math>p'=\Sigma_{x \in V^{(p)}} P(x|x_{1:i-1}) \ge p</math> and rescale the probability distribution with <math>p'</math> and select the tokens from <math>P'</math>. <br />
<math><br />
P'(x|x_{1:i-1}) = \begin{cases}<br />
\frac{P(x|x_{1:i-1})}{p'}, & \mbox{if } x \in V^{(p)} \\<br />
0 & \mbox{if } otherwise<br />
\end{cases}<br />
<br />
</math><br />
<br />
====Top-k Sampling====<br />
Top-k sampling also relies on truncating the distribution. In this decoding strategy, we need to first find a set of tokens with size k <math>V^{(k)} </math> which maximizes <math>\Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math> and set <math>p' = \Sigma_{x \in V^{(k)}} P(x|x_{1:i-1})</math>. Finally, rescale the probability distribution similar to the Nucleus sampling.<br />
<br />
<br />
====Sampling with Temperature====<br />
In this method, which was proposed in [1], the probability of tokens are calculated according to the equation below where 0<t<1 and <math>u_{1:|V|} </math> are logits. Recent studies have shown that lowering t improves the quality of the generated texts while it decreases diversity.<br />
<br />
<math><br />
P(x= V_l|x_{1:i-1}) = \frac{\exp(\frac{u_l}{t})}{\Sigma_{l'}\exp(\frac{u'_l}{t})}<br />
</math><br />
<br />
==Likelihood Evaluation==<br />
To see the results of the nucleus decoding strategy, they used GPT2-large that was trained on WebText to generate 5000 text documents conditioned on initial paragraphs with 1-40 tokens.<br />
<br />
<br />
====Perplexity====<br />
<br />
This score was used to compare the coherence of different decoding strategies. By looking at the graphs below, it is possible for Sampling, Top-k sampling, and Nucleus strategies to be tuned such that they achieve a perplexity close to the perplexity of human-generated texts; however, with the best parameters according to the perplexity the first two strategies generate low diversity texts. <br />
<br />
[[File: Perplexity.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<br />
==Distributional Statistical Evaluation==<br />
====Zipf Distribution Analysis====<br />
Zipf's law says that the frequency of any word is inversely proportional to its rank in the frequency table, so it suggests that there is an exponential relationship between the rank of each word with its frequency in the text. By looking at the graph below, it seems that the Zipf's distribution of the texts generated with Nucleus sampling is very close to the Zipf's distribution of the human-generated(gold) texts, while beam-search is very different from them.<br />
[[File: Zipf.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<br />
====Self BLEU====<br />
The Self-BLEU score[2] is used to compare the diversity of each decoding strategy and was computed for each generated text using all other generations in the evaluation set as references. In the figure below, the self-BLEU score of three decoding strategies- Top-K sampling, Sampling with Temperature, and Nucleus sampling- were compared against the Self-BLEU of human-generated texts. By looking at the figure below, we see that high values of parameters that generate the self-BLEU close to that of the human texts result in incoherent, low perplexity, in Top-K sampling and Temperature Sampling, while this is not the case for Nucleus sampling. <br />
<br />
[[File: BLEU.png |caption=Example text|center |800px|caption position=bottom]]<br />
<br />
<br />
==Conclusion==<br />
In this paper, different decoding strategies were analyzed on open-ended generation tasks. They showed that likelihood maximization decoding causes degeneration where decoding strategies- which rely on truncating the probability distribution of tokens- especially Nucleus sampling can produce coherent and diverse texts close to human-generated texts.<br />
<br />
== References ==<br />
[1]: David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.<br />
<br />
[2]: Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:fixed-k.png&diff=43833File:fixed-k.png2020-11-12T19:56:59Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data&diff=43361Learning The Difference That Makes A Difference With Counterfactually-Augmented Data2020-11-07T16:34:08Z<p>Mdadbin: /* Example */</p>
<hr />
<div>== Presented by == <br />
Syed Saad Naseem<br />
<br />
== Introduction == <br />
This paper addresses the problem of building models for NLP tasks that are robust against spurious correlations in the data. The authors tackle this problem by introducing a human-in-the-loop method in which human annotators are hired to modify data in order to change the meaning of the text or make it in a way that it represents the opposite label for example if a text had a positive sentiment to it, the annotators change the text such that it represents the negative sentiment label with minimal changes to the text. They refer to this process as counterfactual augmentation. The authors apply this method to the IMDB sentiment dataset and to SNLI and show that many models can not perform well on the augmented dataset if trained only on the original dataset.<br />
<br />
== Background == <br />
'''What are spurious patterns in NLP, and why do they occur?'''<br />
<br />
Current supervised machine learning systems try to learn the underlying features of input data that associate the inputs with the corresponding labels. Take Twitter sentiment analysis as an example, there might be lots of negative tweets about Donald Trump. If we use those tweets as training data, the ML systems tend to associate "Trump" with the label: Negative. However, the text itself is completely neutral. The association between the text trump and the label negative is spurious. One way to explain why this occurs is that association does not necessarily mean causation. For example, the color gold might be associated with success. But it does not cause success. Current ML systems might learn such undesired associations and then deduce from them. <br />
<br />
<br />
== Data Collection ==<br />
The authors used Amazon’s Mechanical Turk which is a crowdsourcing platform using to recruit editors. They hired these editors to revise each document. For sentiment analysis, they directed the annotators to revise this negative movie review to make it positive, without making any gratuitous changes. For the NLI tasks, which are 3-class classification tasks, the annotators were asked to modify the premise of the text while keeping the hypothesis intact and vice versa.<br />
<br />
After the data collection, a different set of workers was employed to verify whether the given label<br />
accurately described the relationship between each premise-hypothesis pair. Each pair was presented to 3 workers and the pair was only accepted if all 3 of the workers approved that the text is accurate. This entire process cost the authors about $10778.<br />
<br />
== Example ==<br />
In the picture below, we can see an example of spurious correlation and how the method presented here can address that. The picture shows the most important features learned by SVM. As we can see in the left plot, when the model is trained only on the original data, the word "horror" is associated with negative label and the word "romantic" is associated with the positive label. This is an example of spurious correlation, because we definitely can have both bad romantic and good horror movies. The middle plot shows the case that the model is trained only on the revised dataset. As we expected the situation is vice versa, that is, "horror" and "romantic" are associated to the positive and negative labels respectively. However, the problem is solved in the right plot where the authors trained the model on both the original and the revised datasets. The words "horror" and "romantic" are no longer among the most important features which is what we wanted.<br />
<br />
[[File: SVM features.png | center |800px]]<br />
<br />
== Experiments ==<br />
<br />
The authors carried out experiments on a total of 5 models: Support Vector Machines (SVMs), Naive Bayes<br />
(NB) classifiers, Bidirectional Long Short-Term Memory Networks, ELMo models with LSTM, and fine-tuned BERT models. Furthermore, they evaluated their models on Amazon reviews datasets aggregated over six genres, they also evaluated the models on twitters sentiment dataset and on Yelp reviews released as part of a Yelp dataset challenge. They showed that almost all cases, models trained on the counterfactually-augmented<br />
IMDb dataset perform better than models trained on comparable quantities of original data, this is shown in the table below.<br />
<br />
[[File:result1_syed.PNG]]<br />
<br />
== Conclusion ==<br />
<br />
The authors propose a new way to augment textual datasets for the task of sentiment analysis, this helps the learning methods used to generalize better by concentrating on learning the different that makes a difference. I believe that the main contribution of the paper is the introduction of the idea of counterfactual datasets for sentiment analysis. The paper proposes an interesting approach to tackle NLP problems, shows intriguing experimental results, and presents us with an interesting dataset that may be useful for future research.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data&diff=43360Learning The Difference That Makes A Difference With Counterfactually-Augmented Data2020-11-07T16:31:05Z<p>Mdadbin: </p>
<hr />
<div>== Presented by == <br />
Syed Saad Naseem<br />
<br />
== Introduction == <br />
This paper addresses the problem of building models for NLP tasks that are robust against spurious correlations in the data. The authors tackle this problem by introducing a human-in-the-loop method in which human annotators are hired to modify data in order to change the meaning of the text or make it in a way that it represents the opposite label for example if a text had a positive sentiment to it, the annotators change the text such that it represents the negative sentiment label with minimal changes to the text. They refer to this process as counterfactual augmentation. The authors apply this method to the IMDB sentiment dataset and to SNLI and show that many models can not perform well on the augmented dataset if trained only on the original dataset.<br />
<br />
== Background == <br />
'''What are spurious patterns in NLP, and why do they occur?'''<br />
<br />
Current supervised machine learning systems try to learn the underlying features of input data that associate the inputs with the corresponding labels. Take Twitter sentiment analysis as an example, there might be lots of negative tweets about Donald Trump. If we use those tweets as training data, the ML systems tend to associate "Trump" with the label: Negative. However, the text itself is completely neutral. The association between the text trump and the label negative is spurious. One way to explain why this occurs is that association does not necessarily mean causation. For example, the color gold might be associated with success. But it does not cause success. Current ML systems might learn such undesired associations and then deduce from them. <br />
<br />
<br />
== Data Collection ==<br />
The authors used Amazon’s Mechanical Turk which is a crowdsourcing platform using to recruit editors. They hired these editors to revise each document. For sentiment analysis, they directed the annotators to revise this negative movie review to make it positive, without making any gratuitous changes. For the NLI tasks, which are 3-class classification tasks, the annotators were asked to modify the premise of the text while keeping the hypothesis intact and vice versa.<br />
<br />
After the data collection, a different set of workers was employed to verify whether the given label<br />
accurately described the relationship between each premise-hypothesis pair. Each pair was presented to 3 workers and the pair was only accepted if all 3 of the workers approved that the text is accurate. This entire process cost the authors about $10778.<br />
<br />
== Example ==<br />
In the picture below, we can see an example of spurious correlation and how the method presented here can address that. The picture shows the most important features learned by SVM. As we can see in the left plot, when the model is trained only on the original data, the word "horror" is associated with negative label and the word "romantic" is associated to the positive label. This is an example of spurious correlation, because we definitely can have both bad romantic and good horror movies. The middle plot shows the case that the model is trained only on the revised dataset. As we expected the situation is vice versa, that is, "horror" and "romantic" are associated to the labels positive and negative respectively. However, the problem is solved when the authors trained the model on both the original and the revised datasets, because the words "horror" and "romantic" are no longer among the most important features.<br />
<br />
[[File: SVM features.png | center |800px]]<br />
<br />
<br />
== Experiments ==<br />
<br />
The authors carried out experiments on a total of 5 models: Support Vector Machines (SVMs), Naive Bayes<br />
(NB) classifiers, Bidirectional Long Short-Term Memory Networks, ELMo models with LSTM, and fine-tuned BERT models. Furthermore, they evaluated their models on Amazon reviews datasets aggregated over six genres, they also evaluated the models on twitters sentiment dataset and on Yelp reviews released as part of a Yelp dataset challenge. They showed that almost all cases, models trained on the counterfactually-augmented<br />
IMDb dataset perform better than models trained on comparable quantities of original data, this is shown in the table below.<br />
<br />
[[File:result1_syed.PNG]]<br />
<br />
== Conclusion ==<br />
<br />
The authors propose a new way to augment textual datasets for the task of sentiment analysis, this helps the learning methods used to generalize better by concentrating on learning the different that makes a difference. I believe that the main contribution of the paper is the introduction of the idea of counterfactual datasets for sentiment analysis. The paper proposes an interesting approach to tackle NLP problems, shows intriguing experimental results, and presents us with an interesting dataset that may be useful for future research.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:SVM_features.png&diff=43359File:SVM features.png2020-11-07T16:28:35Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data&diff=43358Learning The Difference That Makes A Difference With Counterfactually-Augmented Data2020-11-07T16:27:56Z<p>Mdadbin: </p>
<hr />
<div>== Presented by == <br />
Syed Saad Naseem<br />
<br />
== Introduction == <br />
This paper addresses the problem of building models for NLP tasks that are robust against spurious correlations in the data. The authors tackle this problem by introducing a human-in-the-loop method in which human annotators are hired to modify data in order to change the meaning of the text or make it in a way that it represents the opposite label for example if a text had a positive sentiment to it, the annotators change the text such that it represents the negative sentiment label with minimal changes to the text. They refer to this process as counterfactual augmentation. The authors apply this method to the IMDB sentiment dataset and to SNLI and show that many models can not perform well on the augmented dataset if trained only on the original dataset.<br />
<br />
== Background == <br />
'''What are spurious patterns in NLP, and why do they occur?'''<br />
<br />
Current supervised machine learning systems try to learn the underlying features of input data that associate the inputs with the corresponding labels. Take Twitter sentiment analysis as an example, there might be lots of negative tweets about Donald Trump. If we use those tweets as training data, the ML systems tend to associate "Trump" with the label: Negative. However, the text itself is completely neutral. The association between the text trump and the label negative is spurious. One way to explain why this occurs is that association does not necessarily mean causation. For example, the color gold might be associated with success. But it does not cause success. Current ML systems might learn such undesired associations and then deduce from them. <br />
<br />
<br />
== Data Collection ==<br />
The authors used Amazon’s Mechanical Turk which is a crowdsourcing platform using to recruit editors. They hired these editors to revise each document. For sentiment analysis, they directed the annotators to revise this negative movie review to make it positive, without making any gratuitous changes. For the NLI tasks, which are 3-class classification tasks, the annotators were asked to modify the premise of the text while keeping the hypothesis intact and vice versa.<br />
<br />
After the data collection, a different set of workers was employed to verify whether the given label<br />
accurately described the relationship between each premise-hypothesis pair. Each pair was presented to 3 workers and the pair was only accepted if all 3 of the workers approved that the text is accurate. This entire process cost the authors about $10778.<br />
<br />
== Example ==<br />
In the picture below, we can see an example of spurious correlation and how the method presented here can address that. The picture shows the most important features learned by SVM. As we can see in the left plot, when the model is trained only on the original data, the word "horror" is associated with negative label and the word "romantic" is associated to the positive label. This is an example of spurious correlation, because we definitely can have both bad romantic and good horror movies. The middle plot shows the case that the model is trained only on the revised dataset. As we expected the situation is vice versa, that is, "horror" and "romantic" are associated to the labels positive and negative respectively. However, the problem is solved when the authors trained the model on both the original and the revised datasets, because the words "horror" and "romantic" are no longer among the most important features.<br />
<br />
== Experiments ==<br />
<br />
The authors carried out experiments on a total of 5 models: Support Vector Machines (SVMs), Naive Bayes<br />
(NB) classifiers, Bidirectional Long Short-Term Memory Networks, ELMo models with LSTM, and fine-tuned BERT models. Furthermore, they evaluated their models on Amazon reviews datasets aggregated over six genres, they also evaluated the models on twitters sentiment dataset and on Yelp reviews released as part of a Yelp dataset challenge. They showed that almost all cases, models trained on the counterfactually-augmented<br />
IMDb dataset perform better than models trained on comparable quantities of original data, this is shown in the table below.<br />
<br />
[[File:result1_syed.PNG]]<br />
<br />
== Conclusion ==<br />
<br />
The authors propose a new way to augment textual datasets for the task of sentiment analysis, this helps the learning methods used to generalize better by concentrating on learning the different that makes a difference. I believe that the main contribution of the paper is the introduction of the idea of counterfactual datasets for sentiment analysis. The paper proposes an interesting approach to tackle NLP problems, shows intriguing experimental results, and presents us with an interesting dataset that may be useful for future research.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43342ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-06T04:19:09Z<p>Mdadbin: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed. For example, Chen et al. (2016)[1] uses an extra forward pass, but reduces the memory requirements in a technique called gradient checkpointing. Moreover, Gomez et al. (2017)[2] leverages a method to reconstruct a layer's activations from its next layer, in order to eliminate the need to store these activations, freeing up the memory. In addition, Raffel et al. (2019)[3], leverage model parallelization while training a massive model. The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">\\E</math>=<math display="inline">\\H</math> i.e. the size of the vocabulary embedding (<math display="inline">\\E</math>) and the size of the hidden layer (<math display="inline">\\H</math>) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘<math display="inline">\\E</math>’ is meant to learn context-independent representations while the hidden-layer embedding ‘<math display="inline">\\H</math>’ is meant to learn context-dependent representation which usually is harder. However if we increase <math display="inline">\\H</math> and <math display="inline">\\E</math> together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is <math display="inline">\\V \cdot E</math> where <math display="inline">\\V</math> is usually quite large, <math display="inline">\\V</math> is the size of the vocabulary and equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size <math display="inline">\\E</math> and then project it to the hidden layer. This reduces embedding parameters from <math display="inline">\\O(V \cdot H)</math> to <math display="inline"> \\O(V \cdot E+E \cdot H) </math> which is significant when <math display="inline">\\H</math> is much larger than <math display="inline">\\E</math>.<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al, 2019). These observations made it possible to use the same weights at different layers.<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compared to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. As we can see the NSP cannot solve the SOP task (it performs at random 52%) but the SOP can solve the NSP task to a acceptable degree (78.9%). We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png | center |800px]]<br />
<br />
==Conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png | center |800px]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png | center |800px]]<br />
<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results when they let the BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge starts to give the higher accuracy.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut, Zhenzhong. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43341ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-06T04:18:45Z<p>Mdadbin: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed. For example, Chen et al. (2016)[1] uses an extra forward pass, but reduces the memory requirements in a technique called gradient checkpointing. Moreover, Gomez et al. (2017)[2] leverages a method to reconstruct a layer's activations from its next layer, in order to eliminate the need to store these activations, freeing up the memory. In addition, Raffel et al. (2019)[3], leverage model parallelization while training a massive model. The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">\\E</math>=<math display="inline">\\H</math> i.e. the size of the vocabulary embedding (<math display="inline">\\E</math>) and the size of the hidden layer (<math display="inline">\\H</math>) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘<math display="inline">\\E</math>’ is meant to learn context-independent representations while the hidden-layer embedding ‘<math display="inline">\\H</math>’ is meant to learn context-dependent representation which usually is harder. However if we increase <math display="inline">\\H</math> and <math display="inline">\\E</math> together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is <math display="inline">\\V \cdot E</math> where <math display="inline">\\V</math> is usually quite large, <math display="inline">\\V</math> is the size of the vocabulary and equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size <math display="inline">\\E</math> and then project it to the hidden layer. This reduces embedding parameters from <math display="inline">\\O(V \cdot H)</math> to <math display="inline"> \\O(V \cdot E+E \cdot H) </math> which is significant when <math display="inline">\\H</math> is much larger than <math display="inline">\\E</math>.<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al, 2019). These observations made it possible to use the same weights at different layers.<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compared to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. As we can see the NSP cannot solve the SOP task (it performs at random 52%) but the SOP can solve the NSP task to a acceptable degree (78.9%). We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png | center |800px]]<br />
<br />
==Conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png | center |800px]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png]]<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results when they let the BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge starts to give the higher accuracy.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut, Zhenzhong. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43340ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-06T04:18:15Z<p>Mdadbin: /* Removing dropout */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed. For example, Chen et al. (2016)[1] uses an extra forward pass, but reduces the memory requirements in a technique called gradient checkpointing. Moreover, Gomez et al. (2017)[2] leverages a method to reconstruct a layer's activations from its next layer, in order to eliminate the need to store these activations, freeing up the memory. In addition, Raffel et al. (2019)[3], leverage model parallelization while training a massive model. The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">\\E</math>=<math display="inline">\\H</math> i.e. the size of the vocabulary embedding (<math display="inline">\\E</math>) and the size of the hidden layer (<math display="inline">\\H</math>) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘<math display="inline">\\E</math>’ is meant to learn context-independent representations while the hidden-layer embedding ‘<math display="inline">\\H</math>’ is meant to learn context-dependent representation which usually is harder. However if we increase <math display="inline">\\H</math> and <math display="inline">\\E</math> together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is <math display="inline">\\V \cdot E</math> where <math display="inline">\\V</math> is usually quite large, <math display="inline">\\V</math> is the size of the vocabulary and equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size <math display="inline">\\E</math> and then project it to the hidden layer. This reduces embedding parameters from <math display="inline">\\O(V \cdot H)</math> to <math display="inline"> \\O(V \cdot E+E \cdot H) </math> which is significant when <math display="inline">\\H</math> is much larger than <math display="inline">\\E</math>.<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al, 2019). These observations made it possible to use the same weights at different layers.<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compared to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. As we can see the NSP cannot solve the SOP task (it performs at random 52%) but the SOP can solve the NSP task to a acceptable degree (78.9%). We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png | center |800px]]<br />
<br />
==Conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png]]<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results when they let the BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge starts to give the higher accuracy.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut, Zhenzhong. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43339ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-06T04:17:45Z<p>Mdadbin: /* Inter-sentence coherence loss */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed. For example, Chen et al. (2016)[1] uses an extra forward pass, but reduces the memory requirements in a technique called gradient checkpointing. Moreover, Gomez et al. (2017)[2] leverages a method to reconstruct a layer's activations from its next layer, in order to eliminate the need to store these activations, freeing up the memory. In addition, Raffel et al. (2019)[3], leverage model parallelization while training a massive model. The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">\\E</math>=<math display="inline">\\H</math> i.e. the size of the vocabulary embedding (<math display="inline">\\E</math>) and the size of the hidden layer (<math display="inline">\\H</math>) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘<math display="inline">\\E</math>’ is meant to learn context-independent representations while the hidden-layer embedding ‘<math display="inline">\\H</math>’ is meant to learn context-dependent representation which usually is harder. However if we increase <math display="inline">\\H</math> and <math display="inline">\\E</math> together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is <math display="inline">\\V \cdot E</math> where <math display="inline">\\V</math> is usually quite large, <math display="inline">\\V</math> is the size of the vocabulary and equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size <math display="inline">\\E</math> and then project it to the hidden layer. This reduces embedding parameters from <math display="inline">\\O(V \cdot H)</math> to <math display="inline"> \\O(V \cdot E+E \cdot H) </math> which is significant when <math display="inline">\\H</math> is much larger than <math display="inline">\\E</math>.<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al, 2019). These observations made it possible to use the same weights at different layers.<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compared to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. As we can see the NSP cannot solve the SOP task (it performs at random 52%) but the SOP can solve the NSP task to a acceptable degree (78.9%). We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png]]<br />
<br />
<br />
==Conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png]]<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results when they let the BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge starts to give the higher accuracy.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut, Zhenzhong. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43338ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-06T04:16:08Z<p>Mdadbin: /* Cross-layer parameter sharing */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed. For example, Chen et al. (2016)[1] uses an extra forward pass, but reduces the memory requirements in a technique called gradient checkpointing. Moreover, Gomez et al. (2017)[2] leverages a method to reconstruct a layer's activations from its next layer, in order to eliminate the need to store these activations, freeing up the memory. In addition, Raffel et al. (2019)[3], leverage model parallelization while training a massive model. The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">\\E</math>=<math display="inline">\\H</math> i.e. the size of the vocabulary embedding (<math display="inline">\\E</math>) and the size of the hidden layer (<math display="inline">\\H</math>) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘<math display="inline">\\E</math>’ is meant to learn context-independent representations while the hidden-layer embedding ‘<math display="inline">\\H</math>’ is meant to learn context-dependent representation which usually is harder. However if we increase <math display="inline">\\H</math> and <math display="inline">\\E</math> together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is <math display="inline">\\V \cdot E</math> where <math display="inline">\\V</math> is usually quite large, <math display="inline">\\V</math> is the size of the vocabulary and equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size <math display="inline">\\E</math> and then project it to the hidden layer. This reduces embedding parameters from <math display="inline">\\O(V \cdot H)</math> to <math display="inline"> \\O(V \cdot E+E \cdot H) </math> which is significant when <math display="inline">\\H</math> is much larger than <math display="inline">\\E</math>.<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al, 2019). These observations made it possible to use the same weights at different layers.<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compared to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. As we can see the NSP cannot solve the SOP task (it performs at random 52%) but the SOP can solve the NSP task to a acceptable degree (78.9%). We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png]]<br />
<br />
<br />
==Conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png]]<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results when they let the BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge starts to give the higher accuracy.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut, Zhenzhong. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=43337STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-06T04:14:54Z<p>Mdadbin: /* Comparison between ELMo, GPT, and BERT */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT is a step of progression for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right order. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction task to make the model understand the relationship between sentences. Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. <br />
<br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs. Note that among these three models, only BERT representations are jointly conditioned on both directions' context in all layers.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=43336STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-06T04:10:50Z<p>Mdadbin: /* Comparison between ELMo, GPT, and BERT */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT is a step of progression for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right order. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction task to make the model understand the relationship between sentences. Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. <br />
<br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
By looking at the above picture we can have a better understanding of the comparison between these three models. As mentioned above GPT is unidirectional which means the layers are not dense and only weights from left to right are present. BERT is bidirectional in the sense that both weight from left to right and from right to left are present (the layers are dense). ELMo is also bidirectional but not the same way as BERT. It actually uses a concatenation of independently trained left-to-right and right-to-left LSTMs.<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding&diff=43335STAT946F20/BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2020-11-06T04:01:46Z<p>Mdadbin: /* Comparison between ELMo, GPT, and BERT */</p>
<hr />
<div>== Presented by == <br />
Wenyu Shen<br />
<br />
== Introduction == <br />
This paper introduces the structure of the BERT model. The full name of the BERT model is Bidirectional Encoder Representations from Transformers, and this language model breaks records in eleven natural language process tasks. BERT is a step of progression for pre-training of contextual representations. One novel feature as compared to Word2Vec or GLoVE, is the ability for BERT to produce different representations for a unique word given different contexts. To elaborate Word2Vec would always create the same embedding for a given word regardless of the words that precede and proceed it. BERT however, will generate different embeddings based on what precedes and proceeds it. This can be useful as words can have homonyms, such as "bank" where it could refer to a "bank" as a "financial institution" or the "land alongside or sloping down to a river or lake".<br />
<br />
== Transformer and BERT == <br />
Let us start with the introduction of encoder and decoder. From the class, the encoder-decoder model is applied in the seq2seq question. For the sea2seq question, if we input a sequence x, then through performing the encoder-decoder model, we could generate another output sequence y based on x (like translation, questions with answer system). However, while using the RNN or other models as the basic architecture of encoder-decoder, the model might not have great performance while the input source is too long. Though we can use the encoder-decoder with attention which does not merge all the output into one context(layer), the paper Attention is All You Need introduce a framework and only use Attention in the encoder-decoder to do the machine translation task. The Transformer utilized the Scaled Dot-Product Attention and the sequential mask in the decoder and usually performs Multi-head attention to derive more features from the different subspace of sentence for the individual token. The transformer trained the positional encoding, which has the same dimension as the word embedding, to obtain the sequential information of the inputs. BERT is built by the N unit of the transformer encoder. <br />
<br />
[[File:Transformer Structure.png | center |800px]]<br />
<br />
<div align="center">Table 1: Transformer Structure </div><br />
<br />
== BERT ==<br />
BERT works well in both the Feature-based and the Fine-tuning approaches. Both Feature-based and Fine-tuning structures started with unsupervised learning from source A. While the Feature-based approach keeps the pre-trained parameters fixed while using the labeled source B to train the task-specific model and get the additional feature, the Fine-tuning approach tunes all parameters when training on the afterword task. This paper improves BERT based on the Fine-tuning approach. Original transformer learned from left to right order. Therefore, BERT used the MLM (masked language model) to pre-train deep bidirectional Transformers. In this pretraining method, some random tokens are masked each time and the model's objective is to find the vocabulary id of the masked token based on both its left and its right contexts. Also, BERT performs the Next Sentence Prediction task to make the model understand the relationship between sentences. Also, the Input/Output Representation created Token Embeddings, Segment Embeddings, and Position Embeddings to make BERT accomplish a variety of downstream tasks. Additionally, during this paper, the randomly selected tokens in MLM are not always utilized by mask to solve the unmatched issue while pre-training and fine-tuning models. <br />
<br />
[[File:Token embedding.png | center | 800px]]<br />
<br />
<div align="center">Table 2: Token embedding</div><br />
<br />
== Applications ==<br />
<br />
As previously mentioned BERT has achieved state-of-the-art performance in eleven NLP tasks. BERT can even be trained on different corpora/data as seen in figure 1 and then different pre-training and fine-tuning can be applied downstream, this landscape is surely not exhaustive. This aids in showing the wide range of applications BERT can be completely retrained for.<br />
<br />
[[File:application_landscape.png| center |1000px|Image: 1000 pixels]]<br />
<br />
<div align="center">Figure 1: Landscape of BERT Applications</div><br />
<br />
== Comparison between ELMo, GPT, and BERT ==<br />
In this section, we are going to compare BERT with previous language models, in particular, ELMo and GPT. These three models are among the biggest advancements in NLP. ELMo is a bi-directional LSTM model and is able to capture context information from both directions. It's a feature-based approach, which means the pre-trained representations are used as features. GPT and BERT are both transformer-based models. GPT only uses transformer decoders and is unidirectional. This means information only flows from the left to the right in GPT. In contrast, BERT only uses transformer encoders and is bidirectional. Therefore, it's able to capture more context information than GPT and tend to perform better when context information from both sides are important. GPT and BERT are fine-tuning-based approaches. Users can use the models on downstream tasks by simply fine-tuning model parameters.<br />
<br />
[[File:comparison_paper5.png | center |800px]]<br />
<br />
== Conclusion ==<br />
<br />
Consequently, BERT is a powerful pre-trained model in a large number of unsupervised resources and contributes when we want to perform NLP tasks with a low amount of obtained data.<br />
<br />
<br />
[[File:Result.png | center |800px]]<br />
<br />
<div align="center">Table 3: Performance of BERT in multiple datasets</div><br />
<br />
<br />
<br />
== References ==<br />
[1] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin.<br />
"Attention Is All You Need". (2017)<br />
<br />
[2] <br />
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language".(2019)</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:comparison_paper5.png&diff=43334File:comparison paper5.png2020-11-06T03:59:42Z<p>Mdadbin: </p>
<hr />
<div></div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43237ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-03T01:41:44Z<p>Mdadbin: /* Critiques */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed (look at Chen et al. (2016), Gomez et al. (2017), and also Raffel et al. (2019)). The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
<br />
<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have E=H i.e. the size of the vocabulary embedding (E) and the size of the hidden layer (H) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘E’ is meant to learn context-independent representations while the hidden-layer embedding ‘H’ is meant to learn context-dependent representation which usually is harder. However if we increase H and E together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is V×E where V is usually quite large (V is the size of the vocabulary and equals 30000 in both BERT and ALBERT). <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size E and then project it to the hidden layer. This reduces embedding parameters from O(V×H) to O(V×E+E×H) which is significant when H is much larger than E.<br />
<br />
<br />
<br />
<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png]]<br />
<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compered to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. AS we can see the NSP cannot solve the SOP task (it performs at random 52%) but the SOP can solve the NSP task to a acceptable degree (78.9%). We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png]]<br />
<br />
<br />
==conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png]]<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results when they let the BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge starts to give the higher accuracy.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43236ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-03T01:38:46Z<p>Mdadbin: /* Inter-sentence coherence loss */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed (look at Chen et al. (2016), Gomez et al. (2017), and also Raffel et al. (2019)). The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
<br />
<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have E=H i.e. the size of the vocabulary embedding (E) and the size of the hidden layer (H) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘E’ is meant to learn context-independent representations while the hidden-layer embedding ‘H’ is meant to learn context-dependent representation which usually is harder. However if we increase H and E together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is V×E where V is usually quite large (V is the size of the vocabulary and equals 30000 in both BERT and ALBERT). <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size E and then project it to the hidden layer. This reduces embedding parameters from O(V×H) to O(V×E+E×H) which is significant when H is much larger than E.<br />
<br />
<br />
<br />
<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png]]<br />
<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compered to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. AS we can see the NSP cannot solve the SOP task (it performs at random 52%) but the SOP can solve the NSP task to a acceptable degree (78.9%). We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png]]<br />
<br />
<br />
==conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png]]<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results if let BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge start to do the give the higher accuracy.</div>Mdadbinhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=43235ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-03T01:37:07Z<p>Mdadbin: </p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model and the result is ALBERT, a model that out-performs BERT on GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer number of parameters than BERT-large, but still it gets better results. The above mentioned changes are Factorized embedding parameterization and Cross-layer parameter sharing which are two methods of parameter reduction. They also introduced a new loss function and replaced it with one of the loss functions being used in BERT (i.e. NSP). The last change is removing dropout from the model.<br />
<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption but at the cost of speed (look at Chen et al. (2016), Gomez et al. (2017), and also Raffel et al. (2019)). The authors of this paper claim that there parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
<br />
<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT i.e. it uses transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to be 4*H and the number of attention heads to be H/64 where H is the size of the hidden layer. Next we explain the changes the have been applied to the BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have E=H i.e. the size of the vocabulary embedding (E) and the size of the hidden layer (H) are tied together. This is not an efficient choice because we may need to have a large hidden layer but not a large vocabulary embedding layer. This is actually the case in many applications because the vocabulary embedding ‘E’ is meant to learn context-independent representations while the hidden-layer embedding ‘H’ is meant to learn context-dependent representation which usually is harder. However if we increase H and E together, it will result in a huge increase in the number of parameters because the size of the vocabulary embedding matrix is V×E where V is usually quite large (V is the size of the vocabulary and equals 30000 in both BERT and ALBERT). <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into hidden space, instead first project one-hot vectors into a lower dimensional space of size E and then project it to the hidden layer. This reduces embedding parameters from O(V×H) to O(V×E+E×H) which is significant when H is much larger than E.<br />
<br />
<br />
<br />
<br />
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used for reducing the number of parameters is to share the parameters across layers. There are different strategies for parameter sharing for example one may only share feed-forward network parameters or only sharing attention parameters. However, the default choice for ALBERT is to simply share all parameters across layers.<br />
The following table shows the effect of different parameter sharing strategies in two setting for the vocabulary embedding size. As we can see in both cases, sharing all the parameters have a negative effect on the accuracy where most of this effect comes from sharing the FFN parameters not from sharing attention parameters. Given this, the authors have decided to share all the parameters across the layers which result in much smaller number of parameters which in turn enable them to have larger hidden layers and this is how they compensate what they have lost from parameter sharing. <br />
<br />
[[File:sharing.png]]<br />
<br />
<br />
===Inter-sentence coherence loss===<br />
<br />
The BERT uses two loss functions namely Masked language modeling (MLM) loss and Next-sentence prediction (NSP) loss. The NSP is a binary classification loss were positive examples are two consecutive segments from the training corpus and negative examples are pairing segments from different documents. The negative and positive examples are sampled with equal probability. However experiments show that NSP is not effective. The authors explained the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compered to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal.<br />
They tried to solve this problem by introducing a new loss namely sentence order prediction (SOP) which is again A binary classification loss. Positive examples are the same as in NSP (two consecutive segments). But the negative examples are the same two consecutive segments with their order swapped. The SOP forces the model to learn the harder coherence prediction task. The following table compare NSP with SOP. AS we can see the NSP cannot solve the SOP task (it performs at random) but the SOP can solve the NSP task to a acceptable degree. We also see that on average the SOP improves results on downstream tasks by almost 1%. Therefore, they decided to use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png]]<br />
<br />
===Removing dropout===<br />
The last change the authors applied to the BERT is that they removed the dropout. This decision is supported by the following table. They also have an observation that even after 1M step of training the model is not overfitting to the data.<br />
[[File:dropout.png]]<br />
<br />
<br />
==conclusion==<br />
By looking at the following table we can see that ALBERT-xxlarge outperforms the BERT-large on all of the downstream tasks. Note that the ALBERT-xxlarge uses a larger configuration (yet fewer number of parameters) than BERT-large and as a result it is about 3 times slower.<br />
<br />
[[File:result.png]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of same number of steps. Here is the results:<br />
[[File:sameTime.png]]<br />
However, in my opinion, this is not a fair comparison to let the ALBERT-xxlarge to train for 125K step and say that the BERT-large will be trained for 400K steps in the same amount of time because after some number of training steps, additional steps will not improve the result by that much. It would be better if we could also look at the results if let BERT-large to be trained for 125K step and the ALBERT-xxlarge to be trained the same amount of time. I guess in that case the result was in favor of the BERT-large. Actually it would be nice if we could have a plot with the time on the horizontal and the accuracy on the vertical axis. Then we would probably see that at first the BERT-large is better but at some time point afterwards the ALBERT-xxlarge start to do the give the higher accuracy.</div>Mdadbin