http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Jlavilez&feedformat=atomstatwiki - User contributions [US]2023-02-01T11:40:40ZUser contributionsMediaWiki 1.28.3http://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation&diff=49635One-Shot Object Detection with Co-Attention and Co-Excitation2020-12-07T01:34:38Z<p>Jlavilez: </p>
<hr />
<div>== Presented By ==<br />
Gautam Bathla<br />
<br />
== Background ==<br />
<br />
Object detection is a technique in which a model takes an image as input and outputs the class and location of every object present in the image. In the one-shot setting considered here, the aim is to take a query image patch whose class label is not included in the training data and detect all instances of the same class in a target image.<br />
<br />
[[File:object_detection.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Object Detection on an image</div><br />
<br />
Figure 1 shows an example where the model identifies and locates all the instances of different objects present in the image successfully. It encloses each object within a bounding box and annotates each box with the class of the object present inside the box.<br />
<br />
State-of-the-art object detectors are trained on thousands of images for different classes before the model can accurately predict the class and spatial location for unseen images belonging to the classes the model has been trained on. When a model is trained with K labeled instances for each of N classes, then this setting is known as N-way K-shot classification. K = 0 for zero-shot learning, K = 1 for one-shot learning and K > 1 for few-shot learning.<br />
<br />
== Introduction ==<br />
<br />
This paper tackles the problem of one-shot object detection: given a query image ''p'', the model needs to find all instances of the query object in the target image. The target and query images do not need to be exactly the same and may vary as long as they share enough attributes to belong to the same category. The authors make three technical contributions. The first is the use of non-local operations to generate better region proposals for the target image based on the query image; this operation can be thought of as a co-attention mechanism. The second is a Squeeze and Co-Excitation mechanism that identifies and emphasizes the relevant features, which helps select the relevant proposals and hence the instances in the target image. The third is a margin-based ranking loss for predicting the similarity of region proposals to the given query image, irrespective of whether the class label was seen or unseen during training.<br />
<br />
== Previous Work ==<br />
<br />
All state-of-the-art object detectors are variants of deep convolutional neural networks. There are two types of object detectors:<br />
<br />
1) Two-Stage Object Detectors: These detectors generate region proposals in the first stage, then classify and refine the proposals in the second stage, e.g. Faster R-CNN [1].<br />
<br />
2) One-Stage Object Detectors: These detectors directly predict bounding boxes and their corresponding labels based on a fixed set of anchors, e.g. CornerNet [2].<br />
<br />
The work done to tackle the problem of few-shot object detection is based on transfer learning [3], meta-learning [4], and metric-learning.<br />
<br />
1) Transfer Learning: Chen et al. [3] proposed a regularization technique to reduce overfitting when the model is trained on just a few instances for each class belonging to unseen classes.<br />
<br />
2) Meta-Learning: Kang et al. [4] trained a meta-model to re-weight the learned weights of an image extracted from the base model.<br />
<br />
3) Metric-Learning: These frameworks replace the conventional classifier layer with the metric-based classifier layer.<br />
<br />
== Approach ==<br />
<br />
Let <math> C </math> be the set of classes for this object detection task. Since one-shot object detection requires classes at inference time that were unseen during training, we divide the set of classes into two disjoint categories as follows:<br />
<br />
<div style="text-align: center;"><math> C = C_0 \bigsqcup C_1,</math></div><br />
<br />
where <math>C_0</math> represents the classes that the model is trained on and <math>C_1</math> represents the classes on which the inference is done.<br />
<br />
[[File:architecture_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Architecture</div><br />
<br />
Figure 2 shows the architecture of the model proposed in this paper. The architecture is based on Faster R-CNN [1], with ResNet-50 [5] as the backbone for extracting features from the images. The target image and the query image are first passed through the ResNet-50 module, and features are extracted from the same convolutional layer for both. These features are then passed into the Non-local block, which outputs weighted features for each image. The weighted features of both images are passed through the Squeeze and Co-Excitation block, which outputs re-weighted features that serve as input to the Region Proposal Network (RPN) module. The R-CNN module also includes a new loss designed by the authors to rank proposals in order of their relevance.<br />
<br />
==== Non-Local Object Proposals ====<br />
<br />
The need for non-local object proposals arises because the RPN module in Faster R-CNN [1] has access to bounding-box information for every class in the training dataset; the classes used for training and inference in Faster R-CNN [1] are not mutually exclusive. In this problem, as defined above, the classes are divided into two parts, one used for training and the other used during inference, so the classes in the two sets are exclusive. If the conventional RPN module were used, it would not generate good proposals at inference time because it would have no bounding-box information for those classes.<br />
<br />
To resolve this problem, a non-local operation is applied to both sets of features. This non-local operation is defined as:<br />
\begin{align}<br />
y_i = \frac{1}{C(z)} \sum_{\forall j}^{} f(x_i, z_j)g(z_j) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
where ''x'' is a vector on which this operation is applied, ''z'' is a vector which is taken as an input reference, ''i'' is the index of output position, ''j'' is the index that enumerates over all possible positions, ''C(z)'' is a normalization factor, <math>f(x_i, z_j)</math> is a pairwise function like Gaussian, Dot product, concatenation, etc., <math>g(z_j)</math> is a linear function of the form <math>W_z \times z_j</math>, and ''y'' is the output of this operation.<br />
<br />
Let the feature maps obtained from the ResNet-50 model be <math> \phi{(I)} \in R^{N \times W_I \times H_I} </math> for target image ''I'' and <math> \phi{(p)} \in R^{N \times W_p \times H_p} </math> for query image ''p''. Taking <math> \phi{(p)} </math> as the input reference, the non-local operation is applied to <math> \phi{(I)} </math> and results in a non-local block, <math> \psi{(I;p)} \in R^{N \times W_I \times H_I} </math> . Analogously, we can derive the non-local block <math> \psi{(p;I)} \in R^{N \times W_p \times H_p} </math> using <math> \phi{(I)} </math> as the input reference. <br />
<br />
We can express the extended feature maps as:<br />
<br />
\begin{align}<br />
{F(I) = \phi{(I)} \oplus \psi{(I;p)} \in R^{N \times W_I \times H_I}} \quad;\quad {F(p) = \phi{(p)} \oplus \psi{(p;I)} \in R^{N \times W_p \times H_p}} \tag{2} \label{eq:o1}<br />
\end{align}<br />
<br />
where ''F(I)'' denotes the extended feature map for target image ''I'', ''F(p)'' denotes the extended feature map for query image ''p'' and <math>\oplus</math> denotes element-wise sum over the feature maps <math>\phi{}</math> and <math>\psi{}</math>.<br />
<br />
As can be seen above, the extended feature map for the target image ''I'' contains not only features from ''I'' but also a weighted sum of query-image features, where the weights depend on both the target and the query image. The same holds, with the roles reversed, for the query image. This weighted sum acts as a co-attention mechanism, and feeding the extended feature maps to the RPN module yields better proposals.<br />
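<br />
As a rough illustration of equations (1), (2) and (7), the following NumPy sketch applies the non-local operation with a dot-product pairwise function in both directions and adds the result to the original features. The function names, the flattening of spatial positions into columns, the projection size <code>D</code>, and the sharing of one set of weights between both directions are assumptions made for this sketch, not details taken from the paper or its released code.<br />

```python
import numpy as np


def non_local(x, z, w_alpha, w_beta, w_g):
    """Non-local operation of Eq. (1) with a dot-product pairwise function.

    x: (N, Px) features to update, one column per spatial position.
    z: (N, Pz) reference features.
    Returns y with shape (N, Px).
    """
    alpha = w_alpha @ x            # (D, Px)
    beta = w_beta @ z              # (D, Pz)
    f = alpha.T @ beta             # (Px, Pz): f[i, j] = alpha(x_i)^T beta(z_j)
    g = w_g @ z                    # (N, Pz): linear embedding of the reference
    return (g @ f.T) / z.shape[1]  # C(z) = number of positions in z


def co_attention(phi_I, phi_p, params):
    """Extended feature maps F(I) and F(p) of Eq. (2)."""
    n = phi_I.shape[0]
    flat_I = phi_I.reshape(n, -1)
    flat_p = phi_p.reshape(n, -1)
    psi_I = non_local(flat_I, flat_p, *params).reshape(phi_I.shape)
    psi_p = non_local(flat_p, flat_I, *params).reshape(phi_p.shape)
    return phi_I + psi_I, phi_p + psi_p  # element-wise sum


rng = np.random.default_rng(0)
N, D = 8, 4  # channels and projection size (illustrative values)
params = (rng.normal(size=(D, N)), rng.normal(size=(D, N)), rng.normal(size=(N, N)))
phi_I = rng.normal(size=(N, 6, 5))  # stand-in for target features
phi_p = rng.normal(size=(N, 3, 3))  # stand-in for query features
F_I, F_p = co_attention(phi_I, phi_p, params)
print(F_I.shape, F_p.shape)
```

Both outputs keep the spatial size of their own input, which is why the element-wise sum in equation (2) is well defined even though the target and query feature maps differ in size.<br />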
<br />
==== Squeeze and Co-Excitation ====<br />
<br />
The two feature maps generated by the non-local block can be further related by identifying the important channels and re-weighting the channels accordingly, which is the basic purpose of this module. The Squeeze layer summarizes each feature map by applying Global Average Pooling (GAP) to the extended feature map of the query image. The Co-Excitation layer gives attention to the feature channels that are important for evaluating the similarity metric. The whole block can be represented as:<br />
<br />
\begin{align}<br />
SCE(F(I), F(p)) = w \quad;\quad F(\tilde{p}) = w \odot F(p) \quad;\quad F(\tilde{I}) = w \odot F(I)\tag{3} \label{eq:op2}<br />
\end{align}<br />
<br />
where ''w'' is the excitation vector, and <math>F(\tilde{p})</math> and <math>F(\tilde{I})</math> are the re-weighted feature maps for the query and target image respectively.<br />
<br />
In between the Squeeze layer and Co-Excitation layer, there exist two fully-connected layers followed by a sigmoid layer which helps to learn the excitation vector ''w''. The ''Channel Attention'' module in the architecture is basically these fully-connected layers followed by a sigmoid layer.<br />
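<br />
The block described above can be sketched in a few lines of NumPy: global average pooling on the query features, two fully-connected layers, a sigmoid, and a channel-wise product applied to both feature maps. The ReLU between the two fully-connected layers, the hidden size <code>R</code>, and all names are assumptions of this sketch; the summary above only specifies two fully-connected layers followed by a sigmoid.<br />

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def squeeze_co_excitation(F_I, F_p, w1, b1, w2, b2):
    """Return the excitation vector w and the re-weighted feature maps.

    F_I: (N, W_I, H_I) extended target features.
    F_p: (N, W_p, H_p) extended query features.
    """
    squeezed = F_p.mean(axis=(1, 2))            # squeeze: GAP over the query, (N,)
    hidden = np.maximum(w1 @ squeezed + b1, 0)  # first FC layer (ReLU is assumed)
    w = sigmoid(w2 @ hidden + b2)               # second FC layer + sigmoid, (N,)
    scale = w[:, None, None]                    # broadcast over spatial positions
    return w, scale * F_I, scale * F_p          # co-excitation of both maps


rng = np.random.default_rng(0)
N, R = 8, 4  # channels and hidden size (illustrative values)
F_I = rng.normal(size=(N, 6, 5))
F_p = rng.normal(size=(N, 3, 3))
w1, b1 = rng.normal(size=(R, N)), np.zeros(R)
w2, b2 = rng.normal(size=(N, R)), np.zeros(N)
w, F_I_tilde, F_p_tilde = squeeze_co_excitation(F_I, F_p, w1, b1, w2, b2)
print(w.shape, F_I_tilde.shape, F_p_tilde.shape)
```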
<br />
==== Margin-based Ranking Loss ====<br />
<br />
The authors have defined a two-layer MLP network ending with a softmax layer to learn a similarity metric which will help rank the proposals generated by the RPN module. In the first stage of training, each proposal is annotated with 0 or 1 based on the IoU value of the proposal with the ground-truth bounding box. If the IoU value is greater than 0.5 then that proposal is labeled as 1 (foreground) and 0 (background) otherwise.<br />
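<br />
The foreground/background annotation step can be illustrated with a small helper. The box format <code>(x1, y1, x2, y2)</code> and the function names are assumptions made for this sketch; only the 0.5 IoU threshold comes from the description above.<br />

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_box, threshold=0.5):
    """Label a proposal 1 (foreground) if its IoU with gt_box exceeds the threshold, else 0."""
    return [1 if iou(p, gt_box) > threshold else 0 for p in proposals]


gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10), (5, 5, 15, 15), (0, 0, 10, 8)]
print(label_proposals(proposals, gt))  # [1, 0, 1]
```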
<br />
Let ''q'' be the feature vector obtained after applying GAP to the query image patch obtained from the Squeeze and Co-Excitation block and ''r'' be the feature vector obtained after applying GAP to the region proposals generated by the RPN module. The two vectors are concatenated to form a new vector ''x'' which is the input to the two-layer MLP network designed. We can define ''x = [<math>r^T;q^T</math>]''. Let ''M'' be the model representing the two-layer MLP network, then <math>s_i = M(x_i)</math>, where <math>s_i</math> is the probability of <math>i^{th}</math> proposal being a foreground proposal based on the query image patch ''q''.<br />
<br />
The margin-based ranking loss is given by:<br />
<br />
\begin{align}<br />
L_{MR}(\{x_i\}) = \sum_{i=1}^{K}y_i \times max\{m^+ - s_i, 0\} + (1-y_i) \times max\{s_i - m^-, 0\} + \delta_{i} \tag{4} \label{eq:op3}<br />
\end{align}<br />
\begin{align}<br />
\delta_{i} = \sum_{j=i+1}^{K}[y_i = y_j] \times max\{|s_i - s_j| - m^-, 0\} + [y_i \ne y_j] \times max\{m^+ - |s_i - s_j|, 0\} \tag{5} \label{eq:op4}<br />
\end{align}<br />
<br />
where ''[.]'' is the Iverson bracket, i.e. the output will be 1 if the condition inside the bracket is true and 0 otherwise, <math>m^+</math> is the expected lower bound probability for predicting a foreground proposal, <math>m^-</math> is the expected upper bound probability for predicting a background proposal and <math>K</math> is the number of candidate proposals from RPN.<br />
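<br />
Equations (4) and (5) can be checked numerically with a direct NumPy transcription. The function name and the vectorized form are choices made for this sketch; the default margins 0.7 and 0.3 are the values this summary reports for the paper.<br />

```python
import numpy as np


def margin_ranking_loss(s, y, m_pos=0.7, m_neg=0.3):
    """Margin-based ranking loss of Eq. (4)-(5).

    s: predicted foreground probabilities of the K proposals.
    y: binary labels (1 = foreground, 0 = background).
    """
    s = np.asarray(s, dtype=float)
    y = np.asarray(y, dtype=int)
    # Per-proposal terms: foreground scores should exceed m_pos,
    # background scores should stay below m_neg.
    unary = y * np.maximum(m_pos - s, 0) + (1 - y) * np.maximum(s - m_neg, 0)
    # Pairwise terms: same-label scores should differ by at most m_neg,
    # different-label scores by at least m_pos.
    diff = np.abs(s[:, None] - s[None, :])
    same = y[:, None] == y[None, :]
    pair = np.where(same, np.maximum(diff - m_neg, 0), np.maximum(m_pos - diff, 0))
    delta = np.triu(pair, k=1).sum()  # only pairs with j > i, as in Eq. (5)
    return float(unary.sum() + delta)


print(margin_ranking_loss([0.9, 0.1], [1, 0]))  # well-separated scores
print(margin_ranking_loss([0.5, 0.5], [1, 0]))  # ambiguous scores are penalized
```

In the first call every margin is satisfied, so the loss is zero; in the second, both the per-proposal terms and the pairwise term contribute.<br />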
<br />
The total loss for the model is given as:<br />
<br />
\begin{align}<br />
L = L_{CE} + L_{Reg} + \lambda \times L_{MR} \tag{6} \label{eq:op5}<br />
\end{align}<br />
<br />
where <math>L_{CE}</math> is the cross-entropy loss, <math>L_{Reg}</math> is the regression loss for bounding boxes of Faster R-CNN [1] and <math>L_{MR}</math> is the margin-based ranking loss defined above.<br />
<br />
For this paper, <math>m^+</math> = 0.7, <math>m^-</math> = 0.3, <math>\lambda</math> = 3, K = 128, C(z) in \eqref{eq:op} is the total number of elements in a single feature map of vector ''z'', and <math>f(x_i, z_j)</math> in \eqref{eq:op} is a dot product operation.<br />
\begin{align}<br />
f(x_i, z_j) = \alpha(x_i)^T \beta(z_j)\quad;\quad\alpha(x_i) = W_{\alpha} x_i \quad;\quad \beta(z_j) = W_{\beta} z_j \tag{7} \label{eq:op6}<br />
\end{align}<br />
<br />
== Results ==<br />
<br />
The model is trained and tested on two popular datasets, VOC and COCO. The ResNet-50 backbone was pre-trained on a reduced ImageNet dataset, obtained by removing all the classes present in the COCO dataset, thus ensuring that the model has not seen any of the classes belonging to the inference images.<br />
<br />
==== Results on VOC Dataset ====<br />
<br />
[[File: voc_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 1:''' Results on VOC dataset</div><br />
<br />
For the VOC dataset, the model is trained on the union of VOC 2007 train and validation sets and VOC 2012 train and validation sets, whereas the model is tested on VOC 2007 test set. From the VOC results (Table 1), it can be seen that the model with pre-trained ResNet-50 on a reduced training set as the CNN backbone (Ours(725)) achieves better performance on seen and unseen classes than the baseline models. When the pre-trained ResNet-50 on the full training set (Ours(1K)) is used as the CNN backbone, then the performance of the model is increased significantly.<br />
<br />
==== Results on MSCOCO Dataset ====<br />
<br />
[[File: mscoco_splits.png|750px|center|Image: 500 pixels]]<br />
[[File: mscoco_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' Results on COCO dataset</div><br />
<br />
The model is trained on the COCO train2017 set and evaluated on the COCO val2017 set. The classes are divided into four groups; the model is trained with images belonging to three splits and evaluated on images belonging to the fourth split. From Table 2, it can be seen that the model achieves better accuracy than the baseline model. The bar chart in the split figure shows the performance of the model on each class separately. The model has difficulty with classes such as book (split2), handbag (split3), and tie (split4) because of variations in their shape and texture.<br />
<br />
==== Overall Performance ====<br />
For VOC, the model that uses the reduced ImageNet model backbone with 725 classes achieves a better performance on both the seen and unseen classes. Remarkable improvements in the performance are seen with the backbone with 1000 classes. For COCO, the model achieves better accuracy than the Siamese Mask-RCNN model for both the seen and unseen classes.<br />
<br />
== Ablation Studies ==<br />
<br />
==== Effect of all the proposed techniques on the final result ====<br />
<br />
[[File: one_shot_detector_results.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 3:''' Effect of all three techniques combined</div><br />
<br />
Figure 3 shows the effect of the three proposed techniques on the evaluation metric. The model performs worst when neither Co-attention nor Co-excitation mechanism is used. But, when either Co-attention or Co-excitation is used then the performance of the model is improved significantly. The model performs best when all the three proposed techniques are used.<br />
<br />
<br />
In order to understand the effect of the proposed modules, the authors analyzed each module separately.<br />
<br />
==== Visualizing the effect of Non-local RPN ====<br />
<br />
To demonstrate the effect of Non-local RPN, a heatmap of generated proposals is constructed. Each pixel is assigned the count of how many proposals cover that particular pixel and the counts are then normalized to generate a probability map.<br />
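<br />
The heatmap construction can be sketched as follows. Integer pixel boxes in <code>(x1, y1, x2, y2)</code> form with exclusive upper bounds, and normalization by the total count so that the map sums to one, are assumptions of this sketch; the summary only states that the counts are normalized into a probability map.<br />

```python
import numpy as np


def proposal_heatmap(proposals, height, width):
    """Count how many proposals cover each pixel, then normalize the counts."""
    counts = np.zeros((height, width), dtype=float)
    for x1, y1, x2, y2 in proposals:
        counts[y1:y2, x1:x2] += 1  # every pixel inside the box gets one vote
    total = counts.sum()
    return counts / total if total > 0 else counts


heat = proposal_heatmap([(0, 0, 2, 2), (1, 1, 3, 3)], height=4, width=4)
print(heat)
```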
<br />
[[File: one_shot_non_local_rpn.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 4:''' Visualization of Non-local RPN</div><br />
<br />
From Figure 4, it can be seen that when a non-local RPN is used instead of a conventional RPN, the model is able to give more attention to the relevant region in the target image.<br />
<br />
==== Analyzing and Visualizing the effect of Co-Excitation ====<br />
<br />
To visualize the effect of excitation vector ''w'', the vector is calculated for all images in the inference set which are then averaged over images belonging to the same class, and a pair-wise Euclidean distance between classes is calculated.<br />
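<br />
A minimal sketch of this analysis, assuming the excitation vectors have already been collected into an array with one row per image (the function name and the toy data are hypothetical):<br />

```python
import numpy as np


def class_distance_matrix(w_vectors, labels):
    """Average excitation vectors per class and return pairwise Euclidean distances."""
    w_vectors = np.asarray(w_vectors, dtype=float)
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    means = np.stack([w_vectors[labels == c].mean(axis=0) for c in classes])
    diff = means[:, None, :] - means[None, :, :]
    return classes, np.sqrt((diff ** 2).sum(axis=-1))


# Toy excitation vectors with two channels (made-up numbers).
w_vectors = [[0.0, 0.0], [2.0, 0.0], [3.0, 4.0]]
labels = ["cat", "cat", "dog"]
classes, dist = class_distance_matrix(w_vectors, labels)
print(classes)
print(dist)
```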
<br />
[[File: one_shot_excitation.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 5:''' Visualization of Co-Excitation</div><br />
<br />
From Figure 5, it can be observed that the Co-Excitation mechanism is able to assign meaningful weight distribution to each class. The weights for classes related to animals are closer to each other and the ''person'' class is not close to any other class because of the absence of common attributes between ''person'' and any other class in the dataset.<br />
<br />
[[File: analyzing_co_excitation_1.png|Analyzing Co-Excitation|500px|left|bottom|Image: 500 pixels]]<br />
<br />
[[File: analyzing_co_excitation_2.png|Analyzing Co-Excitation|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 6:''' Analyzing Co-Excitation</div><br />
<br />
To analyze the effect of Co-Excitation, the authors used two different scenarios. In the first scenario (Figure 6, left), the same target image is used with different query images. Query images <math>p_1</math> and <math>p_2</math> have a similar color to the target image, whereas query images <math>p_3</math> and <math>p_4</math> contain an object of a different color from the target image. When the pair-wise Euclidean distances between the excitation vectors of the four cases are calculated, <math>w_2</math> is closer to <math>w_1</math> than to <math>w_4</math>, and <math>w_3</math> is closer to <math>w_4</math> than to <math>w_1</math>. Therefore, it can be concluded that <math>w_1</math> and <math>w_2</math> give more importance to the texture of the object whereas <math>w_3</math> and <math>w_4</math> give more importance to channels representing the shape of the object.<br />
<br />
The same observation can be made in scenario 2 (Figure 6, right), where the same query image is used with different target images. <math>w_1</math> and <math>w_2</math> are closer to <math>w_a</math> than to <math>w_b</math>, whereas <math>w_3</math> and <math>w_4</math> are closer to <math>w_b</math> than to <math>w_a</math>. Since images <math>I_1</math> and <math>I_2</math> contain an object of similar color to the query image, we can say that <math>w_1</math> and <math>w_2</math> give more weight to the channels representing the texture of the object, and <math>w_3</math> and <math>w_4</math> give more weight to the channels representing shape.<br />
<br />
== Conclusion ==<br />
<br />
The resulting one-shot object detector outperforms all the baseline models on VOC and COCO datasets. The authors have also provided insights about how the non-local proposals, serving as a co-attention mechanism, can generate relevant region proposals in the target image and put emphasis on the important features shared by both target and query image.<br />
<br />
== Related Work ==<br />
<br />
'''Object detection''' SOTA object detectors are mostly deep convolutional neural networks. A popular pipeline is a two-stage approach, where detectors first generate a set of region proposals and then classify the proposals. The latest version is called Faster R-CNN [6], which works by replacing the grouping-based proposals originally found in R-CNN with a regional proposal network.<br />
<br />
'''Few-shot classification via metric learning''' The aim of this is to derive a similarity metric that can be used to infer unseen classes. One approach is to use Siamese networks, which learn a general similarity metric from using paired training data to decide whether the pair belongs to the same class. Then, during inference, the network matches unlabelled observations with a one-shot support set, where classification is done by asking which observed class is most similar [7].<br />
<br />
'''Few-shot object detection''' The problem of detection, like classification, can be assessed in a few-shot setting. However, the problem is quite novel so only preliminary results from transfer learning, meta learning, and metric learning exist.<br />
<br />
== Critiques ==<br />
<br />
The techniques proposed by the authors improve the performance of the model significantly: as shown in the ablation study, when either Co-attention or Co-excitation is used together with the margin-based ranking loss, the model can detect instances of the query object in the target image. Also, the trained model is generic and does not require any further training or fine-tuning to detect unseen classes in the target image. The proposed loss also makes learning less dependent on image labels alone, since each proposal is annotated as foreground or background and this annotation is used to calculate the loss.<br />
Since the method uses several deep neural networks inside the main architecture, one natural critique is how time-consuming the proposed model is. The paper could have discussed the computational cost of the method more thoroughly.<br />
<br />
== Source Code==<br />
[https://github.com/timy90022/One-Shot-Object-Detection link One-Shot-Object-Detection]<br />
<br />
== References ==<br />
<br />
[1] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.<br />
<br />
[2] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pages 765–781, 2018<br />
<br />
[3] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2836–2843, 2018.<br />
<br />
[4] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. CoRR, abs/1812.01866, 2018.<br />
<br />
[5] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.<br />
<br />
[6] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.<br />
<br />
[7] Gregory R. Koch. Siamese neural networks for one-shot image recognition. 2015.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION&diff=49626CRITICAL ANALYSIS OF SELF-SUPERVISION2020-12-07T01:01:51Z<p>Jlavilez: /* Method & Experiment */</p>
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluates the performance of state-of-the-art self-supervised methods for learning the weights of convolutional neural networks (CNNs), on a per-layer basis. The authors were motivated by the fact that low-level features in the first layers of networks may not require the high-level semantic information captured by manual labels. The paper also aims to find out whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, ground-truth labels are generated from the unlabeled data themselves through pretext tasks such as the Jigsaw puzzle task [6] and rotation estimation [3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by 90 degrees clockwise and the CNN is trained to predict the rotation that was applied, as can be seen in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods. <br />
<br />
* Generative models: Generative Adversarial Networks (GANs) learn to generate images in an adversarial manner. They consist of a generator network, which maps noise samples to image samples, and a discriminator network, whose task is to distinguish the fake images from the real ones. The two are trained together until the fake images are indistinguishable from real ones. BiGAN [2], or Bidirectional GAN, is simply a generative adversarial network plus an encoder. The generator maps latent samples to generated data and the encoder performs the inverse mapping. After training BiGAN, the encoder has learned to generate a rich image representation. <br />
* In the RotNet method [3], images are rotated and the CNN learns to predict the rotation that was applied. Since the rotations are multiples of 90 degrees, this is a 4-way classification task. Most images are taken upright, so they can be treated as labeled images with label 0 degrees. The authors of RotNet argue that the concept of 'upright' is hard to understand and requires high-level knowledge about the image, so this task encourages the network to discover more complex information about the images. <br />
* DeepCluster [4] alternates between k-means clustering step, in which pseudo-labels are assigned to the data by k-means on the PCA-reduced features, and the learning step in which the model tries to learn to fit the representation to these labels(cluster IDs) under several image transformations. These transformations include random resized crops with <math> \beta = 0.08 </math> and <math> \gamma = \frac{3}{4}</math> and horizontal flips.<br />
<br />
* In Jigsaw task [6], the unlabelled images are divided into nine patches and then, the patches are permuted randomly to create a new image. Then, a deep neural network is trained to predict the permutation of patches in the perturbed image.<br />
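<br />
To make the rotation pretext task from the list above concrete, the sketch below builds the four rotated copies of an image together with their pseudo-labels. The function name is hypothetical, and note that <code>np.rot90</code> rotates counterclockwise, so label ''k'' here means a counterclockwise rotation by ''k'' times 90 degrees.<br />

```python
import numpy as np


def rotation_batch(image):
    """Return four rotated copies of an (H, W, C) image and their class labels 0..3."""
    rotated = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    labels = list(range(4))  # label k <-> rotation by k * 90 degrees
    return rotated, labels


image = np.arange(6).reshape(2, 3, 1)  # tiny stand-in for an unlabeled image
views, labels = rotation_batch(image)
for view, label in zip(views, labels):
    print(label, view.shape)
```

No manual annotation is needed: the labels come entirely from the transformation that was applied, which is what makes the task self-supervised.<br />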
<br />
Following is the work done in the domain of learning from a single image:<br />
<br />
* Rodriguez et al. [7] used max-margin correlation filters to learn robust tracking templates from a single sample of the patch.<br />
* Malisiewicz et al. [8] used a semi-parametric exemplar SVM model where the model uses one positive sample and separates it from thousands of negative samples mined from the background.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed for training AlexNet in a self-supervised manner. The authors use ResNet-50 to compute the image and the transpose of this image. The method is evaluated on multiple datasets, and the tasks mainly focus on object detection and image classification. Jigsaw ResNet-50, introduced by Priya Goyal, was utilized as a baseline of the experiment. <br />
<br />
To evaluate the impact of the size of the training set, they compared the results of a million images in the ImageNet dataset with a million augmented images generated from a single image. Various data augmentation methods, including cropping, rotation, scaling, contrast changes, and adding noise, were used to generate this artificial dataset from one image. Augmentation can be seen as imposing a prior on what we expect the manifold of natural images to look like. When training with very few images, these priors become more important since the model cannot extract them directly from data.<br />
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that the main purpose of CNNs is to reach a linearly discriminable representation for images. Accordingly, the linear probing technique aims to evaluate the training of each layer of a CNN and inspect how much information each of the layers learned.<br />
The same experiment has been done using the CIFAR10/100 dataset.<br />
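<br />
The linear probing protocol can be illustrated with a softmax classifier trained on frozen features. In this sketch the "features" are synthetic Gaussian clusters standing in for the activations of one frozen layer, and the optimizer settings and names are assumptions; it shows the idea of probing, not the evaluation setup of the paper.<br />

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit a softmax linear classifier on fixed features with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # gradient of mean cross-entropy
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def probe_accuracy(features, labels, W, b):
    return float(((features @ W + b).argmax(axis=1) == labels).mean())


rng = np.random.default_rng(0)


def fake_layer_features(n_per_class):
    """Synthetic stand-in for frozen activations of one layer (two classes)."""
    x0 = rng.normal(loc=-2.0, size=(n_per_class, 4))
    x1 = rng.normal(loc=2.0, size=(n_per_class, 4))
    return np.vstack([x0, x1]), np.array([0] * n_per_class + [1] * n_per_class)


train_x, train_y = fake_layer_features(100)
test_x, test_y = fake_layer_features(100)
W, b = train_linear_probe(train_x, train_y, num_classes=2)
accuracy = probe_accuracy(test_x, test_y, W, b)
print(accuracy)
```

Only <code>W</code> and <code>b</code> are updated; the features themselves never change, which is what lets the probe accuracy be read as a measure of how linearly separable a given layer's representation is.<br />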
<br />
=== Choice of augmentations ===<br />
<br />
Here we describe how <math>N</math> source images get expanded to an additional <math>d-N</math> images, where <math>d</math> is much larger than and independent of <math>N</math>. <br />
<br />
Given a source image of size <math>H \times W</math>, extract random patches of size <math>(w,h)</math>. Set <math>\beta , \gamma </math> such that <math>\beta \leq \frac{wh}{WH}</math> and <math>\gamma \leq \frac{h}{w} \leq \gamma^{-1}</math>. The smallest crop therefore has an area of at least <math>\beta WH</math>, and changes in aspect ratio are limited by <math>\gamma</math>. In practice <math>\beta = 0.0001, \gamma = 0.75</math> are good choices.<br />
<br />
Second, images are rotated by <math>\alpha</math> degrees, where <math>-35 \leq \alpha \leq 35</math>. Images are flipped with 50% probability.<br />
<br />
Finally, colour and intensity of single pixels are linearly transformed to provide changes of illumination, as is common in natural images.<br />
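<br />
The crop, rotation and flip constraints above can be sampled as in the following sketch. The sampling distributions (uniform area fraction, log-uniform aspect ratio, uniform angle), the rejection loop, and the function names are assumptions made for illustration; the text only fixes the bounds <math>\beta</math>, <math>\gamma</math>, the angle range, and the flip probability.<br />

```python
import math
import random


def sample_crop_size(W, H, beta=0.0001, gamma=0.75, rng=random):
    """Sample a crop (w, h) with beta <= wh/(WH) and gamma <= h/w <= 1/gamma."""
    while True:
        area = rng.uniform(beta, 1.0) * W * H
        # Log-uniform aspect ratio h / w in [gamma, 1 / gamma] (assumed distribution).
        ratio = math.exp(rng.uniform(math.log(gamma), -math.log(gamma)))
        w = math.sqrt(area / ratio)
        h = math.sqrt(area * ratio)
        if w <= W and h <= H:  # reject crops that do not fit in the image
            return w, h


def sample_augmentation(W, H, rng=random):
    """Sample crop size, rotation angle in [-35, 35] degrees, and a 50% flip."""
    w, h = sample_crop_size(W, H, rng=rng)
    return {
        "crop_w": w,
        "crop_h": h,
        "angle_deg": rng.uniform(-35.0, 35.0),
        "flip": rng.random() < 0.5,
    }


print(sample_augmentation(640, 480, rng=random.Random(0)))
```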
<br />
=== Quantitative Analysis ===<br />
They compared the learned filters of all first-layer convolutions of an AlexNet trained with the different methods and a single image. They also reported the results of retraining a network with the first two convolutional filters, or the scattering transform from (Oyallon et al., 2017), left frozen. They further observed that their single-image-trained DeepCluster and BiGAN models achieve performance close to the supervised benchmark. Lastly, they show how their features trained on only a single image can be used for other applications.<br />
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how linearly separable the representations at each level are. Table 1 shows the classification accuracy of the linear classifier trained on top of each convolutional layer.<br />
According to the results, training the CNN with self-supervision methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only one single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
[[File:Capture123.PNG|500px|center]]<br />
<div align="center">'''Table 1:''' ImageNet LSVRC-12 linear probing evaluation. Activations of pretrained layers are used to train a linear classifier. </div><br />
<br />
<br />
[[File:critical_analysis.png|500px|center]]<br />
<br />
The above table (Table 3) reports the accuracy of linear classifiers on different network layers on the CIFAR-10 and CIFAR-100 datasets.<br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conducted interesting experiments to show that the first few layers of CNNs contain only limited information for analyzing natural images. They saw this by examining the weights of the early layers in cases where they trained using only a single image with heavy data augmentation. Specifically, sufficient data augmentation was enough to make up for a lack of data in early CNN layers. However, this technique was not able to elicit proper learning in deeper CNN layers. In fact, even millions of images were not enough to elicit proper learning without supervision. Accordingly, current unsupervised learning is mostly about augmentation, and we probably do not yet use the capacity of a million images.<br />
<br />
== Critique == <br />
This is a well-written paper. However, since its main contribution is experimental, I expected a more in-depth analysis. For example, it would be interesting to see how these results change if AlexNet is replaced with a more powerful CNN such as EfficientNet. The authors could also try other types of self-supervised tasks, such as the jigsaw task [6] and the state-of-the-art PIRL [9].<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.<br />
<br />
[7] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In Proc. ICCV, 2011.<br />
<br />
[8] A. Rodriguez, V. Naresh Boddeti, BVK V. Kumar, and A. Mahalanobis. Maximum margin correlation filter: A new approach for localization and classification. IEEE Transactions on Image Processing, 22(2):631–643, 2013<br />
<br />
[9] I. Misra and L. van der Maaten, "Self-Supervised Learning of Pretext-Invariant Representations," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation&diff=49622One-Shot Object Detection with Co-Attention and Co-Excitation2020-12-07T00:51:18Z<p>Jlavilez: /* Approach */</p>
<hr />
<div>== Presented By ==<br />
Gautam Bathla<br />
<br />
== Background ==<br />
<br />
Object Detection is a technique where the model gets an image as an input and outputs the class and location of all the objects present in the image. The aim is to take a query image patch whose class label is not included in the training data and detect all instances of the same class in a target image.<br />
<br />
[[File:object_detection.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Object Detection on an image</div><br />
<br />
Figure 1 shows an example where the model identifies and locates all the instances of different objects present in the image successfully. It encloses each object within a bounding box and annotates each box with the class of the object present inside the box.<br />
<br />
State-of-the-art object detectors are trained on thousands of images for different classes before the model can accurately predict the class and spatial location for unseen images belonging to the classes the model has been trained on. When a model is trained with K labeled instances for each of N classes, then this setting is known as N-way K-shot classification. K = 0 for zero-shot learning, K = 1 for one-shot learning and K > 1 for few-shot learning.<br />
<br />
== Introduction ==<br />
<br />
This paper tackles the problem of one-shot object detection: given a query image patch ''p'', the model needs to find all instances in the target image of the object shown in ''p''. The target and query image do not need to be exactly the same and are allowed to have variations as long as they share some attributes so that they can belong to the same category. In this paper, the authors have made contributions to three technical areas. First is the use of non-local operations to generate better region proposals for the target image based on the query image. This operation can be thought of as a co-attention mechanism. The second contribution is a Squeeze and Co-Excitation mechanism that identifies and gives more importance to relevant features, in order to select the relevant proposals and hence the instances in the target image. Third, the authors designed a margin-based ranking loss that is useful for predicting the similarity of region proposals to the given query image, irrespective of whether the class label was seen or unseen during training.<br />
<br />
== Previous Work ==<br />
<br />
All state-of-the-art object detectors are variants of deep convolutional neural networks. There are two types of object detectors:<br />
<br />
1) Two-Stage Object Detectors: These detectors generate region proposals in the first stage, then classify and refine the proposals in the second stage. E.g., Faster R-CNN [1].<br />
<br />
2) One-Stage Object Detectors: These detectors directly predict bounding boxes and their corresponding labels based on a fixed set of anchors. E.g., CornerNet [2].<br />
<br />
The work done to tackle the problem of few-shot object detection is based on transfer learning [3], meta-learning [4], and metric-learning.<br />
<br />
1) Transfer Learning: Chen et al. [3] proposed a regularization technique to reduce overfitting when the model is trained on just a few instances for each class belonging to unseen classes.<br />
<br />
2) Meta-Learning: Kang et al. [4] trained a meta-model to re-weight the learned weights of an image extracted from the base model.<br />
<br />
3) Metric-Learning: These frameworks replace the conventional classifier layer with the metric-based classifier layer.<br />
<br />
== Approach ==<br />
<br />
Let <math> C </math> be the set of classes for this object detection task. Since the one-shot object detection task requires classes that are unseen at inference time, we divide the set of classes into two categories as follows:<br />
<br />
<div style="text-align: center;"><math> C = C_0 \bigsqcup C_1,</math></div><br />
<br />
where <math>C_0</math> represents the classes that the model is trained on and <math>C_1</math> represents the classes on which the inference is done.<br />
<br />
[[File:architecture_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Architecture</div><br />
<br />
Figure 2 shows the architecture of the model proposed in this paper. The model architecture is based on Faster R-CNN [1], and ResNet-50 [5] is used as the backbone for extracting features from the images. The target image and the query image are first passed through the ResNet-50 module to extract features from the same convolutional layer. The features obtained are then passed into the Non-local block, whose output consists of weighted features for each of the images. The new weighted feature sets for both images are passed through the Squeeze and Co-excitation block, which outputs the re-weighted features that act as input to the Region Proposal Network (RPN) module. The RCNN module also includes a new loss that is designed by the authors to rank proposals in order of their relevance.<br />
<br />
==== Non-Local Object Proposals ====<br />
<br />
The need for non-local object proposals arises because the RPN module used in Faster R-CNN [1] has access to bounding box information for each class in the training dataset. The classes used for training and inference in the case of Faster R-CNN [1] are not exclusive. In this problem, as defined above, the set of classes is divided into two parts: one part is used for training and the other is used during inference. Therefore, the classes in the two sets are exclusive. If the conventional RPN module is used, it will not be able to generate good proposals for images during inference because it will not have any bounding-box information for those classes.<br />
<br />
To resolve this problem, a non-local operation is applied to both sets of features. This non-local operation is defined as:<br />
\begin{align}<br />
y_i = \frac{1}{C(z)} \sum_{\forall j}^{} f(x_i, z_j)g(z_j) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
where ''x'' is a vector on which this operation is applied, ''z'' is a vector which is taken as an input reference, ''i'' is the index of output position, ''j'' is the index that enumerates over all possible positions, ''C(z)'' is a normalization factor, <math>f(x_i, z_j)</math> is a pairwise function like Gaussian, Dot product, concatenation, etc., <math>g(z_j)</math> is a linear function of the form <math>W_z \times z_j</math>, and ''y'' is the output of this operation.<br />
<br />
Let the feature maps obtained from the ResNet-50 model be <math> \phi{(I)} \in R^{N \times W_I \times H_I} </math> for target image ''I'' and <math> \phi{(p)} \in R^{N \times W_p \times H_p} </math> for query image ''p''. Taking <math> \phi{(p)} </math> as the input reference, the non-local operation is applied to <math> \phi{(I)} </math> and results in a non-local block, <math> \psi{(I;p)} \in R^{N \times W_I \times H_I} </math> . Analogously, we can derive the non-local block <math> \psi{(p;I)} \in R^{N \times W_p \times H_p} </math> using <math> \phi{(I)} </math> as the input reference. <br />
<br />
We can express the extended feature maps as:<br />
<br />
\begin{align}<br />
{F(I) = \phi{(I)} \oplus \psi{(I;p)} \in R^{N \times W_I \times H_I}} \&nbsp;\&nbsp;;\&nbsp;\&nbsp; {F(p) = \phi{(p)} \oplus \psi{(p;I)} \in R^{N \times W_p \times H_p}} \tag{2} \label{eq:o1}<br />
\end{align}<br />
<br />
where ''F(I)'' denotes the extended feature map for target image ''I'', ''F(p)'' denotes the extended feature map for query image ''p'' and <math>\oplus</math> denotes element-wise sum over the feature maps <math>\phi{}</math> and <math>\psi{}</math>.<br />
<br />
As can be seen above, the extended feature set for the target image ''I'' contains not only features from ''I'' but also the weighted sum of the target image and the query image. The same can be observed for the query image. This weighted sum is a co-attention mechanism, and with the help of the extended feature maps, better proposals are generated when they are fed to the RPN module.<br />
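<br />
Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, assuming the dot-product form of <math>f</math> and the choice of ''C(z)'' as the number of positions in ''z'' that are reported later in this summary. The function names and the randomly initialised matrices <code>W_alpha</code>, <code>W_beta</code> and <code>W_z</code> are hypothetical stand-ins for learned parameters.<br />

```python
import numpy as np


def non_local(x, z, W_alpha, W_beta, W_z):
    """Non-local operation of Eq. (1).

    x: (N, Px) features being updated, one column per position i.
    z: (N, Pz) reference features, one column per position j.
    Returns y: (N, Px), with y_i = (1 / Pz) * sum_j f(x_i, z_j) g(z_j).
    """
    f = (W_alpha @ x).T @ (W_beta @ z)  # (Px, Pz) dot-product affinities
    g = W_z @ z                         # (N, Pz) linear embedding of z
    return (g @ f.T) / z.shape[1]       # normalise by C(z) = Pz


def co_attention(phi_I, phi_p, params_I, params_p):
    """Extended feature maps of Eq. (2): F = phi + psi (element-wise sum)."""
    n_channels = phi_I.shape[0]
    x_I = phi_I.reshape(n_channels, -1)  # flatten spatial dimensions
    x_p = phi_p.reshape(n_channels, -1)
    psi_I = non_local(x_I, x_p, *params_I).reshape(phi_I.shape)
    psi_p = non_local(x_p, x_I, *params_p).reshape(phi_p.shape)
    return phi_I + psi_I, phi_p + psi_p
```

Note that the output for each image keeps that image's own spatial size, which is why the element-wise sum in Equation (2) is well defined even when the target and query feature maps differ in size.<br />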
<br />
==== Squeeze and Co-Excitation ====<br />
<br />
The two feature maps generated from the non-local block above can be further related by identifying the important channels and re-weighting the channels accordingly. This is the basic purpose of this module. The Squeeze layer summarizes each feature map by applying Global Average Pooling (GAP) on the extended feature map for the query image. The Co-Excitation layer gives attention to feature channels that are important for evaluating the similarity metric. The whole block can be represented as:<br />
<br />
\begin{align}<br />
SCE(F(I), F(p)) = w \&nbsp;\&nbsp;;\&nbsp;\&nbsp; F(\tilde{p}) = w \odot F(p) \&nbsp;\&nbsp;;\&nbsp;\&nbsp; F(\tilde{I}) = w \odot F(I)\tag{3} \label{eq:op2}<br />
\end{align}<br />
<br />
where ''w'' is the excitation vector, and <math>F(\tilde{p})</math> and <math>F(\tilde{I})</math> are the re-weighted feature maps for the query and target image respectively.<br />
<br />
In between the Squeeze layer and Co-Excitation layer, there exist two fully-connected layers followed by a sigmoid layer which helps to learn the excitation vector ''w''. The ''Channel Attention'' module in the architecture is basically these fully-connected layers followed by a sigmoid layer.<br />
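<br />
A minimal NumPy sketch of Equation (3) is given below. The weight matrices <code>W1</code> and <code>W2</code> are hypothetical placeholders for the two fully-connected layers, biases are omitted, and the ReLU between the two layers is an assumption borrowed from the usual squeeze-and-excitation design; the summary above only states that two fully-connected layers are followed by a sigmoid.<br />

```python
import numpy as np


def squeeze_co_excitation(F_I, F_p, W1, W2):
    """Re-weight target and query features with one shared excitation vector.

    F_I: (N, W_I, H_I) extended target features.
    F_p: (N, W_p, H_p) extended query features.
    W1:  (hidden, N) first fully-connected layer (hypothetical weights).
    W2:  (N, hidden) second fully-connected layer (hypothetical weights).
    """
    squeezed = F_p.mean(axis=(1, 2))           # Squeeze: GAP over the query map
    hidden = np.maximum(W1 @ squeezed, 0.0)    # FC + ReLU (ReLU is assumed)
    w = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))   # FC + sigmoid -> excitation w
    # Co-excitation: the same channel weights scale both feature maps.
    return w, w[:, None, None] * F_I, w[:, None, None] * F_p
```

Because ''w'' is computed from the query and applied to both feature maps, channels that matter for the query object are emphasised in the target features as well.<br />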
<br />
==== Margin-based Ranking Loss ====<br />
<br />
The authors have defined a two-layer MLP network ending with a softmax layer to learn a similarity metric which will help rank the proposals generated by the RPN module. In the first stage of training, each proposal is annotated with 0 or 1 based on the IoU value of the proposal with the ground-truth bounding box. If the IoU value is greater than 0.5 then that proposal is labeled as 1 (foreground) and 0 (background) otherwise.<br />
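<br />
The labelling rule above can be sketched as follows. This is an illustrative example only: the box format <code>(x1, y1, x2, y2)</code>, the function names, and the restriction to a single ground-truth box are assumptions made to keep the sketch short.<br />

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_proposals(proposals, gt_box, threshold=0.5):
    """Label each proposal 1 (foreground) if IoU > threshold, else 0."""
    return [1 if iou(p, gt_box) > threshold else 0 for p in proposals]
```

For example, a proposal that overlaps half of a ground-truth box of the same size has an IoU of 1/3 and is therefore labelled as background.<br />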
<br />
Let ''q'' be the feature vector obtained after applying GAP to the query image patch obtained from the Squeeze and Co-Excitation block and ''r'' be the feature vector obtained after applying GAP to the region proposals generated by the RPN module. The two vectors are concatenated to form a new vector ''x'' which is the input to the two-layer MLP network designed. We can define ''x = [<math>r^T;q^T</math>]''. Let ''M'' be the model representing the two-layer MLP network, then <math>s_i = M(x_i)</math>, where <math>s_i</math> is the probability of <math>i^{th}</math> proposal being a foreground proposal based on the query image patch ''q''.<br />
<br />
The margin-based ranking loss is given by:<br />
<br />
\begin{align}<br />
L_{MR}(\{x_i\}) = \sum_{i=1}^{K}y_i \times max\{m^+ - s_i, 0\} + (1-y_i) \times max\{s_i - m^-, 0\} + \delta_{i} \tag{4} \label{eq:op3}<br />
\end{align}<br />
\begin{align}<br />
\delta_{i} = \sum_{j=i+1}^{K}[y_i = y_j] \times max\{|s_i - s_j| - m^-, 0\} + [y_i \ne y_j] \times max\{m^+ - |s_i - s_j|, 0\} \tag{5} \label{eq:op4}<br />
\end{align}<br />
<br />
where ''[.]'' is the Iverson bracket, i.e. the output will be 1 if the condition inside the bracket is true and 0 otherwise, <math>m^+</math> is the expected lower bound probability for predicting a foreground proposal, <math>m^-</math> is the expected upper bound probability for predicting a background proposal and <math>K</math> is the number of candidate proposals from RPN.<br />
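<br />
Equations (4) and (5) can be written out directly in plain Python. The sketch below is a literal, unoptimised transcription for illustration; the function name is hypothetical, and the default margins 0.7 and 0.3 are the values reported for this paper.<br />

```python
def margin_ranking_loss(scores, labels, m_pos=0.7, m_neg=0.3):
    """Margin-based ranking loss of Eq. (4) with the pairwise term of Eq. (5).

    scores: foreground probabilities s_i, one per proposal.
    labels: 0/1 labels y_i (1 = foreground, 0 = background).
    """
    K = len(scores)
    total = 0.0
    for i in range(K):
        # Unary terms: foreground scores should exceed m_pos,
        # background scores should stay below m_neg.
        total += labels[i] * max(m_pos - scores[i], 0.0)
        total += (1 - labels[i]) * max(scores[i] - m_neg, 0.0)
        # Pairwise term delta_i over all later proposals j > i.
        for j in range(i + 1, K):
            gap = abs(scores[i] - scores[j])
            if labels[i] == labels[j]:
                total += max(gap - m_neg, 0.0)   # same label: scores close
            else:
                total += max(m_pos - gap, 0.0)   # different label: scores apart
    return total
```

As a quick check, a confident foreground score of 0.9 paired with a background score of 0.1 incurs no loss, whereas two undecided scores of 0.5 with different labels are penalised by both the unary and the pairwise terms.<br />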
<br />
The total loss for the model is given as:<br />
<br />
\begin{align}<br />
L = L_{CE} + L_{Reg} + \lambda \times L_{MR} \tag{6} \label{eq:op5}<br />
\end{align}<br />
<br />
where <math>L_{CE}</math> is the cross-entropy loss, <math>L_{Reg}</math> is the regression loss for bounding boxes of Faster R-CNN [1] and <math>L_{MR}</math> is the margin-based ranking loss defined above.<br />
<br />
For this paper, <math>m^+</math> = 0.7, <math>m^-</math> = 0.3, <math>\lambda</math> = 3, K = 128, C(z) in \eqref{eq:op} is the total number of elements in a single feature map of vector ''z'', and <math>f(x_i, z_j)</math> in \eqref{eq:op} is a dot product operation.<br />
\begin{align}<br />
f(x_i, z_j) = \alpha(x_i)^T \beta(z_j)\&nbsp;\&nbsp;;\&nbsp;\&nbsp;\alpha(x_i) = W_{\alpha} x_i \&nbsp;\&nbsp;;\&nbsp;\&nbsp; \beta(z_j) = W_{\beta} z_j \tag{7} \label{eq:op6}<br />
\end{align}<br />
<br />
== Results ==<br />
<br />
The model is trained and tested on two popular datasets, VOC and COCO. The ResNet-50 backbone was pre-trained on a reduced ImageNet dataset from which all the classes present in the COCO dataset were removed, thus ensuring that the model has not seen any of the classes belonging to the inference images.<br />
<br />
==== Results on VOC Dataset ====<br />
<br />
[[File: voc_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 1:''' Results on VOC dataset</div><br />
<br />
For the VOC dataset, the model is trained on the union of VOC 2007 train and validation sets and VOC 2012 train and validation sets, whereas the model is tested on VOC 2007 test set. From the VOC results (Table 1), it can be seen that the model with pre-trained ResNet-50 on a reduced training set as the CNN backbone (Ours(725)) achieves better performance on seen and unseen classes than the baseline models. When the pre-trained ResNet-50 on the full training set (Ours(1K)) is used as the CNN backbone, then the performance of the model is increased significantly.<br />
<br />
==== Results on MSCOCO Dataset ====<br />
<br />
[[File: mscoco_splits.png|750px|center|Image: 500 pixels]]<br />
[[File: mscoco_results_object_detection.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' Results on COCO dataset</div><br />
<br />
The model is trained on the COCO train2017 set and evaluated on the COCO val2017 set. The classes are divided into four groups (splits); the model is trained on images belonging to three splits and evaluated on images belonging to the fourth split. From Table 2, it can be seen that the model achieves better accuracy than the baseline model. The bar chart in the split figure shows the performance of the model on each class separately. The model has some difficulty detecting objects of classes such as book (split2), handbag (split3), and tie (split4) because of variations in their shapes and textures.<br />
<br />
==== Overall Performance ====<br />
For VOC, the model that uses the reduced ImageNet model backbone with 725 classes achieves a better performance on both the seen and unseen classes. Remarkable improvements in the performance are seen with the backbone with 1000 classes. For COCO, the model achieves better accuracy than the Siamese Mask-RCNN model for both the seen and unseen classes.<br />
<br />
== Ablation Studies ==<br />
<br />
==== Effect of all the proposed techniques on the final result ====<br />
<br />
[[File: one_shot_detector_results.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 3:''' Effect of all three techniques combined</div><br />
<br />
Figure 3 shows the effect of the three proposed techniques on the evaluation metric. The model performs worst when neither Co-attention nor Co-excitation mechanism is used. But, when either Co-attention or Co-excitation is used then the performance of the model is improved significantly. The model performs best when all the three proposed techniques are used.<br />
<br />
<br />
In order to understand the effect of the proposed modules, the authors analyzed each module separately.<br />
<br />
==== Visualizing the effect of Non-local RPN ====<br />
<br />
To demonstrate the effect of Non-local RPN, a heatmap of generated proposals is constructed. Each pixel is assigned the count of how many proposals cover that particular pixel and the counts are then normalized to generate a probability map.<br />
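<br />
The heatmap construction described above can be sketched in NumPy. This is an illustration under assumptions: boxes are taken to be integer pixel coordinates <code>(x1, y1, x2, y2)</code> with exclusive upper bounds, and the counts are normalised by their sum so that the map sums to one; the paper may use a different normalisation.<br />

```python
import numpy as np


def proposal_heatmap(proposals, height, width):
    """Count how many proposals cover each pixel, then normalise the counts.

    proposals: iterable of integer boxes (x1, y1, x2, y2), upper bounds exclusive.
    Returns a (height, width) array that sums to 1 (all zeros if no coverage).
    """
    counts = np.zeros((height, width), dtype=float)
    for x1, y1, x2, y2 in proposals:
        counts[y1:y2, x1:x2] += 1.0  # every covered pixel gets one vote
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Pixels covered by many proposals receive a high value, so regions where the RPN concentrates its proposals show up as bright areas.<br />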
<br />
[[File: one_shot_non_local_rpn.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 4:''' Visualization of Non-local RPN</div><br />
<br />
From Figure 4, it can be seen that when a non-local RPN is used instead of a conventional RPN, the model is able to give more attention to the relevant region in the target image.<br />
<br />
==== Analyzing and Visualizing the effect of Co-Excitation ====<br />
<br />
To visualize the effect of excitation vector ''w'', the vector is calculated for all images in the inference set which are then averaged over images belonging to the same class, and a pair-wise Euclidean distance between classes is calculated.<br />
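<br />
The computation described above amounts to a per-class mean followed by a pairwise distance matrix. A minimal NumPy sketch is shown below; the function name and the input layout (one excitation vector per row, one class label per image) are assumptions made for illustration.<br />

```python
import numpy as np


def class_distance_matrix(excitations, labels):
    """Average excitation vectors per class and compute pairwise distances.

    excitations: (num_images, N) array, one excitation vector w per image.
    labels: sequence of class names, one per image.
    Returns (classes, D) where D[a, b] is the Euclidean distance between
    the mean excitation vectors of classes[a] and classes[b].
    """
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    means = np.stack([excitations[labels == c].mean(axis=0) for c in classes])
    diff = means[:, None, :] - means[None, :, :]   # (C, C, N) differences
    return classes, np.sqrt((diff ** 2).sum(axis=-1))
```

The resulting matrix is symmetric with a zero diagonal, which matches the kind of class-by-class distance map shown in the figure.<br />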
<br />
[[File: one_shot_excitation.png|250px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 5:''' Visualization of Co-Excitation</div><br />
<br />
From Figure 5, it can be observed that the Co-Excitation mechanism is able to assign meaningful weight distribution to each class. The weights for classes related to animals are closer to each other and the ''person'' class is not close to any other class because of the absence of common attributes between ''person'' and any other class in the dataset.<br />
<br />
[[File: analyzing_co_excitation_1.png|Analyzing Co-Excitation|500px|left|bottom|Image: 500 pixels]]<br />
<br />
[[File: analyzing_co_excitation_2.png|Analyzing Co-Excitation|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 6:''' Analyzing Co-Excitation</div><br />
<br />
To analyze the effect of Co-Excitation, the authors used two different scenarios. In the first scenario (Figure 6, left), the same target image is used for different query images. The query images <math>p_1</math> and <math>p_2</math> have a similar color to the target image, whereas the query images <math>p_3</math> and <math>p_4</math> show an object of a different color than the target image. When the pair-wise Euclidean distances between the excitation vectors in the four cases were calculated, <math>w_2</math> was closer to <math>w_1</math> than to <math>w_4</math>, and <math>w_3</math> was closer to <math>w_4</math> than to <math>w_1</math>. Therefore, it can be concluded that <math>w_1</math> and <math>w_2</math> give more importance to the texture of the object whereas <math>w_3</math> and <math>w_4</math> give more importance to channels representing the shape of the object.<br />
<br />
The same observation can be made in scenario 2 (Figure 6, right), where the same query image was used for different target images. <math>w_1</math> and <math>w_2</math> are closer to <math>w_a</math> than <math>w_b</math> whereas <math>w_3</math> and <math>w_4</math> are closer to <math>w_b</math> than <math>w_a</math>. Since images <math>I_1</math> and <math>I_2</math> contain an object of similar color to the query image, we can say that <math>w_1</math> and <math>w_2</math> give more weight to the channels representing the texture of the object, and <math>w_3</math> and <math>w_4</math> give more weight to the channels representing shape.<br />
<br />
== Conclusion ==<br />
<br />
The resulting one-shot object detector outperforms all the baseline models on VOC and COCO datasets. The authors have also provided insights about how the non-local proposals, serving as a co-attention mechanism, can generate relevant region proposals in the target image and put emphasis on the important features shared by both target and query image.<br />
<br />
== Critiques ==<br />
<br />
The techniques proposed by the authors improve the performance of the model significantly: when either Co-attention or Co-excitation is used along with the margin-based ranking loss, the model can detect instances of the query object in the target image. Also, the trained model is generic and does not require any additional training or fine-tuning to detect unseen classes in the target image. The designed loss means that the learning process does not rely only on the labels of images, since each proposal is annotated as foreground or background and this annotation is then used to calculate the loss.<br />
Since the architecture combines several deep neural network modules, one critique is how time-consuming the proposed model is. The paper could have discussed more thoroughly whether the method is too time-consuming.<br />
<br />
== Source Code==<br />
[https://github.com/timy90022/One-Shot-Object-Detection One-Shot-Object-Detection]<br />
<br />
== References ==<br />
<br />
[1] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.<br />
<br />
[2] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, pages 765–781, 2018<br />
<br />
[3] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2836–2843, 2018.<br />
<br />
[4] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. CoRR, abs/1812.01866, 2018.<br />
<br />
[5] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data&diff=49613Learning The Difference That Makes A Difference With Counterfactually-Augmented Data2020-12-06T23:42:29Z<p>Jlavilez: Added section on related work</p>
<hr />
<div>== Presented by == <br />
Syed Saad Naseem<br />
<br />
== Introduction == <br />
This paper addresses the problem of building models for NLP tasks that are robust against spurious correlations in the data. The authors tackle this problem by introducing a human-in-the-loop method in which human annotators were hired to modify data so that it represents the opposite label. For example, if a text had a positive sentiment, the annotators changed the text so that it represents a negative sentiment while making minimal changes to the text. They refer to this process as counterfactual augmentation. The authors apply this method to the IMDB sentiment dataset and to SNLI and show that many models cannot perform well on the augmented dataset when trained only on the original dataset and vice versa. The human-in-the-loop system, which is designed for counterfactually manipulating documents, aims to disentangle the spurious and non-spurious associations by intervening only upon the factor of interest, yielding classifiers that hold up better when spurious associations do not transport out of the domain.<br />
<br />
== Background == <br />
'''What are spurious patterns in NLP, and why do they occur?'''<br />
<br />
Current supervised machine learning systems try to learn the underlying features of input data that associate the inputs with the corresponding labels. Take Twitter sentiment analysis as an example: there might be many negative tweets about Donald Trump. If we use those tweets as training data, the ML systems tend to associate "Trump" with the label Negative. However, the word itself is completely neutral. The association between the word "Trump" and the label Negative is spurious. One way to explain why this occurs is that association does not necessarily mean causation. For example, the color gold might be associated with success, but it does not cause success. Current ML systems might learn such undesired associations and then make deductions from them. This is typically caused by an inherent bias within the data. ML models then learn the inherent bias, which leads to biased predictions.<br />
<br />
== Data Collection ==<br />
The authors used Amazon’s Mechanical Turk, a crowdsourcing platform, to recruit editors. They hired these editors to revise each document. <br />
<br />
'''Sentiment Analysis'''<br />
<br />
The dataset to be analyzed is the IMDb movie review dataset. The annotators were directed to revise the reviews to make them counterfactual, without making any gratuitous changes. There are several types of changes that were applied and two examples are listed below, where red represents original text and blue represents modified text.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Type of Change !! Original Review !! Modified Review<br />
|-<br />
| Change ratings || one of the worst ever scenes in a sports movie. <span style="color:red">3 stars out of 10</span>. || one of the wildest ever scenes in a sports movie. <span style="color:blue">8 stars out of 10</span>.<br />
|-<br />
| Suggest sarcasm || thoroughly captivating <span style="color:red">thriller-drama, taking a deep and realistic</span> view. || thoroughly mind numbing <span style="color:blue">“thriller-drama”, taking a “deep” and “realistic” (who are they kidding?)</span> view.<br />
|}<br />
<br />
[[File:jaccard_similarity_results.png|500px|center]]<br />
<br />
A deeper understanding of what is actually causing the reviews to be positive/negative could be obtained when the counterfactually-revised reviews were compared with the corresponding original reviews. The indices corresponding to replacements/insertions were marked and the edits in the original review were represented by a binary vector. The Jaccard similarity between the two reviews was computed, and it was observed to be negatively correlated with the length of the review (shown above).<br />
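<br />
For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to sets of lower-cased, whitespace-separated tokens; this tokenisation and the function name are assumptions for illustration and may differ from what the authors used.<br />

```python
def jaccard_similarity(original, revised):
    """Jaccard similarity between the token sets of two texts."""
    tokens_a = set(original.lower().split())
    tokens_b = set(revised.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

A pair of reviews that differ in a single word out of five shares four of six distinct tokens, giving a similarity of 2/3.<br />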
<br />
'''Natural Language Inference'''<br />
<br />
The NLI is a 3-class classification task, where the inputs are a premise and a hypothesis. Given the inputs, the model predicts a label that is meant to describe the relationship between the facts stated in each sentence. The labels can be entailment, contradiction, or neutral. The annotators were asked to modify the premise of the text while keeping the hypothesis intact and vice versa. Some examples of modifications are given below with labels given in the parentheses.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Premise !! Original Hypothesis !! Modified Hypothesis<br />
|-<br />
| A young dark-haired woman crouches on the banks of a river while washing dishes. || A woman washes dishes in the river <span style="color:red">while camping</span> (Neutral) || A woman washes dishes <span style="color:blue">in the river</span>. (Entailment)<br />
|-<br />
| Students are inside of a lecture hall || Students are <span style="color:red">indoors</span>. (Entailment) || Students are <span style="color:blue">on the soccer field</span>. (Contradiction)<br />
|-<br />
| An older man with glasses raises his eyebrows in surprise. || The man <span style="color:red">has no glasses</span>. (Contradiction) || The man <span style="color:blue">wears bifocals</span>. (Neutral)<br />
|}<br />
<br />
After the data collection, a different set of workers was employed to verify whether the given label accurately described the relationship between each premise-hypothesis pair. Each pair was presented to 3 workers and the pair was only accepted if all 3 of the workers approved that the text is accurate. This entire process cost the authors about $10778.<br />
<br />
== Example ==<br />
In the picture below, we can see an example of spurious correlation and how the method presented here can address it. The picture shows the most important features learned by an SVM. As we can see in the left plot, when the model is trained only on the original data, the word "horror" is associated with the negative label and the word "romantic" is associated with the positive label. This is an example of spurious correlation because we can certainly have both bad romantic movies and good horror movies. The middle plot shows the case where the model is trained only on the revised dataset. As expected, the situation is reversed: "horror" and "romantic" are associated with the positive and negative labels respectively. However, the problem is solved in the right plot, where the authors trained the model on both the original and the revised datasets. The words "horror" and "romantic" are no longer among the most important features, which is the desired outcome.<br />
<br />
[[File: SVM features.png | center |800px]]<br />
<br />
== Experiments ==<br />
===Sentiment Analysis===<br />
The authors carried out experiments on a total of 5 models: Support Vector Machines (SVMs), Naive Bayes (NB) classifiers, Bidirectional Long Short-Term Memory Networks, ELMo models with LSTM, and fine-tuned BERT models. They also evaluated these models on Amazon reviews datasets aggregated over six genres, on a Twitter sentiment dataset, and on Yelp reviews released as part of a Yelp dataset challenge. They showed that in almost all cases, models trained on the counterfactually-augmented IMDb dataset perform better than models trained on comparable quantities of original data, as shown in the table below.<br />
<br />
[[File:result1_syed.PNG]]<br />
<br />
===Natural Language Inference===<br />
<br />
To evaluate the BERT model on the SNLI task, the authors used different combinations of training and evaluation sets. BERT fine-tuned on the original data (1.67k) performs well on the original eval set; however, the accuracy drops from 72.2% to 39.7% when evaluated on the RP (Revised Premise) set. The same holds even with the full original set (500k): the accuracy of the model drops significantly on the RP, RH (Revised Hypothesis), and RP&RH datasets. Table 7 shows that the BERT model fine-tuned on a combination of RP and RH achieves consistent performance on all datasets.<br />
<br />
[[File:NLI.png|center]]<br />
== Source Code ==<br />
<br />
The official code is available at https://github.com/acmi-lab/counterfactually-augmented-data .<br />
<br />
== Related Work ==<br />
<br />
The authors broadly describe non-spuriousness as "the difference that makes the difference". They mention that there is some literature in which NLP systems are unable to pinpoint what humans would consider "the difference that makes the difference". For instance, the work by Jia and Liang shows that some SOTA models are unstable with respect to distractor phrases [4]. Other work shows that SOTA models can do poorly with respect to classifying paraphrased sentences [5]. As a last example, some work shows that ML-based NLI systems can be broken by changing words by synonyms or hypernyms [6]. <br />
<br />
The proposed counterfactual augmentation of semantic datasets is a useful means to avoid the problems highlighted in [4,5,6] by means of asking humans to (i) provide counterfactual labels, (ii) retain internal coherence, and (iii) avoid unnecessary changes.<br />
<br />
== Conclusion ==<br />
<br />
The authors propose a new way to augment textual datasets for the task of sentiment analysis. This helps the learning methods generalize better by concentrating on learning the difference that makes a difference. I believe that the main contribution of the paper is the introduction of the idea of counterfactual datasets for sentiment analysis. The paper proposes an interesting approach to tackle NLP problems, shows intriguing experimental results, and presents us with an interesting dataset that may be useful for future research. Indeed, this work has been cited in several interesting works examining gender bias in NLP [1], making AI programs more ethical [2], and generating humor text [3].<br />
<br />
== References ==<br />
<br />
[1] Lu, K., Mardziel, P., Wu, F., Amancharla, P., & Datta, A. (2018). Gender Bias in Neural Natural Language Processing.<br />
<br />
[2] Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2020). Aligning AI With Shared Human Values. 1–22.<br />
<br />
[3] Weller, O., Fulda, N., & Seppi, K. (2020). Can Humor Prediction Datasets be used for Humor Generation? Humorous Headline Generation via Style Transfer. 186–191.<br />
<br />
[4] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP), 2017.<br />
<br />
[5] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversarial example generation with syntactically controlled paraphrase networks. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.<br />
<br />
[6] Max Glockner, Vered Shwartz, and Yoav Goldberg. Breaking nli systems with sentences that require simple lexical inferences. In Association for Computational Linguistics (ACL), 2018.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm&diff=49568Augmix: New Data Augmentation method to increase the robustness of the algorithm2020-12-06T22:11:01Z<p>Jlavilez: </p>
<hr />
<div>== Presented by == <br />
Abhinav Chanana<br />
<br />
== Introduction == <br />
Machine learning algorithms often assume that the training data is a faithful representation of the data encountered during deployment. They generally ignore the possibility of receiving slightly corrupted inputs, which reduces their robustness and accuracy because the models try to fit the noise as well. Even a few corruptions can substantially degrade the performance of various models: Hendrycks & Dietterich (2019) showed that the classification error rose from 25% to 62% when some corruption was introduced on the ImageNet test set. <br />
The problem with training on some corruptions is that it encourages the network to memorize those specific corruptions, leaving it unable to generalize to new ones. The paper also provides evidence that networks trained on translation augmentations remain highly sensitive to the shifting of pixels.<br />
The paper introduces a new algorithm known as AugMix, a method which achieves new state-of-the-art results for robustness and uncertainty estimation while maintaining accuracy on standard benchmark datasets. The paper uses the CIFAR-10, CIFAR-100, and ImageNet datasets to confirm the results. AugMix utilizes stochasticity and diverse augmentations, a Jensen-Shannon Divergence consistency loss, and a formulation to mix multiple augmented images to achieve state-of-the-art performance.<br />
<br />
== Background ==<br />
Data Augmentation helps to increase the size of the dataset by creating variations of existing images. This helps the model to generalize better, prevent overfitting and make the model more robust. Basic types of data augmentation techniques are Flipping, Rotation, Shearing, Cropping, etc. In the Flipping technique, the image is flipped horizontally or vertically. In the Rotation technique, the image is rotated by a certain degree, whereas, in the Cropping technique, a part of the image is removed to make the object appear in different proportions in different positions in the image.<br />
<br />
== Approach ==<br />
<br />
At a high level, AugMix applies several basic augmentation techniques. These augmentations are often layered to create a high diversity of augmented images. The loss is computed using the Jensen-Shannon divergence.<br />
<br />
[[File:augmix_Milad.gif|center|1000px|Image: 1000 pixels]]<br />
<br />
<br />
The method proposed by the author can be divided into 3 major sections:<br />
<br />
'''1. Augmentations''': The author uses basic data augmentation chains and the composition of data augmentation operations using AutoAugment. A chain is created as shown in the figure above.<br />
<br />
'''2. Mixing''': The resulting images from these augmentation chains are combined by mixing. The author chose to use elementwise convex combinations for simplicity. The k-dimensional vector of convex coefficients is randomly sampled from a Dirichlet(α, . . . , α) distribution. The intuition behind using a Dirichlet distribution is that it allows us to sample coefficients from (0, 1) that sum to 1. Once these images are mixed, the author uses a “skip connection” to combine the result of the augmentation chain and the original image through a second random convex combination sampled from a Beta(α, α) distribution.<br />
<br />
'''3. Jensen-Shannon divergence''': The author augments the original loss function with the Jensen-Shannon divergence loss to enforce stable and consistent output: [[File:loss fn.png]]<br />
<br />
<math>p_\text{orig}</math>, <math>p_\text{augmix1}</math> and <math>p_\text{augmix2}</math> are the posterior distributions of the original input <math>x_\text{orig}</math>, and its augmented variants: <math>x_\text{augmix1}, x_\text{augmix2}</math>, respectively.<br />
<br />
The JS in the above formula means the Jensen-Shannon divergence. It measures the similarities between distributions and is based on KL divergence. However, the Jensen-Shannon divergence is symmetric and can be viewed as a smoothed and normalized version of KL divergence. The JS divergence is particularly helpful when we are comparing multiple distributions.<br />
<br />
[[File:augmix 3.png|center|1000px|Image: 1000 pixels]]<br />
<br />
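where KL denotes the KL divergence, computed between each of <math>p_\text{orig}</math>, <math>p_\text{augmix1}</math>, <math>p_\text{augmix2}</math> and their mixture.<br />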
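The consistency term can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the three-way form in which each distribution is compared against the equal-weight mixture of all three; the function names <code>kl_divergence</code> and <code>js_divergence</code> are illustrative and are not taken from the official code.<br />

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as 1-D arrays."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p_orig, p_augmix1, p_augmix2):
    """Jensen-Shannon divergence among three discrete distributions."""
    dists = [np.asarray(p, dtype=float) for p in (p_orig, p_augmix1, p_augmix2)]
    # Equal-weight mixture of the three posteriors
    mixture = sum(dists) / 3.0
    # Average KL divergence from each posterior to the mixture
    return sum(kl_divergence(p, mixture) for p in dists) / 3.0

p_orig = [0.7, 0.2, 0.1]
p_aug1 = [0.6, 0.3, 0.1]
p_aug2 = [0.1, 0.1, 0.8]
consistency = js_divergence(p_orig, p_aug1, p_aug2)
```

Unlike a plain KL term, this quantity is unchanged when the three arguments are permuted, and it is bounded above by <math>\log 3</math>.<br />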
<br />
<br />
The pseudocode for the algorithm:<br />
<br />
[[File:augmix 2.png|center|1000px|Image: 1000 pixels]]<br />
<br />
For example, the pseudocode can be implemented in '''Python''' as follows:<br />
<syntaxhighlight lang="python"><br />
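import numpy as np<br />
<br />
def augmix(orig_image, operations, k=3, alpha=1):<br />
    # Accumulator for the weighted mixture of the k augmented images<br />
    aug_image = np.zeros(orig_image.shape)<br />
    # Convex mixing weights: non-negative and summing to 1<br />
    weights = np.random.dirichlet(np.ones(k) * alpha)<br />
    for i in range(k):<br />
        op1, op2, op3 = np.random.choice(operations, 3)<br />
        # Choose a chain of depth 1, 2, or 3 with equal probability<br />
        depth = np.random.randint(1, 4)<br />
        chain_image = op1(orig_image)<br />
        if depth >= 2:<br />
            chain_image = op2(chain_image)<br />
        if depth == 3:<br />
            chain_image = op3(chain_image)<br />
        # Weight each chain by its Dirichlet coefficient<br />
        aug_image += weights[i] * chain_image<br />
    # Skip connection: blend the original image with the mixture<br />
    m = np.random.beta(alpha, alpha)<br />
    return m * orig_image + (1 - m) * aug_image<br />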
</syntaxhighlight><br />
<br />
== Data Set Used ==<br />
<br />
The authors use the following datasets for conducting the experiment.<br />
<br />
1. CIFAR-10: This dataset, along with the CIFAR-100 dataset, is a labeled subset of the 80 million tiny images dataset and was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset is composed of 60000 color images of 32x32 pixels. These images are in 10 classes, with 6000 images per class, 50000 for training, and 10000 for testing. This dataset is widely used in the computer vision literature to compare algorithms. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
2. CIFAR-100: The difference between this dataset and the CIFAR-10 dataset is that it includes 100 classes of images with 600 images per class. These classes are also grouped into 20 superclasses, e.g. the flowers superclass, which contains orchids, poppies, roses, sunflowers, and tulips. - https://www.cs.toronto.edu/~kriz/cifar.html<br />
<br />
3. ImageNet: This dataset aims to obtain at least 1000 images per "synonym set" or "synset" in the WordNet hierarchy. WordNet is a large lexical database of English where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. This dataset is currently home to 1.2 million labelled images. - http://image-net.org/download<br />
<br />
== Experiments==<br />
<br />
The author used the CIFAR-10-C, CIFAR-100-C, and ImageNet-C datasets, which are constructed by adding corruption to the original datasets. The CIFAR-10-P, CIFAR-100-P, and ImageNet-P datasets also modify the original CIFAR and ImageNet datasets. These datasets contain smaller perturbations than CIFAR-C and are used to measure the classifier’s prediction stability.<br />
The metric used for comparing the models is the error rate. The clean error is the error rate obtained without applying any corruption to the datasets. In the experiment, the author uses 15 corruption techniques, so the error rate after corruption is taken as the average of the error rates achieved by the specific model across all corruptions.<br />
In order to assess a model’s uncertainty estimates, the author measures its miscalibration, using the Brier Score or the RMS Calibration Error for this purpose.<br />
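As an illustration of one of these calibration measures, the Brier Score can be sketched as follows. This is a minimal sketch assuming the multi-class convention that sums the squared error over classes and averages over examples (other normalisations exist); the function name <code>brier_score</code> is illustrative.<br />

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels.

    probs:  (n_examples, n_classes) array of predicted class probabilities
    labels: (n_examples,) array of integer class labels
    """
    probs = np.asarray(probs, dtype=float)
    one_hot = np.eye(probs.shape[1])[np.asarray(labels)]
    # Sum squared error over classes, then average over examples
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))

confident_and_right = brier_score([[1.0, 0.0], [0.0, 1.0]], [0, 1])
uninformative = brier_score([[0.5, 0.5], [0.5, 0.5]], [0, 1])
```

Lower values indicate predictions whose confidence matches the observed outcomes more closely.<br />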
<br />
<br />
===Results on CIFAR===<br />
<br />
For CIFAR datasets, 15 corruptions have been applied<br />
<br />
Setup: The author compared the following models, along with an All Convolutional Network:<br />
1. A DenseNet-BC (k = 12, d = 100)<br />
2. A 40-2 Wide ResNet<br />
3. A ResNeXt-29<br />
The All Convolutional Network and Wide ResNet train for 100 epochs, while the DenseNet and ResNeXt require 200 epochs for convergence. A weight decay of 0.0001 is used for Mixup and 0.0005 otherwise.<br />
<br />
[[File:CIFAR 1.png|center|1000px|Image: 1000 pixels]]<br />
<br />
The author has further compared AugMix to other state-of-the-art data augmentation algorithms, as shown in the above figure. The AugMix algorithm performs the best, with a 16.6% lower absolute corruption error. This comparison uses only ResNeXt on CIFAR-10-C.<br />
<br />
[[File:CIFAR 2.png|center|1000px|Image: 1000 pixels]]<br />
<br />
===Results on ImageNet Dataset===<br />
<br />
<br />
[[File:imageNet 1.png|center|1000px|Image: 1000 pixels]]<br />
<br />
This shows Clean Error, Corruption Error (CE), and mCE values for various methods on ImageNet-C.<br />
The mCE value is computed by averaging across all 15 CE values. AUGMIX reduces corruption error<br />
while improving clean accuracy, and it can be combined with SIN for greater corruption robustness.<br />
<br />
== Source Code ==<br />
The source code is available at: https://github.com/google-research/augmix<br />
== Conclusion ==<br />
AugMix is a data processing technique that mixes randomly generated augmentations and uses a Jensen-Shannon loss to enforce consistency. The simple-to-implement technique obtains state-of-the-art performance on CIFAR and ImageNet. AugMix seems to enable more reliable models, a necessity for models deployed in safety-critical environments. The above-specified models perform better and are more tolerant of corruption when trained with AugMix.<br />
<br />
<br />
== Critique ==<br />
<br />
Since augmix1 and augmix2 are independent, why did the authors use the JS divergence over the mixture of the three? What would happen if they only used <br />
<math><br />
\frac{1}{2} (KL(p_{orig},p_{augmix1})+KL(p_{orig}, p_{augmix2}))<br />
</math>? In other words, what is the advantage of the JS divergence over a simple KL divergence?<br />
<br />
== Related Work ==<br />
<br />
Recently, a lot of approaches to Mixed Sample Data Augmentation have been proposed, many of which obtain state-of-the-art performance in several classical classification tasks. The contribution of AugMix is to perform MixUp on highly augmented variations of a provided image. By the addition of a trick called Fast AutoAugment the authors of [1] claim they can beat the state-of-the-art (including beating AugMix) in the Fashion-MNIST dataset. What the authors do is apply a binary mask to low frequency images sampled from the Fourier space corresponding to the dataset. In particular, the mask arises from the following low-pass filter. Given a complex Gaussian random matrix <math>Z</math>, and a decay power <math>\delta</math>, we let:<br />
\begin{align}<br />
filter(z, \delta) [i,j] = \frac{z[i,j]}{freq(w,h) [i,j]^\delta}<br />
\end{align}<br />
<br />
If <math>\mathcal{F}^{-1}</math> is the inverse discrete Fourier transform, we obtain a grey-scale image by setting:<br />
<br />
\begin{align}<br />
G = Re (\mathcal{F}^{-1} (filter(Z, \delta )) ) <br />
\end{align}<br />
<br />
Finally, this can be converted to a binary mask with mean <math>\lambda</math> from a grey-scale image <math>g</math> by setting:<br />
<br />
\begin{align}<br />
mask(\lambda , g)[i,j] = \chi_{ top(\lambda w h, g) }(g[i,j])<br />
\end{align}<br />
<br />
where <math>\chi</math> is the indicator function and <math>top(n, g)</math> denotes the set of the <math>n</math> largest entries of <math>g</math>.<br />
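The three steps above (low-pass filtering of complex Gaussian noise, inverse Fourier transform, thresholding) can be sketched with NumPy. This is a simplified illustration and not the FMix authors' implementation: the function name <code>fmix_mask</code>, the default decay power, and the handling of the zero-frequency bin are all assumptions made for this example.<br />

```python
import numpy as np

def fmix_mask(h, w, lam, delta=3.0, rng=None):
    """Sample a binary h-by-w mask whose mean is approximately lam."""
    rng = np.random.default_rng() if rng is None else rng
    # Frequency magnitude of every 2-D Fourier bin
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    # Assumption: replace the zero frequency to avoid division by zero
    freq[0, 0] = 1.0 / max(h, w)
    # Complex Gaussian noise Z, attenuated by freq^delta (low-pass filter)
    z = rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))
    filtered = z / freq ** delta
    # Real part of the inverse DFT gives the grey-scale image
    g = np.real(np.fft.ifft2(filtered))
    # Indicator of the lam*w*h largest pixels of g
    n_top = int(round(lam * h * w))
    mask = np.zeros(h * w)
    mask[np.argsort(g.ravel())[::-1][:n_top]] = 1.0
    return mask.reshape(h, w)

mask = fmix_mask(32, 32, lam=0.25, rng=np.random.default_rng(0))
```

A larger decay power suppresses high frequencies more strongly, so the resulting mask tends to consist of fewer, larger connected regions.<br />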
<br />
== Bibliography == <br />
<br />
[1] Harris, E., Marcu, A., Painter, M., Niranjan, M., & Hare, A. P. B. J. (2020). Fmix: Enhancing mixed sample data augmentation. arXiv preprint arXiv:2002.12047, 3.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning&diff=49554Adversarial Fisher Vectors for Unsupervised Representation Learning2020-12-06T21:50:54Z<p>Jlavilez: /* Methodology */</p>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. According to the original GAN paper, when training is finished and the Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting the discriminator is nothing more than a tool to train the generator. Furthermore, the generator in a traditional GAN models the data density implicitly, while some applications need an explicit generative model of the data. Recently, it has been shown that training an energy-based model (EBM) with a parameterised variational distribution is also a minimax game similar to the one in GANs. Although the two are similar, an advantage of this EBM view is that, unlike in the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Building on these observations, the authors show that an energy-based model can be trained using a minimax formulation similar to that of GANs. After training the energy-based model, they use the Fisher Score and Fisher Information (which are calculated from the derivative of the generative model w.r.t. its parameters) to evaluate the power of the discriminator in modeling the data distribution. More precisely, they calculate normalised Fisher Vectors and a Fisher Distance measure using the discriminator's derivative to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). The Fisher Vector is a powerful representation that can be calculated from EBMs because, in this model, the discriminator itself is an explicit density model of the data. Fisher Vectors can also be used for set representation problems, which are challenging: as we will see, the Fisher kernel can be used to calculate the distance between two sets of images, which is not a trivial task. The authors find several applications and attractive characteristics for AFVs as pre-trained features, such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimisation problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
where <math> p_{data(\mathbf{x})} </math>, <math> D(\mathbf{x}) </math>, and <math> G(\mathbf{z}) </math> are the data distribution, the discriminator, and the generator, respectively. To optimise the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math> is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(\mathbf{x}) </math> is an auxiliary distribution called the variational distribution and <math>H(q) </math> denotes its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL. This bound is tight if <math> q(\mathbf{x}) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(\mathbf{x}) = p_{E}(\mathbf{x}) </math>. If we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and <math> q(\mathbf{x})= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns into the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[D(G(\mathbf{z}))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
In this problem, the variational lower bound is maximised w.r.t. <math> p_{G}</math>; the energy model is then updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure 1). [[File:Fig1.png|centre]]<br />
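The variational bound in Eq.\ref{2} can be checked numerically on a small discrete example. The sketch below is illustrative only: the energies, the data distribution, and the function names <code>nll</code> and <code>variational_bound</code> are made up for this check, with the integral replaced by a sum over four states.<br />

```python
import numpy as np

def nll(energy, p_data):
    """Exact negative log likelihood: E_data[E(x)] + log sum_x exp(-E(x))."""
    log_partition = np.log(np.sum(np.exp(-energy)))
    return float(np.sum(p_data * energy) + log_partition)

def variational_bound(energy, p_data, q):
    """Lower bound: E_data[E(x)] - E_q[E(x)] + H(q)."""
    entropy = -np.sum(q * np.log(q))
    return float(np.sum(p_data * energy) - np.sum(q * energy) + entropy)

# Toy values over four states (assumptions for illustration)
energy = np.array([0.5, 1.0, 2.0, 0.1])
p_data = np.array([0.4, 0.3, 0.1, 0.2])

q_uniform = np.full(4, 0.25)
p_model = np.exp(-energy) / np.sum(np.exp(-energy))

exact = nll(energy, p_data)
loose = variational_bound(energy, p_data, q_uniform)
tight = variational_bound(energy, p_data, p_model)
```

With a uniform <math> q </math> the bound lies strictly below the exact NLL, and it coincides with the NLL once <math> q </math> equals the model density, which is the tightness condition stated above.<br />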
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both take the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularisation term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularisation methods such as batch normalisation are used).<br />
* The order of optimising <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As mentioned, one of the most important advantages of an EBM GAN compared with traditional ones is that the discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way to evaluate the quality of the generator: inspect the quality of the produced samples. However, it is not clear how to evaluate or use a discriminator trained in the minimax scheme. To evaluate and also employ the discriminator of the GAN, the authors propose to use the theory of Fisher Information. This theory was proposed with the motivation of making connections between the two different types of models used in machine learning, i.e., generative and discriminative models. Given a density model <math> p_{\theta}(\mathbf{x})</math> where <math> \mathbf{x} \in R^d </math> is the input and <math> \theta </math> are the model parameters, the Fisher Score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, the gradient <math> U_\mathbf{x} </math> defines the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, the Fisher Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim p_{\theta}(\mathbf{x})} [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having the Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math> and <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called the Fisher Distance and can easily be generalised to measure the distance between two sets. Finally, the Adversarial Fisher Vector (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, the Fisher Distance is equivalent to the Euclidean distance between AFVs. Fisher Vector theory has traditionally been applied with simple generative models such as GMMs.<br />
In the domain of EBMs, where the density model is parameterised as <math> p_\theta(\mathbf{x})= \frac{e^{D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are the parameters of <math> D</math>, the Fisher Score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN, the generator is updated during training to match the distribution <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Score and Fisher Information in an EBM GAN as follows:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using Fisher Score, we transform data from feature space to the parameter space which means that the dimensionality of the vectors can easily be up to millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating final AFV.<br />
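Equations \ref{5} and \ref{6} can be sketched with NumPy once the parameter gradients of the discriminator are available. This is a minimal sketch, not the authors' code: it assumes the gradients <math> \nabla_{\theta} D </math> have already been computed and flattened into row vectors, and the function name <code>adversarial_fisher_vectors</code> and the stabilising constant <code>eps</code> are assumptions.<br />

```python
import numpy as np

def adversarial_fisher_vectors(grad_x, grad_gen, eps=1e-8):
    """Approximate AFVs from flattened discriminator gradients.

    grad_x:   (n, p) gradients of D w.r.t. theta at the examples to embed
    grad_gen: (m, p) gradients of D w.r.t. theta at generated samples G(z)
    """
    # Monte Carlo estimate of the expectation term in the Fisher Score
    mean_gen = grad_gen.mean(axis=0)
    u_x = grad_x - mean_gen
    u_gen = grad_gen - mean_gen
    # Diagonal of the Fisher Information, estimated on generated samples
    diag_fisher = np.mean(u_gen ** 2, axis=0)
    # Whitening by diag(I)^(-1/2)
    return u_x / np.sqrt(diag_fisher + eps)

# Random stand-ins for real gradients (assumption for illustration)
rng = np.random.default_rng(0)
grad_gen = rng.standard_normal((500, 6)) * np.arange(1.0, 7.0)
grad_x = rng.standard_normal((10, 6))
afv = adversarial_fisher_vectors(grad_x, grad_gen)
```

Because only the diagonal of <math> I </math> is stored, memory grows linearly with the number of parameters rather than quadratically.<br />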
<br />
=== On the Fisher metric ===<br />
<br />
The Fisher metric between two points <math>x,y \in \mathbb{R}^d</math> is defined as <math> d(x,y) = (U_x - U_y)^T \mathcal{I}^{-1} (U_x - U_y) </math>; being a quadratic form, it is elementary to see that this distance is equivalent to the usual norm in <math>\mathbb{R}^d</math>.<br />
A notion of a Fisher distance between finite subsets <math>X,Y \subset \mathbb{R}^d</math> is: <br />
\begin{align}<br />
dist(X,Y) = \left( \frac{1}{|X|} \sum_{x \in X} U_x - \frac{1}{|Y|} \sum_{y \in Y} U_y \right)^T \mathcal{I}^{-1} \left( \frac{1}{|X|} \sum_{x \in X} U_x - \frac{1}{|Y|} \sum_{y \in Y} U_y \right)<br />
\end{align}<br />
<br />
Note, however, that this is not a metric. An alternative to extending the Fisher distance to compact subsets of <math>\mathbb{R}^d</math> is to use the Hausdorff metric from elementary real analysis. A discussion of the Hausdorff metric can be found on page 5 of Ken Davidson's Real Analysis notes: http://www.math.uwaterloo.ca/~krdavids/PM351/PMath351Notes.pdf. For completeness, given two compact subsets <math>K, L \subset \mathbb{R}^d</math> the Hausdorff metric induced by the Fisher distance is given by:<br />
\begin{align}<br />
d_H(K,L) = \max \left\lbrace \sup_{a \in K} d(a, L), \sup_{b \in L} d(b,K) \right\rbrace<br />
\end{align}<br />
<br />
It is readily seen that this is a complete metric on the metric space of compact subsets of <math>\mathbb{R}^d</math>. The analysis performed in the paper might be able to be extended to the choice of this metric.<br />
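The two set distances discussed in this section can be compared on finite sets. The sketch below assumes the inputs are already whitened vectors (such as AFVs), so that the quadratic form with <math>\mathcal{I}^{-1}</math> reduces to a squared Euclidean distance; all function names are illustrative.<br />

```python
import numpy as np

def pairwise_fisher_distance(v_a, v_b):
    """Matrix of point distances d(a, b) between rows of v_a and v_b."""
    diff = v_a[:, None, :] - v_b[None, :, :]
    return np.sum(diff ** 2, axis=2)

def set_fisher_distance(v_a, v_b):
    """Distance between the mean vectors of the two sets."""
    diff = v_a.mean(axis=0) - v_b.mean(axis=0)
    return float(diff @ diff)

def hausdorff_distance(v_a, v_b):
    """Hausdorff distance induced by the point distance above."""
    d = pairwise_fisher_distance(v_a, v_b)
    # Farthest any point of one set is from its nearest point in the other
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

set_a = np.array([[0.0, 0.0], [2.0, 0.0]])
set_b = np.array([[1.0, 0.0]])
mean_based = set_fisher_distance(set_a, set_b)
hausdorff = hausdorff_distance(set_a, set_b)
```

Here the two sets differ but share the same mean, so the mean-based distance is zero while the Hausdorff distance is positive, which illustrates why the former cannot separate distinct sets.<br />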
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they can be used to sample from an unnormalised density and approximate the partition function. Stochastic gradient MCMC is of particular interest, as it uses the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients). This technique has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
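The kind of update that G is meant to mimic can be illustrated with a generic Langevin dynamics step. This is a sketch of the general sampling rule only, not the paper's generator update: the toy density (a unit Gaussian centred at <code>mu</code>), the step size, and the function name <code>langevin_step</code> are assumptions.<br />

```python
import numpy as np

def langevin_step(x, grad_log_p, step_size, rng):
    """One noisy gradient-ascent step on the log density."""
    noise = rng.standard_normal(x.shape)
    return x + 0.5 * step_size * grad_log_p(x) + np.sqrt(step_size) * noise

# Toy unnormalised log density: log p(x) = -||x - mu||^2 / 2 + const
mu = np.array([2.0, -1.0])

def grad_log_p(x):
    return -(x - mu)

rng = np.random.default_rng(0)
# Start all chains near the origin, far from the mode
samples = 0.1 * rng.standard_normal((2000, 2))
for _ in range(500):
    samples = langevin_step(samples, grad_log_p, 0.05, rng)
```

Only the gradient of the log density is needed, so the normalising constant never has to be computed, which is what makes this family of samplers attractive for EBMs.<br />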
<br />
== Related Work ==<br />
There are many variants of the GAN method that use a discriminator as a critic to differentiate given distributions. Examples of such variants are Wasserstein GAN, f-GAN and MMD-GAN. There is a resemblance between the training procedure of GANs and deep EBMs (with variational inference), but the work presented in the paper is different, as its discriminator directly learns the target distribution. The implementation of EBM presented in the paper directly learns the parametrised sampler. In some works, regularisation (by noise addition, penalising gradients, spectral normalisation) has been introduced to make GANs more stable, but these additions do not have any formal justification. This paper connects the MCMC-based G update rule with the gradient penalty line of work. The following equation shows how this method does not always sample from the generator: a small proportion (with probability p) of the samples come from real examples.<br />
<br />
<div align="center">[[File:related_work_equations.png]]</div><br />
<br />
Early works showed incorporation of Fisher Information to measure similarity and this was extended to use Fisher Vector representations in case of images. Recently, Fisher Information has been used for meta learning as well. This paper explores the possibility of using Fisher Information in deep learning generative models. By using the generator as a sampler, Fisher Information can be computed even from an unnormalised density model.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As pointed out earlier, the main advantage of EBM GANs is their powerful discriminator, which can learn a density function that characterises the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors proposed to use Fisher Information theory. To do this, they trained several model variants, used the discriminator to extract AFVs, and then used these vectors in an unsupervised pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-the-art performance in unsupervised pretraining classification tasks and are comparable with supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure the distance between sets of data points. The authors took advantage of this and calculated the semantic distance between classes (using all data points of each class) in CIFAR-10. As shown in Figure 2, although the training was unsupervised, the semantic relations between classes are well estimated. For example, in Figure 2 cars are similar to trucks, and dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
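<br />
The idea of a set-level distance can be illustrated with a toy analogue. The sketch below assumes a one-dimensional Gaussian in place of the EBM, a diagonal Fisher information, and a Euclidean distance between set-averaged Fisher vectors; the function names and the "class" data are hypothetical and do not come from the paper.<br />

```python
import math


def fisher_vector(x, mu=0.0, sigma=1.0):
    """Fisher vector of x under N(mu, sigma^2) with parameters (mu, sigma).

    The score (gradient of the log density w.r.t. the parameters) is
    rescaled by the inverse square root of the diagonal Fisher information.
    """
    score_mu = (x - mu) / sigma ** 2
    score_sigma = ((x - mu) ** 2 - sigma ** 2) / sigma ** 3
    info_mu = 1.0 / sigma ** 2      # Fisher information for mu
    info_sigma = 2.0 / sigma ** 2   # Fisher information for sigma
    return (score_mu / math.sqrt(info_mu), score_sigma / math.sqrt(info_sigma))


def set_fisher_vector(points, mu=0.0, sigma=1.0):
    """Average the Fisher vectors of all points in a set."""
    vectors = [fisher_vector(x, mu, sigma) for x in points]
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(2))


def fisher_distance(set_a, set_b, mu=0.0, sigma=1.0):
    """Euclidean distance between two set-averaged Fisher vectors."""
    a = set_fisher_vector(set_a, mu, sigma)
    b = set_fisher_vector(set_b, mu, sigma)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical one-dimensional "classes": two nearby sets and one distant set.
cars = [1.0, 1.2, 0.8]
trucks = [1.1, 1.3, 0.9]
cats = [-1.0, -1.2, -0.8]
print(fisher_distance(cars, trucks), fisher_distance(cars, cats))
```

In this toy setting the two nearby sets end up closer to each other than to the distant one, mirroring the kind of class-level structure reported in Figure 2.<br />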
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, the authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They used the AFV representation to calculate distances between image patches and compared it with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on the 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculated AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. The dimensionality of AFVs is three orders of magnitude higher than that of the existing methods, which would typically bring a higher propensity to overfitting. However, AFVs still show great generalisation ability, demonstrating that they are indeed encoding a meaningful low-dimensional subspace of the original data. Figure 6 shows the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet and thus have limited applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between the sets of real and generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over the set {1000; 5000; 25000; 50000}. Figure 3 shows batch-wise estimates of the Inception Score and the "Fisher Similarity". For larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to that of the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
<br />
<br />
===Interpreting G update as parameterised MCMC===<br />
AFVs can only be applied if the generator approximates the EBM during the training process. The model is trained on ImageNet at 64×64 resolution, with the default architecture modified by adding residual blocks to the discriminator and generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. The authors also showed that in an EBM GAN, the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with a low error rate. This differs from regular GANs, where the discriminator is reduced to a constant function once the Nash equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that this monitoring can be used to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art results among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity, suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantisation.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the defined formulas and of why this representation performs well in classification tasks. An "Analysis" section would therefore make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=49535Time-series Generative Adversarial Networks2020-12-06T21:26:08Z<p>Jlavilez: /* References */</p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis generally focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting, but it is not very promising in a generative setup. The GAN approach, when applied directly to time-series, simply tries to learn <math>p(X|t)</math> using a generator and discriminator setup, but this fails to leverage the prior probabilities as autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive models), allowing a generative model to preserve temporal dynamics while learning the overall distribution. This mechanism is termed '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (conditioned on the ground truth) and open-loop inference (conditioned on the previous guess), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this, including Scheduled Sampling <sup>[1]</sup>, based on curriculum learning <sup>[2]</sup>, where the models are trained to generate output conditioned on a combination of ground truth and previous outputs. Another method, inspired by adversarial domain adaptation, trains an auxiliary discriminator that helps separate free-running and teacher-forced hidden states, accelerating convergence <sup>[3][4]</sup>. Approaches based on actor-critic methods <sup>[5]</sup> have also been proposed; these condition on target outputs and estimate the next-token value, which nudges the actor’s free-running predictions <sup>[11]</sup>. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
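<br />
The gap between closed-loop and open-loop conditioning can be made concrete with a small sketch. It assumes a scalar one-step "model" and a fixed mixing probability; the function name <code>rollout</code> and the parameter <code>teacher_prob</code> are hypothetical, and a real Scheduled Sampling setup would anneal this probability during training.<br />

```python
import random


def rollout(model, truth, teacher_prob, seed=0):
    """Predict each next value, choosing what the model is conditioned on.

    teacher_prob = 1.0 is pure teacher forcing (always ground truth);
    teacher_prob = 0.0 is pure free running (always the model's own output).
    Values in between mix the two, as in scheduled sampling.
    """
    rng = random.Random(seed)
    predictions = []
    previous = truth[0]
    for t in range(1, len(truth)):
        prediction = model(previous)
        predictions.append(prediction)
        # Decide what the next step is conditioned on.
        use_truth = rng.random() < teacher_prob
        previous = truth[t] if use_truth else prediction
    return predictions


def shrink(x):
    """A deliberately imperfect model: it underestimates by 10 percent."""
    return 0.9 * x


truth = [1.0, 1.0, 1.0, 1.0]
forced = rollout(shrink, truth, teacher_prob=1.0)
free = rollout(shrink, truth, teacher_prob=0.0)
print(forced, free)
```

With teacher forcing the error stays constant at every step, while in free-running mode the same model error compounds, which is the discrepancy the methods above try to reduce.<br />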
<br />
Direct applications of the GAN architecture to time-series data, such as C-RNN-GAN or RCGAN <sup>[6]</sup>, try to generate the time-series data recurrently, sometimes taking the generated output from the previous step as input (as in the case of C-RNN-GAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn, which is not sufficient to capture the temporal dynamics.<br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain constant over the entire time-series, or for a long period of time) and temporal features (variables that change with respect to time). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, inputs to the model can be thought of as tuples <math>(S, X_{1:T})</math> that have a joint distribution <math>p</math>. The objective of a generative model is to learn from training data an approximation <math>\hat{p}(S, X)</math> of the original distribution <math>p(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T}) \,\|\, \hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(X_t | S, X_{1:t-1}) \,\|\, \hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
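<br />
As a worked illustration of how the two objectives relate, suppose <math>D</math> is taken to be the Kullback-Leibler divergence (an assumption made here for illustration; the summary above leaves <math>D</math> unspecified). The chain rule for the KL divergence, applied to the autoregressive decomposition, then splits the global objective into a static term plus one conditional term per time step:<br />
<br />
<div align="center"><math>D_{KL}\left(p(S, X_{1:T}) \,\|\, \hat{p}(S, X_{1:T})\right) = D_{KL}\left(p(S) \,\|\, \hat{p}(S)\right) + \sum_{t=1}^{T} \mathbb{E}_{p(S, X_{1:t-1})}\left[ D_{KL}\left(p(X_t | S, X_{1:t-1}) \,\|\, \hat{p}(X_t | S, X_{1:t-1})\right) \right]</math></div><br />
<br />
Under this assumption, the stepwise objective is the per-time-step summand of the global one, which suggests why matching the conditionals at every step supports matching the joint distribution.<br />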
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how the information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of the <math>S</math> and <math>X</math> features of the original space. The embedding function then has the following form:<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form:<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions have been implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These are implementation choices, and the functions can be parametrized using any architecture.<br />
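<br />
The division of labour between a recurrent embedding and a feedforward recovery can be sketched with scalars. This is a toy illustration only: the weights are fixed, untrained and hypothetical, the tanh recurrence is an assumed stand-in for the recurrent network, and the exact functional forms are those given in the formula images above, not this code.<br />

```python
import math


def embed(s, xs, w_s=0.5, w_hs=0.4, w_h=0.3, w_x=0.8):
    """Toy recurrent embedding: each latent step sees the static code,
    the previous latent state and the current temporal feature."""
    h_s = math.tanh(w_s * s)
    h_prev, h_seq = 0.0, []
    for x in xs:
        h_prev = math.tanh(w_hs * h_s + w_h * h_prev + w_x * x)
        h_seq.append(h_prev)
    return h_s, h_seq


def recover(h_s, h_seq, v_s=1.5, v_x=1.2):
    """Toy feedforward recovery: each latent code is mapped back
    independently, with no recurrence."""
    return v_s * h_s, [v_x * h for h in h_seq]


h_s, h_seq = embed(0.5, [1.0, 2.0, 3.0])
s_tilde, x_tilde = recover(h_s, h_seq)
print(h_seq, x_tilde)
```

Because the weights here are untrained, the recovered values do not match the inputs; in TimeGAN the two functions are trained so that the round trip is as accurate as possible.<br />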
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. These do not operate on the original space; rather, the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into the latent representations <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN receives not only the noise vector Z as input but also, in autoregressive fashion, its previous outputs, i.e. <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients are used to decrease the likelihood of correct classification at the generator and to increase it at the discriminator, so that the discriminator correctly classifies the produced synthetic output. This is the second objective function, the unsupervised loss:<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary feedback from the GAN's adversarial component, i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previous synthetically produced embeddings. Maximizing the likelihood of this produces the third objective, i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
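<br />
The stepwise supervision described above can be sketched as follows. This is a minimal illustration, assuming scalar latent codes and an absolute-error distance; <code>generator_step</code> is a hypothetical stand-in for the generator's one-step map, and the exact loss is the one shown in the formula image above.<br />

```python
def supervised_loss(h_s, h_real, generator_step, noise):
    """Sum of one-step-ahead errors in latent space.

    At every step the generator is fed the embedding of the actual data
    (h_real[t - 1]) rather than its own previous output, and its prediction
    is compared with the next real embedding h_real[t].
    """
    total = 0.0
    for t in range(1, len(h_real)):
        predicted = generator_step(h_s, h_real[t - 1], noise[t])
        total += abs(h_real[t] - predicted)
    return total


# Hypothetical real embeddings that halve at every step.
h_real = [1.0, 0.5, 0.25]
noise = [0.0, 0.0, 0.0]


def halving_step(h_s, h_prev, z):
    """A generator step that has learned the true transition."""
    return 0.5 * h_prev


def identity_step(h_s, h_prev, z):
    """A generator step that ignores the dynamics."""
    return h_prev


print(supervised_loss(0.0, h_real, halving_step, noise))
print(supervised_loss(0.0, h_real, identity_step, noise))
```

A generator step that captures the true transition incurs no supervised loss, whereas one that ignores the dynamics is penalised at every step, regardless of what the discriminator says.<br />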
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the supervised loss and the recovery loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following optimization problem for these two components:<br />
<div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda \geq 0</math> is used to balance the two losses. <br />
The other components of TimeGAN, i.e. the generator and discriminator, are trained to minimize the supervised loss along with the unsupervised loss. This optimization problem is formulated as below:<br />
<div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta \geq 0 </math> is used to balance the two losses.<br />
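<br />
How the two weights enter can be sketched in a few lines. This assumes that each weight multiplies the supervised loss in its respective objective, and it omits the adversarial role of the discriminator on the unsupervised term; the function names are hypothetical and the authoritative forms are the formula images above.<br />

```python
def embedder_objective(loss_supervised, loss_recovery, lam=1.0):
    """Objective for the embedding and recovery parameters.

    lam trades off the supervised loss against the recovery loss.
    """
    return lam * loss_supervised + loss_recovery


def generator_objective(loss_supervised, loss_unsupervised, eta=1.0):
    """Objective for the generator parameters.

    eta trades off the supervised loss against the unsupervised
    (adversarial) loss.
    """
    return eta * loss_supervised + loss_unsupervised


print(embedder_objective(2.0, 3.0, lam=0.5))
print(generator_objective(2.0, 1.0, eta=10.0))
```

Setting a weight to zero removes the supervised term from that objective, which is one way to see what each component would learn without the stepwise supervision.<br />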
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series, i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (samples should be distributed so as to cover the real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should be as useful as real data when used for the same predictive purposes). <br />
<br />
The following methods are used for benchmarking and evaluation:<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
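<br />
The two quantitative scores in the list above can be sketched as simple functions. This assumes that the discriminative score is the distance of the post-hoc classifier's accuracy from chance (0.5) and that the predictive score is a mean absolute error; the summary does not spell out the exact formulas, so both definitions and the function names are assumptions.<br />

```python
def discriminative_score(true_labels, predicted_labels):
    """Distance of the real-vs-synthetic classifier's accuracy from chance.

    0.0 means the classifier cannot tell the two sets apart (best case);
    0.5 means it separates them perfectly (worst case).
    """
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    return abs(accuracy - 0.5)


def predictive_score(real_values, forecasts):
    """Mean absolute error of a forecaster trained on synthetic data
    and evaluated on real data (lower is better)."""
    errors = [abs(r - f) for r, f in zip(real_values, forecasts)]
    return sum(errors) / len(errors)


# Labels: 1 = real sequence, 0 = synthetic sequence.
print(discriminative_score([1, 0, 1, 0], [1, 0, 0, 1]))
print(predictive_score([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```

Under these definitions, lower is better for both scores, which matches how the results tables below are read.<br />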
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian model defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math> and <math>\mathbf{1}</math> denotes the all-ones matrix. Table 1 shows the results of this experiment for the different models. The results show that TimeGAN outperforms the other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
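<br />
Data of this form can be simulated with the standard library alone. The sketch below assumes <math>\sigma \in [0, 1]</math>, so that noise with the stated covariance can be built from one shared and one independent Gaussian component; the function names, sequence length and parameter values are hypothetical and not taken from the paper.<br />

```python
import math
import random


def correlated_noise(dim, sigma, rng):
    """Draw n ~ N(0, sigma * ones + (1 - sigma) * I) for 0 <= sigma <= 1.

    A shared component gives every pair of features covariance sigma;
    the independent component tops each variance up to 1.
    """
    shared = rng.gauss(0.0, 1.0)
    return [
        math.sqrt(sigma) * shared + math.sqrt(1.0 - sigma) * rng.gauss(0.0, 1.0)
        for _ in range(dim)
    ]


def ar_gaussian_sequence(length, dim, phi, sigma, seed=0):
    """Simulate x_t = phi * x_{t-1} + n, starting from x_0 = 0."""
    rng = random.Random(seed)
    x = [0.0] * dim
    sequence = []
    for _ in range(length):
        n = correlated_noise(dim, sigma, rng)
        x = [phi * x_i + n_i for x_i, n_i in zip(x, n)]
        sequence.append(x)
    return sequence


sequence = ar_gaussian_sequence(length=24, dim=5, phi=0.8, sigma=0.8)
print(sequence[0])
```

In this construction <math>\phi</math> sets how strongly consecutive time steps are tied together, while <math>\sigma</math> sets how strongly the features move together within a step.<br />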
<br />
Next, the paper experiments on different types of time-series data. Using time-series sequences of varying properties, the paper evaluates TimeGAN to test its ability to generalize across time-series data. The datasets used are Sines, Stocks, Energy, and Events, and the different methods are compared on each of them. <br />
<br />
===Sines===<br />
They simulated multivariate sinusoidal sequences of different frequencies η and phases θ, providing continuous-valued, periodic, multivariate data where each feature is independent of others.<br />
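<br />
A dataset of this kind can be simulated as follows. The sketch assumes each feature is <math>\sin(2\pi\eta t + \theta)</math> with <math>\eta</math> drawn uniformly from [0, 1] and <math>\theta</math> uniformly from [-π, π]; these sampling ranges, the sequence length and the function name are assumptions for illustration rather than details given in the summary above.<br />

```python
import math
import random


def sine_sequence(length, n_features, seed=0):
    """Return one multivariate sequence of independent sinusoids.

    Every feature gets its own frequency and phase, so the features
    are periodic but independent of each other.
    """
    rng = random.Random(seed)
    params = [
        (rng.uniform(0.0, 1.0), rng.uniform(-math.pi, math.pi))
        for _ in range(n_features)
    ]
    return [
        [math.sin(2.0 * math.pi * freq * t + phase) for freq, phase in params]
        for t in range(length)
    ]


sines = sine_sequence(length=24, n_features=5)
print(sines[0])
```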
<br />
===Stocks===<br />
By contrast, sequences of stock prices are continuous-valued but aperiodic; furthermore, features are correlated with each other. They use the daily historical Google stocks data from 2004 to 2019, including as features the volume and high, low, opening, closing, and adjusted closing prices.<br />
<br />
===Energy===<br />
They consider a dataset characterized by noisy periodicity, higher dimensionality, and correlated features. The UCI Appliances energy prediction dataset consists of multivariate, continuous-valued measurements including numerous temporal features measured at close intervals.<br />
<br />
===Events===<br />
Finally, they considered a dataset characterized by discrete values and irregular time stamps. They used a large private lung cancer pathways dataset consisting of sequences of events and their times, and modelled both the one-hot encoded sequence of event types and the event timings.<br />
<br />
Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for different methods across different datasets. TimeGAN outperforms the other methods in both scores, indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce realistic time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is truly a novel one. The idea of enhancing the unsupervised components of a GAN with some supervised element has shown significant jumps in certain evaluations. I think the methods of evaluation used in this paper, namely t-SNE/PCA analysis (visualization), discriminative score, and predictive score, are very appropriate for this sort of analysis, where the focus is on multiple things (generative accuracy and conditional dependence) both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea of the synthesized time-series being useful in terms of its predictive ability is good, especially in practice. But I think when the authors set out to create a model that can learn the temporal dynamics between time-series data then there could have been some additional metric that could better evaluate if the underlying temporal relations have been captured by the model or not. I feel the addition of some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of traditional GAN by simply adding an extra loss function to the mix is quite elegant. TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of possibilities in this area as many direct and indirect variations of TimeGAN or other architectures inspired by TimeGAN can be developed in a very straightforward manner of hyper-parameterizing the building blocks. <br />
<br />
TimeGAN benefits from merging supervised and unsupervised learning to create its generations, while other methods in the literature benefit from learning their conditional input to create theirs. I believe that, even after considering the supervised and unsupervised learning, the way the authors introduced temporal embeddings to assist network training is not well suited to anomaly detection (outlier detection), as it is designed only for time-series representation learning, as discussed in <sup>[10]</sup>.<br />
<br />
The paper certainly proposes a novel approach to analysing time series data, but there are concerns about the way the model is tested in practice. First, if the data is generated from a <math>VAR(1)</math> model, why did the authors not use a multi-dimensional auto-ARIMA procedure, or a Box-Jenkins approach, to fit a model to their synthetic dataset? Moreover, as has been studied in the M4 competitions (see e.g. https://www.sciencedirect.com/science/article/pii/S0169207019301128), the ability of complex ML models or deep learning models to beat linear models in general is questionable. The theoretical reason for this empirical finding is that the Wold decomposition theorem says that a stationary process can be decomposed into the sum of a deterministic process and a linear process, which gives a lot of credence to the ARIMA model. It would be highly beneficial if the authors included the Box-Jenkins benchmark in their experiments, as well as testing their model against real data to see if it actually performs well.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, Baoren Xiao. Conditional Sig-Wasserstein GANs for Time Series Generation, 2020<br />
<br />
[10] Geiger, Alexander et al. “TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks.” ArXiv abs/2009.07769 (2020): n. pag.<br />
<br />
[11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.<br />
<br />
[12] Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: 100,000 time series and 61 forecasting methods." International Journal of Forecasting 36.1 (2020): 54-74.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=49534Time-series Generative Adversarial Networks2020-12-06T21:25:41Z<p>Jlavilez: /* Critique */</p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis generally focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting, but it is not very promising in a generative setup. The GAN approach, when applied directly to time-series, simply tries to learn <math>p(X|t)</math> using a generator and discriminator setup, but this fails to leverage the prior probabilities as autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive models), allowing a generative model to preserve temporal dynamics while learning the overall distribution. This mechanism is termed '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (conditioned on the ground truth) and open-loop inference (conditioned on the previous guess), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this, including Scheduled Sampling <sup>[1]</sup>, based on curriculum learning <sup>[2]</sup>, where the models are trained to generate output conditioned on a combination of ground truth and previous outputs. Another method, inspired by adversarial domain adaptation, trains an auxiliary discriminator that helps separate free-running and teacher-forced hidden states, accelerating convergence <sup>[3][4]</sup>. Approaches based on actor-critic methods <sup>[5]</sup> have also been proposed; these condition on target outputs and estimate the next-token value, which nudges the actor’s free-running predictions <sup>[11]</sup>. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct applications of the GAN architecture to time-series data, such as C-RNN-GAN or RCGAN <sup>[6]</sup>, try to generate the time-series data recurrently, sometimes taking the generated output from the previous step as input (as in the case of C-RNN-GAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn, which is not sufficient to capture the temporal dynamics.<br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain constant over the entire time-series, or for a long period of time) and temporal features (variables that change with respect to time). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, inputs to the model can be thought of as tuples <math>(S, X_{1:T})</math> with a joint distribution <math>p</math>. The objective of a generative model is to learn from training data an approximation <math>\hat{p}(S, X_{1:T})</math> of the original distribution <math>p(S, X_{1:T})</math>. Along with this joint distribution, a second objective is to simultaneously learn the conditionals in the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions, where <math>D</math> denotes a suitable measure of distance between distributions.<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T}) \,||\, \hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(X_t | S, X_{1:t-1}) \,||\, \hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how the information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the feature space and a latent space of lower dimensionality. Let <math>H_s</math> and <math>H_x</math> denote the latent spaces corresponding to the feature spaces of <math>S</math> and <math>X</math>. The embedding function then has the following form:<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form:<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, '''e''' is implemented using a recurrent network and '''r''' using a feedforward network. These are implementation choices; both functions can be parametrized using other architectures.<br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator in TimeGAN receives not only the noise vector Z as input but also, in autoregressive fashion, its own previous outputs <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients are used to decrease the likelihood of a correct classification of the synthetic output at the generator and to increase it at the discriminator. This gives the second objective function, the unsupervised loss:<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary feedback from the GAN's adversarial component, i.e. the discriminator. It also incorporates a supervised loss computed through the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previously produced synthetic embeddings. Maximizing the likelihood of the next latent vector under this input produces the third objective, the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the supervised loss and the reconstruction loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following optimization problem for these two components:<br />
<div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda \geq 0</math> is a hyperparameter that balances the two losses. <br />
The other components of TimeGAN, i.e. the generator and discriminator, are trained adversarially on the unsupervised loss, with the generator additionally minimizing the supervised loss. This optimization problem is formulated as below:<br />
<div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta \geq 0 </math> is a hyperparameter that balances the two losses.<br />
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most closely related variations of traditional GANs applied to time-series, i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. The generated data is examined in terms of diversity (samples should be distributed to cover the real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should be as useful as real data for the same predictive purposes). <br />
<br />
The following methods are used for benchmarking and evaluation:<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
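<br />
As a rough sketch of how the discriminative score could be computed once the post-hoc classifier has produced predictions on a held-out set: the convention assumed here is to report <math>|0.5 - \text{accuracy}|</math>, so that 0 means the classifier cannot tell real from synthetic sequences. The function name and the label encoding (1 for real, 0 for synthetic) are illustrative assumptions, not taken from the paper.<br />

```python
def discriminative_score(true_labels, predicted_labels):
    """Return |0.5 - accuracy| for a real-vs-synthetic classifier.

    Assumed convention: labels are 1 for real and 0 for synthetic sequences,
    and both lists describe the same held-out examples in the same order.
    """
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    # 0.0 means chance-level accuracy; 0.5 means perfect separation.
    return abs(0.5 - accuracy)
```

Under this convention a lower score is better, since it means the synthetic sequences are harder to distinguish from the real ones.<br />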
<br />
In the first experiment, the authors used time-series sequences drawn from an autoregressive multivariate Gaussian model defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math> and <math>\mathbf{1}</math> denotes the all-ones matrix, so that <math>\phi</math> controls the correlation across time steps and <math>\sigma</math> the correlation across features. Table 1 reports the results of this experiment for the different models. TimeGAN outperforms the other methods in terms of both discriminative and predictive scores. <br />
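<br />
The data-generating process for this first experiment can be sketched as follows. This is a minimal illustration, not the authors' code: it reads the noise covariance as <math>\sigma</math> times the all-ones matrix plus <math>(1-\sigma)</math> times the identity, and the function name, the choice to draw <math>x_0</math> from the noise distribution, and the parameter values are all assumptions made for the example.<br />

```python
import numpy as np

def sample_ar_gaussian(n_steps, dim, phi, sigma, seed=0):
    """Draw one sequence x_t = phi * x_{t-1} + n_t with correlated Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Noise covariance: 1 on the diagonal, sigma everywhere else.
    cov = sigma * np.ones((dim, dim)) + (1.0 - sigma) * np.eye(dim)
    x = np.zeros((n_steps, dim))
    # Assumption: the first step is drawn from the noise distribution itself.
    x[0] = rng.multivariate_normal(np.zeros(dim), cov)
    for t in range(1, n_steps):
        noise = rng.multivariate_normal(np.zeros(dim), cov)
        x[t] = phi * x[t - 1] + noise
    return x

series = sample_ar_gaussian(n_steps=24, dim=5, phi=0.8, sigma=0.8)
```

Larger <math>\phi</math> makes consecutive steps more strongly correlated, while larger <math>\sigma</math> makes the features move together.<br />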
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper experiments on different types of time-series data. Using time-series sequences of varying properties, the paper evaluates TimeGAN to test its ability to generalize across time-series data. The datasets used are Sines, Stocks, Energy, and Events. <br />
<br />
===Sines===<br />
They simulated multivariate sinusoidal sequences of different frequencies η and phases θ, providing continuous-valued, periodic, multivariate data where each feature is independent of others.<br />
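<br />
A minimal sketch of such a dataset, under stated assumptions: each feature is taken to be <math>\sin(2\pi\eta t + \theta)</math> with its own frequency and phase, and the sampling ranges for <math>\eta</math> and <math>\theta</math> below are illustrative choices rather than values confirmed by this summary.<br />

```python
import numpy as np

def sample_sines(n_steps, dim, seed=0):
    """Return an (n_steps, dim) array with one independent sinusoid per feature."""
    rng = np.random.default_rng(seed)
    freq = rng.uniform(0.0, 1.0, size=dim)         # eta per feature (assumed range)
    phase = rng.uniform(-np.pi, np.pi, size=dim)   # theta per feature (assumed range)
    t = np.arange(n_steps)[:, None]                # column of time indices
    # Broadcasting gives one column per (freq, phase) pair.
    return np.sin(2.0 * np.pi * freq * t + phase)

sines = sample_sines(n_steps=24, dim=5)
```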
<br />
===Stocks===<br />
By contrast, sequences of stock prices are continuous-valued but aperiodic; furthermore, features are correlated with each other. They use the daily historical Google stocks data from 2004 to 2019, including as features the volume and high, low, opening, closing, and adjusted closing prices.<br />
<br />
===Energy===<br />
They consider a dataset characterized by noisy periodicity, higher dimensionality, and correlated features. The UCI Appliances energy prediction dataset consists of multivariate, continuous-valued measurements including numerous temporal features measured at close intervals.<br />
<br />
===Events===<br />
Finally, they considered a dataset characterized by discrete values and irregular time stamps. They used a large private lung cancer pathways dataset consisting of sequences of events and their times, and model both the one-hot encoded sequence of event types and the event timings.<br />
<br />
Figure 2 shows t-SNE/PCA visualization comparison for Sines and Stocks and it is clear from the figure that among all different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 compares the predictive and discriminative scores of the different methods across the datasets. TimeGAN outperforms the other methods in both scores, indicating better quality of the generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce realistic time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is truly a novel one. The idea of enhancing the unsupervised components of a GAN with a supervised element has produced significant gains in certain evaluations. I think the methods of evaluation used in this paper, namely t-SNE/PCA analysis (visualization), discriminative score, and predictive score, are very appropriate for this sort of analysis, where the focus is on multiple things (generative accuracy and conditional dependence) both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea of the synthesized time-series being useful in terms of its predictive ability is good, especially in practice. But I think when the authors set out to create a model that can learn the temporal dynamics between time-series data then there could have been some additional metric that could better evaluate if the underlying temporal relations have been captured by the model or not. I feel the addition of some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of traditional GAN by simply adding an extra loss function to the mix is quite elegant. TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of possibilities in this area as many direct and indirect variations of TimeGAN or other architectures inspired by TimeGAN can be developed in a very straightforward manner of hyper-parameterizing the building blocks. <br />
<br />
TimeGAN benefits from merging supervised and unsupervised learning to generate its samples, while other methods in the literature rely on learning from a conditional input. Even taking the supervised and unsupervised learning into account, I believe the way the authors introduce temporal embeddings to assist network training is not well suited to anomaly detection (outlier detection), as it is designed only for time series representation learning, as discussed in [10].<br />
<br />
The paper certainly proposes a novel approach to analysing time series data, but there are concerns about the way the model is tested in practice. First, if the data is generated from a <math>VAR(1)</math> model, why did the authors not use a multi-dimensional auto-ARIMA procedure, or a Box-Jenkins approach, to fit a model to their synthetic dataset? Moreover, as has been studied in the M4 competitions (see e.g. https://www.sciencedirect.com/science/article/pii/S0169207019301128), the ability of complex ML models or deep learning models to beat linear models in general is questionable. The theoretical reason for this empirical finding is that the Wold decomposition theorem says that a stationary process can be decomposed into the sum of a deterministic process and a linear process, which gives a lot of credence to the ARIMA model. It would be highly beneficial if the authors included the Box-Jenkins benchmark in their experiments, as well as testing their model against real data to see if it actually performs well.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016.<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, and Baoren Xiao. Conditional Sig-Wasserstein GANs for time series generation, 2020.<br />
<br />
[10] Alexander Geiger et al. TadGAN: Time series anomaly detection using generative adversarial networks. arXiv preprint arXiv:2009.07769, 2020.<br />
<br />
[11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=49519Functional regularisation for continual learning with gaussian processes2020-12-06T21:10:52Z<p>Jlavilez: /* A large class of examples of Gaussian processes */</p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposes a new framework for regularising CL so that the model does not forget previously learned tasks. This method, referred to as functional regularisation for Continual Learning, leverages Gaussian processes to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used in optimisation as a regulariser to prevent the model from deviating completely from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimisation of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularisation-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by placing a prior distribution over the model parameters in the hope of regularising the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which let model parameters adapt to new tasks while regularising the weights using prior knowledge from earlier tasks. Nonetheless, forgetting of earlier tasks can still increase when the sequence of tasks is long.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimised using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularisation-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularises the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalisation of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimisation in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterised by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrised by the kernel function.<br />
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers to the distribution of functions, and <math>\mathcal{N}</math> refers to the distribution of finitely many random variables.<br />
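<br />
The link between a GP and a multivariate normal can be illustrated with a short sketch: evaluating a kernel on finitely many inputs gives a covariance matrix, and sampling from the corresponding zero-mean normal gives function values at those inputs. The squared-exponential kernel, its hyperparameters, and the jitter term below are assumptions chosen for illustration; they are not the kernel used in the paper.<br />

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

x = np.linspace(0.0, 5.0, 50)                  # finitely many input points
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
rng = np.random.default_rng(0)
# Each row is one draw of (f(x_1), ..., f(x_50)) from the zero-mean GP prior.
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```

Plotting each row of `samples` against `x` would show three smooth random functions drawn from the prior.<br />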
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observations taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== A large class of examples of Gaussian processes ==<br />
<br />
The prototypical example of a Gaussian process in continuous time is Brownian motion. It turns out that Brownian motion is a key ingredient in constructing a large class of Gaussian processes, which can be achieved through the Wiener integral. We write this down as a Proposition and prove it below.<br />
<br />
'''Proposition''' Let <math>B = \{ B_t \}_{t \geq 0}</math> be a Brownian motion on a filtered probability space and <math>f</math> a square integrable deterministic function. Then the process <math>X_t</math> given by the Wiener integral <math>X_t = \int_0^t f(s) dB_s</math> is a Gaussian process.<br />
<br />
For a reference to several elementary constructions of the Wiener integral, we refer the reader to the textbook by Kuo [6]. Intuitively, the stochastic process <math>X_t</math> can be thought of as the gain or loss of the strategy <math>f(s)</math> when playing a fair game induced by tracking a Brownian motion.<br />
<br />
''Proof'' The Wiener integral of a step function is a finite sum of independent centred normal random variables, so for a step function <math>f</math> every finite linear combination of the <math>X_t</math> is centred normal, i.e. the process is a centred Gaussian process. For general <math>f</math>, the Wiener integral is constructed as an <math>L^2</math> limit of integrals of step functions; since an <math>L^2</math> limit of centred Gaussian random variables is again centred Gaussian, the limiting process remains a centred Gaussian process. QED<br />
<br />
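Using the above proposition, one simple example of a Gaussian process is the Brownian bridge, defined as <math>Y_t = (1-t) \int_0^t \frac{dB_s}{1-s}</math> for <math>0 \leq t < 1</math>. Since the integrand <math>1/(1-s)</math> is deterministic and square integrable on <math>[0,t]</math> for every <math>t < 1</math>, the Wiener integral is a Gaussian process by the proposition, and multiplying by the deterministic factor <math>(1-t)</math> preserves Gaussianity.<br />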
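<br />
A short sketch, assuming the Itô isometry for Wiener integrals, of why this process deserves the name: for <math>0 \leq s \leq t < 1</math>,<br />
\[\text{Cov}(Y_s, Y_t) = (1-s)(1-t)\int_0^{s} \frac{du}{(1-u)^2} = (1-s)(1-t)\left(\frac{1}{1-s} - 1\right) = s(1-t).\]<br />
In particular <math>\text{Var}(Y_t) = t(1-t)</math>, which tends to 0 as <math>t \to 1</math>, matching the covariance <math>\min(s,t) - st</math> usually used to define a Brownian bridge on <math>[0,1]</math>.<br />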
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, we assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to the English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are the input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are the output targets, so that each <math> y_{i,j} </math> is paired with the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\]<br />
Instead of having a fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarise information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we face the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vectors <math>X_i</math>. To see this, note that <math>\Sigma_i</math> is an <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the sparse Gaussian process with inducing points, which is introduced as follows.<br />
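<br />
The kernel induced by the Gaussian prior on the last-layer weights, and the storage cost that motivates inducing points, can be illustrated with a small sketch. The random matrix below is only a stand-in for the network features <math>\phi(x;\theta)</math>, and the sizes and <math>\sigma_w</math> are arbitrary assumptions.<br />

```python
import numpy as np

def feature_kernel(phi_a, phi_b, sigma_w=1.0):
    """k(x, x') = sigma_w^2 * phi(x)^T phi(x'), for rows of two feature matrices."""
    return sigma_w**2 * phi_a @ phi_b.T

rng = np.random.default_rng(0)
N, K_feat = 200, 8
# Stand-in for phi(x; theta) evaluated on N training inputs (hypothetical values).
Phi = rng.standard_normal((N, K_feat))
K = feature_kernel(Phi, Phi, sigma_w=0.5)

# A full covariance over all N function values has N * N entries,
# even though the kernel has rank at most K_feat.
n_entries = K.size
rank = np.linalg.matrix_rank(K)
```

Here `n_entries` is 40000 while `rank` cannot exceed 8, which gives some intuition for why a small set of inducing points can summarise a task.<br />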
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f_i(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrise <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
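<br />
A minimal sketch of this parametrisation: storing a lower triangular factor with a positive diagonal guarantees that the implied covariance is symmetric positive definite, and a sample from <math>q(\boldsymbol{u}_1)</math> is obtained as <math>\mu_{u_1} + L_{u_1}\epsilon</math> with standard normal <math>\epsilon</math>. The exponential transform on the diagonal and all numerical values are assumptions for illustration.<br />

```python
import numpy as np

M = 4                                   # number of inducing points (illustrative)
rng = np.random.default_rng(0)
mu_u = rng.standard_normal(M)           # variational mean
raw = rng.standard_normal((M, M))       # unconstrained parameters

# Keep the strict lower triangle and force a positive diagonal via exp
# (one common choice, assumed here).
L_u = np.tril(raw, k=-1) + np.diag(np.exp(np.diag(raw)))
Sigma_u = L_u @ L_u.T                   # implied covariance of q(u)

# Reparameterised sample: u = mu + L @ eps with eps ~ N(0, I).
eps = rng.standard_normal(M)
u_sample = mu_u + L_u @ eps
```

Optimising `mu_u` and `raw` directly therefore never leaves the set of valid Gaussian distributions.<br />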
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrised by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimising the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximising the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
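<br />
Because both distributions in the KL term are Gaussian, that term has a closed form. The sketch below computes it for a variational distribution given by a mean and a triangular factor and a prior given by a mean and a covariance matrix; the function name and argument layout are assumptions for illustration, not the authors' implementation.<br />

```python
import numpy as np

def kl_gaussian(mu_q, L_q, mu_p, Sigma_p):
    """KL( N(mu_q, L_q L_q^T) || N(mu_p, Sigma_p) ), with L_q triangular."""
    M = mu_q.shape[0]
    L_p = np.linalg.cholesky(Sigma_p)
    # Work with the prior's Cholesky factor instead of forming Sigma_p^{-1}.
    A = np.linalg.solve(L_p, L_q)              # L_p^{-1} L_q
    d = np.linalg.solve(L_p, mu_p - mu_q)      # L_p^{-1} (mu_p - mu_q)
    trace_term = np.sum(A**2)                  # tr(Sigma_p^{-1} Sigma_q)
    maha_term = d @ d                          # Mahalanobis distance of the means
    logdet_p = 2.0 * np.sum(np.log(np.diag(L_p)))
    logdet_q = 2.0 * np.sum(np.log(np.abs(np.diag(L_q))))
    return 0.5 * (trace_term + maha_term - M + logdet_p - logdet_q)
```

In the ELBO above, the prior would be the zero-mean GP prior evaluated at the inducing inputs, so `mu_p` would be zero and `Sigma_p` the kernel matrix at <math>Z_1</math>.<br />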
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularise the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximised is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
As a result, we regularise the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimisation computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularisation term.<br />
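<br />
One way to realise this sub-sampling, sketched under the assumption that the KL terms are sampled uniformly without replacement and rescaled so the estimate stays unbiased (the summary does not specify the exact scheme), is shown below. The function name is hypothetical, and in practice each term would be recomputed with the current <math>\theta</math> rather than passed in as a number.<br />

```python
import random

def subsampled_kl_sum(kl_terms, n_samples, seed=None):
    """Unbiased estimate of sum(kl_terms) from a uniform subsample."""
    if not 1 <= n_samples <= len(kl_terms):
        raise ValueError("n_samples must be between 1 and len(kl_terms)")
    rng = random.Random(seed)
    chosen = rng.sample(kl_terms, n_samples)   # sampling without replacement
    # Rescale so that the expectation equals the full sum over previous tasks.
    return len(kl_terms) / n_samples * sum(chosen)
```

With `n_samples` equal to the number of previous tasks, the estimate reduces to the exact regularisation term.<br />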
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>. That is, performing inference over the current task in the weight space. Because the number of inducing points imposes a trade-off between accuracy and scalability, we can instead use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(w_k)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over the function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
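<br />
The conversion from the weight-space posterior to the function-space summary is a linear map, which the following sketch checks numerically. All matrices are random stand-ins with hypothetical sizes; note that the resulting factor is an <math>M \times K</math> matrix rather than a square triangular one.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
M, K_feat = 5, 3                                   # inducing points, feature dimension
Phi_Z = rng.standard_normal((M, K_feat))           # rows: features evaluated at Z_k (stand-in)
mu_w = rng.standard_normal(K_feat)                 # weight-space posterior mean
L_w = np.tril(rng.standard_normal((K_feat, K_feat)))
Sigma_w = L_w @ L_w.T                              # weight-space posterior covariance

# Push the Gaussian over weights through the linear map u = Phi_Z @ w.
mu_u = Phi_Z @ mu_w
L_u = Phi_Z @ L_w
Sigma_u = L_u @ L_u.T
```

The covariance obtained this way agrees with <math>\Phi_{Z_k}\Sigma_{w_k}\Phi_{Z_k}^T</math>, as expected for a linear transformation of a Gaussian.<br />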
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random subset <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that depends only on the training inputs and quantifies how well the full kernel matrix <math>K_{X_k}</math> can be reconstructed from the inducing inputs. This is done by minimising the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrised by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure, where the final selected inducing points are spread out across different clusters of data. On the right side of the image, the round dots represent the data points and each colour corresponds to a different label. The left part of the image shows how the optimised inducing points cover examples from all classes, as opposed to the randomly selected inducing points, which may contain a skewed number of points from the same class.<br />
<br />
[[File:inducing-points-extended.png|centre]]<br />
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_i x_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_i x_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_i x_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasise that, since <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>, the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\sigma_w^2\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete predictions.<br />
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in practice. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density on the new task data and the prior GP density using the symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math>: the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameters, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimised using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes increasingly important as the number of inducing points decreases.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularisation-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way of remembering previous tasks by reducing the KL divergence between the variational distribution <math>q(\boldsymbol{u}_1)</math> and the prior <math>p_\theta(\boldsymbol{u}_1)</math>. The ideas in the paper are interesting and the experiments support the effectiveness of this approach. After reading the summary, the following points came to my mind:<br />
<br />
The main problem with Gaussian processes is their test-time computational load: a Gaussian process needs a data matrix and a kernel for each prediction. This is natural, since a Gaussian process is non-parametric and has no source of knowledge other than the data, but the resulting computational and memory costs make Gaussian processes difficult to employ in practice. In this paper, the authors propose to employ a subset of the training data, namely "inducing points", to mitigate these challenges. They choose the inducing points either at random or through an optimisation scheme in which the inducing points should approximate the kernel. Although the authors formulate the inducing-point problem in their own setting, this is not a new approach in the field; it was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper uses the kernel matrix, whereas the mentioned paper uses dissimilarities, to find exemplars or inducing points.<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice covariance matrices are positive semi-definite in general while to the best of my knowledge Cholesky decomposition can be used for positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.<br />
<br />
Finally, the number of regularisation terms in <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, so I am not sure how this algorithm performs as the number of tasks increases. Clearly, apart from the computational cost, having many regularisation terms can make optimisation more difficult.<br />
<br />
The provided experiments seem interesting and quite enough and did a good job highlighting different facets of the paper but it would be better if these two additional results can be provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.<br />
<br />
[6] Kuo, Hui-Hsiung, 2006, Introduction to Stochastic Integration, Springer.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=49511Functional regularisation for continual learning with gaussian processes2020-12-06T21:00:41Z<p>Jlavilez: </p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposed a new framework to regularise CL so that the model does not forget previously learned tasks. This method, referred to as functional regularisation for Continual Learning, leverages Gaussian processes to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used in optimisation as a regulariser to prevent the model from completely deviating from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimisation of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularisation-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope of regularising the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularising weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in increased forgetting of earlier tasks over long sequences of tasks.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimised using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularisation-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularises the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalisation of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimisation in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterised by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrised by the kernel function.<br />
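<br />
As a minimal sketch of this finite-dimensional view (not taken from the paper), the Python snippet below builds the covariance matrix from a kernel at finitely many inputs and draws function values from the resulting multivariate normal. The squared-exponential kernel, the input grid, the jitter value, and the names <code>rbf_kernel</code> and <code>sample_gp_prior</code> are all illustrative assumptions.<br />
<br />
```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between two 1-D input arrays (illustrative choice)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw samples of f(x) ~ N(0, K(x, x)) at the finite set of inputs x."""
    rng = np.random.default_rng(seed)
    # A small jitter on the diagonal keeps K numerically positive definite.
    K = rbf_kernel(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    # If e ~ N(0, I), then L @ e ~ N(0, L L^T) = N(0, K).
    return (L @ rng.standard_normal((len(x), n_samples))).T

x_grid = np.linspace(-3.0, 3.0, 50)
samples = sample_gp_prior(x_grid)  # one sampled function per row
```
<br />
Each row of <code>samples</code> is one draw of the function values at the grid points; changing the kernel changes the kind of functions that are drawn, which is the sense in which the GP is characterised by its kernel.<br />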
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers to the distribution of functions, and <math>\mathcal{N}</math> refers to the distribution of finitely many random variables.<br />
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observations taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== A large class of examples of Gaussian processes ==<br />
<br />
The prototypical example of a Gaussian process in continuous time is Brownian motion. It turns out that Brownian motion is a key ingredient in constructing a large class of Gaussian processes, which can be achieved through the Wiener integral. We state this as a proposition and prove it below.<br />
<br />
'''Proposition''' Let <math>B = \{ B_t \}_{t \geq 0}</math> be a Brownian motion on a filtered probability space and <math>f</math> a square integrable deterministic function. Then the process <math>X_t</math> given by the Wiener integral <math>X_t = \int_0^t f(s) dB_s</math> is a Gaussian process.<br />
<br />
For several elementary constructions of the Wiener integral, we refer the reader to the textbook by Kuo [6]. Intuitively, the stochastic process <math>X_t</math> can be thought of as the gain or loss of the strategy <math>f(s)</math> when playing a fair game induced by tracking a Brownian motion.<br />
<br />
''Proof'' The Wiener integral of a simple function is the sum of independent centred normal random variables, which is in turn a centred Gaussian process. Since the space of Gaussian processes is closed in <math>L^2 (\Omega \times [0,T])</math>, as we take the limit as in the construction of the Wiener integral, the process converges and remains a centred Gaussian process. QED<br />
<br />
Using the above proposition, one simple example of a Gaussian process is a Brownian bridge, which can be constructed using the technique outlined above.<br />
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\]<br />
Instead of having a fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarise information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we face the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vectors <math>X_i</math>. To see this, note that <math>\Sigma_i</math> is an <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the sparse Gaussian process with inducing points, which is introduced as follows.<br />
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f_i(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrise <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrised by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimising the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximising the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
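<br />
Because <math>q(\boldsymbol{u}_1)</math> is Gaussian and, under the zero-mean GP prior, <math>p_\theta(\boldsymbol{u}_1)=\mathcal{N}(\boldsymbol{0}, K_{Z_1})</math>, the KL term in the ELBO has the standard closed form for two Gaussians. The sketch below evaluates it with NumPy. The function name <code>kl_q_to_prior</code> and the toy inputs are assumptions made for illustration; this is not the authors' implementation, and it assumes <math>L_{u_1}</math> is a square triangular Cholesky factor.<br />
<br />
```python
import numpy as np

def kl_q_to_prior(mu_u, L_u, K_z):
    """KL( N(mu_u, L_u L_u^T) || N(0, K_z) ) for M inducing values.

    Assumes L_u is a square (triangular) Cholesky factor and K_z is positive definite.
    """
    M = mu_u.shape[0]
    L_k = np.linalg.cholesky(K_z)
    # Solve linear systems instead of forming K_z^{-1} explicitly.
    A = np.linalg.solve(L_k, L_u)    # L_k^{-1} L_u, so tr(K_z^{-1} S) = ||A||_F^2
    b = np.linalg.solve(L_k, mu_u)   # L_k^{-1} mu,  so mu^T K_z^{-1} mu = ||b||^2
    trace_term = np.sum(A**2)
    mahalanobis = b @ b
    logdet_K = 2.0 * np.sum(np.log(np.diag(L_k)))
    logdet_S = 2.0 * np.sum(np.log(np.abs(np.diag(L_u))))
    return 0.5 * (trace_term + mahalanobis - M + logdet_K - logdet_S)

# Toy check: the KL vanishes when q equals the prior and grows when the mean is shifted.
K_z = np.array([[2.0, 0.5], [0.5, 1.0]])
kl_same = kl_q_to_prior(np.zeros(2), np.linalg.cholesky(K_z), K_z)
kl_shift = kl_q_to_prior(np.array([1.0, 0.0]), np.linalg.cholesky(K_z), K_z)
```
<br />
In the ELBO this quantity is subtracted from the expected log-likelihood, so a variational distribution that drifts far from the prior is penalised.<br />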
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularise the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximised is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
As a result, we regularise the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimisation computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularisation term.<br />
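<br />
One simple way to sub-sample the regulariser, sketched below under our own assumptions rather than as the paper's exact scheme, is to pick earlier tasks uniformly at random and rescale by the number of earlier tasks, which gives an unbiased estimate of the full sum. The name <code>subsampled_regulariser</code> and the toy KL values are hypothetical.<br />
<br />
```python
import random

def subsampled_regulariser(kl_fns, n_samples=1, rng=random):
    """Unbiased estimate of sum(fn() for fn in kl_fns) from a uniform sub-sample.

    kl_fns holds one zero-argument callable per earlier task, so only the
    sampled KL terms are actually evaluated.
    """
    n_tasks = len(kl_fns)
    picked = [rng.choice(kl_fns)() for _ in range(n_samples)]  # sample with replacement
    return n_tasks * sum(picked) / n_samples

# Toy usage with made-up KL values for three earlier tasks.
toy_kl_fns = [lambda: 1.0, lambda: 2.0, lambda: 3.0]
estimate = subsampled_regulariser(toy_kl_fns, n_samples=1, rng=random.Random(0))
```
<br />
A single estimate is noisy, but its expectation equals the full sum, so it can replace the regularisation term inside a stochastic gradient step.<br />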
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>: performing inference over the current task in the weight space. Due to the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(w_k)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over the function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, <math>L_{w_k}</math> is the Cholesky factor of <math>\Sigma_{w_k}</math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
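<br />
The mapping from the weight-space posterior to the function-space summary is a pair of matrix products. The sketch below illustrates it with NumPy, taking <math>L_{w_k}</math> to be the Cholesky factor of <math>\Sigma_{w_k}</math>; the function name <code>weights_to_inducing</code> and the toy numbers are assumptions made for illustration.<br />
<br />
```python
import numpy as np

def weights_to_inducing(Phi_z, mu_w, Sigma_w):
    """Map q(w) = N(mu_w, Sigma_w) to q(u) = N(mu_u, L_u L_u^T) with u = Phi_z w."""
    L_w = np.linalg.cholesky(Sigma_w)  # Sigma_w = L_w L_w^T
    mu_u = Phi_z @ mu_w
    L_u = Phi_z @ L_w                  # M x K factor; not triangular in general
    return mu_u, L_u

# Toy example: M = 2 inducing inputs, K = 3 features.
Phi_z = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]])
mu_w = np.array([0.5, -1.0, 0.25])
Sigma_w = np.diag([0.1, 0.2, 0.3])
mu_u, L_u = weights_to_inducing(Phi_z, mu_w, Sigma_w)
```
<br />
Note that <code>L_u</code> here is only a factor of the covariance: the product <code>L_u @ L_u.T</code> equals <math>\Phi_{Z_k}\Sigma_{w_k}\Phi_{Z_k}^T</math>, which is what the summary of the task needs.<br />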
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random subset <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that depends only on the training inputs and quantifies how well the full kernel matrix <math>K_{X_k}</math> can be reconstructed from the inducing inputs. This is done by minimising the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrised by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure, where the final selected inducing points are spread out across different clusters of data. On the right side of the image, the round dots represent the data points and each colour corresponds to a different label. The left part of the image shows how the optimised inducing points cover examples from all classes, as opposed to the randomly selected inducing points, which may contain a skewed number of points from the same class.<br />
<br />
[[File:inducing-points-extended.png|centre]]<br />
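<br />
To make the trace criterion concrete, the sketch below evaluates <math>\mathcal{T}(Z_k)</math> for two candidate subsets of a toy dataset, using the linear kernel <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math> on precomputed feature vectors. The feature values, the jitter, and the name <code>trace_criterion</code> are illustrative assumptions, and the paper's procedure for searching over subsets is not reproduced here.<br />
<br />
```python
import numpy as np

def trace_criterion(feats_x, inducing_idx, sigma_w2=1.0, jitter=1e-6):
    """tr(K_X - K_XZ K_Z^{-1} K_ZX) for Z = X[inducing_idx] under a linear kernel."""
    feats_z = feats_x[inducing_idx]
    K_xz = sigma_w2 * feats_x @ feats_z.T
    K_z = sigma_w2 * feats_z @ feats_z.T + jitter * np.eye(len(inducing_idx))
    # Only the diagonals are needed for the trace, so K_X is never formed in full.
    diag_K_x = sigma_w2 * np.sum(feats_x**2, axis=1)
    diag_nystrom = np.sum(K_xz * np.linalg.solve(K_z, K_xz.T).T, axis=1)
    return float(np.sum(diag_K_x - diag_nystrom))

# Two clusters of feature vectors: rows 0-1 and rows 2-3 (made-up values).
feats = np.array([[1.0, 0.0, 0.1],
                  [1.0, 0.1, 0.0],
                  [0.0, 1.0, 0.1],
                  [0.1, 1.0, 0.0]])
trace_same_cluster = trace_criterion(feats, [0, 1])  # both inducing points in one cluster
trace_spread = trace_criterion(feats, [0, 2])        # one inducing point per cluster
```
<br />
In this toy example the spread-out subset yields the smaller trace, matching the intuition that the criterion favours inducing points that cover the input space.<br />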
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_i x_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_i x_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_i x_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasise that, since <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>, the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\sigma_w^2\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete predictions.<br />
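<br />
The predictive mean and variance above can be computed directly from a stored task summary and the current features. The sketch below does this for the linear kernel <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>, so that the cross-covariance vector is <math>\sigma_w^2\Phi_{Z_i}\phi(x_{i,*};\theta)</math>. The function name <code>predict_latent</code>, the jitter, and the toy features are assumptions made for illustration.<br />
<br />
```python
import numpy as np

def predict_latent(phi_star, Phi_z, mu_u, L_u, sigma_w2=1.0, jitter=1e-6):
    """Mean and variance of q(f_*) given the summary (Z, mu_u, L_u) and current features."""
    K_z = sigma_w2 * Phi_z @ Phi_z.T + jitter * np.eye(Phi_z.shape[0])
    k_zs = sigma_w2 * Phi_z @ phi_star      # cross-covariances k(z_j, x_*)
    k_ss = sigma_w2 * phi_star @ phi_star   # prior variance k(x_*, x_*)
    a = np.linalg.solve(K_z, k_zs)          # K_Z^{-1} k_{Z x_*}
    mean = mu_u @ a
    var = k_ss + a @ (L_u @ L_u.T - K_z) @ a
    return mean, var

Phi_z = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])  # features at two inducing inputs
phi_star = np.array([0.3, 0.7, 0.2])                  # features at the test input
K_z = Phi_z @ Phi_z.T + 1e-6 * np.eye(2)

# If q(u) equals the prior, the prediction falls back to the prior mean and variance.
prior_mean, prior_var = predict_latent(phi_star, Phi_z, np.zeros(2), np.linalg.cholesky(K_z))
# If q(u) has no remaining uncertainty, the predictive variance shrinks.
_, tight_var = predict_latent(phi_star, Phi_z, np.zeros(2), np.zeros((2, 2)))
```
<br />
Because <code>Phi_z</code> and <code>phi_star</code> are recomputed from the current <math>\theta</math>, the same stored <math>\mu_{u_i}</math> and <math>L_{u_i}</math> give predictions that move with the shared feature extractor.<br />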
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in practice. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density on the new task data and the prior GP density using the symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math>: the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameters, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimised using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes increasingly important as the number of inducing points decreases.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularisation-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way of remembering previous tasks by reducing the KL divergence between the variational distribution <math>q(\boldsymbol{u}_1)</math> and the prior <math>p_\theta(\boldsymbol{u}_1)</math>. The ideas in the paper are interesting and the experiments support the effectiveness of this approach. After reading the summary, the following points came to mind:<br />
<br />
The main problem with Gaussian processes is their test-time computational load: a Gaussian process needs the data matrix and a kernel for each prediction. This is natural, since a Gaussian process is non-parametric and has no source of knowledge other than the data, but the resulting computational and memory costs make it difficult to employ in practice. In this paper, the authors propose to employ a subset of the training data, namely "Inducing Points", to mitigate these challenges. They propose to choose the Inducing Points either at random or through an optimisation scheme in which the Inducing Points should approximate the kernel. Although the authors formulate the selection of Inducing Points within their own setting, this is not a new approach in the field; it was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper employs the kernel matrix to find exemplars or inducing points, whereas the mentioned paper employs dissimilarities.<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice, covariance matrices are in general only positive semi-definite, while to the best of my knowledge the Cholesky decomposition can be used for positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.<br />
<br />
Finally, the number of regularisation terms in <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, and I am not sure how this algorithm behaves as the number of tasks increases. Clearly, apart from the computational cost, having many regularisation terms can make optimisation more difficult.<br />
<br />
The provided experiments are interesting and sufficient, and do a good job of highlighting different facets of the paper, but it would be better if two additional results were provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.<br />
<br />
[6] Kuo, Hui-Hsiung, 2006, Introduction to Stochastic Integration, Springer, Berlin Heidelberg.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=49510Functional regularisation for continual learning with gaussian processes2020-12-06T21:00:25Z<p>Jlavilez: </p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposes a new framework to regularise Continual Learning so that the model does not forget previously learned tasks. This method, referred to as functional regularisation for Continual Learning, leverages Gaussian processes to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used in optimisation as a regulariser to prevent the model from deviating completely from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimisation of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularisation-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope of regularising the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularising weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in increased forgetting of earlier tasks over long sequences of tasks.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimised using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularisation-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularises the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalisation of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimisation in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterised by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrised by the kernel function.<br />
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers to the distribution over functions, and <math>\mathcal{N}</math> refers to the distribution of finitely many random variables.<br />
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observations taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== A large class of examples of Gaussian processes ==<br />
<br />
The prototypical example of a Gaussian process in continuous time is Brownian motion. It turns out that Brownian motion is a key ingredient in constructing a large class of Gaussian processes, which can be achieved through the Wiener integral. We state this as a Proposition and prove it below.<br />
<br />
'''Proposition''' Let <math>B = \{ B_t \}_{t \geq 0}</math> be a Brownian motion on a filtered probability space and <math>f</math> a square integrable deterministic function. Then the process <math>X_t</math> given by the Wiener integral <math>X_t = \int_0^t f(s) dB_s</math> is a Gaussian process.<br />
<br />
For a reference to several elementary constructions of the Wiener integral, we refer the reader to the textbook by Kuo [6]. Intuitively, the stochastic process <math>X_t</math> can be thought of as the gain or loss of the strategy <math>f(s)</math> when playing a fair game induced by tracking a Brownian motion.<br />
<br />
''Proof'' The Wiener integral of a simple function is a sum of independent centred normal random variables, so the resulting process is a centred Gaussian process. Since the space of Gaussian processes is closed in <math>L^2 (\Omega \times [0,T])</math>, as we take the limit as in the construction of the Wiener integral, the process converges and remains a centred Gaussian process. <math>\square</math><br />
<br />
Using the above proposition, one simple example of a Gaussian process is a Brownian bridge, which can be constructed using the technique outlined above.<br />
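<br />
As a worked consequence of the proposition (a standard fact about Wiener integrals, not a result taken from the paper), the Itô isometry gives the covariance function of <math>X_t</math> explicitly:<br />
\[\mathbb{E}[X_sX_t] = \int_0^{\min(s,t)} f(r)^2 \, dr.\]<br />
Taking <math>f \equiv 1</math> recovers the Brownian motion kernel <math>K(s,t) = \min(s,t)</math>, so each choice of <math>f</math> yields a different zero-mean GP kernel.<br />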
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to the English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i).\]<br />
Instead of having a fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarise information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we face the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vectors <math>X_i</math>. To see this, note that <math>\Sigma_i</math> is an <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using a sparse Gaussian process with inducing points, which is introduced as follows.<br />
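<br />
To make the kernel <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math> and the <math>N_i \times N_i</math> storage cost concrete, here is a minimal sketch in plain Python. The one-layer <code>features</code> function, the values in <code>theta</code> and the toy inputs are all hypothetical stand-ins for the network's final hidden layer, not the authors' architecture.<br />

```python
import math

def features(x, theta):
    """Toy stand-in for phi(x; theta): a single tanh layer (hypothetical)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in theta]

def kernel(x1, x2, theta, sigma_w=1.0):
    """k(x, x') = sigma_w^2 * phi(x; theta)^T phi(x'; theta)."""
    p1, p2 = features(x1, theta), features(x2, theta)
    return sigma_w ** 2 * sum(a * b for a, b in zip(p1, p2))

def gram(X, theta, sigma_w=1.0):
    """Full kernel matrix over the inputs X: N x N entries."""
    return [[kernel(a, b, theta, sigma_w) for b in X] for a in X]

theta = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]         # K = 3 features, D = 2 inputs
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]  # N = 4 toy inputs
K_X = gram(X, theta, sigma_w=2.0)
```

The matrix <code>K_X</code> has <math>N^2</math> entries, which is the quadratic cost that motivates the inducing points introduced next.<br />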
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f_i(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be much smaller than the number of training data points. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrise <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrised by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimising the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximising the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
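<br />
For reference, the KL term has a closed form because both distributions are Gaussian. Assuming the zero-mean GP prior, so that <math>p_\theta(\boldsymbol{u}_1) = \mathcal{N}(\boldsymbol{0}, K_{Z_1})</math> with <math>K_{Z_1} = K(Z_1, Z_1)</math>, and writing <math>M_1</math> for the number of inducing points, the standard Gaussian KL identity (stated here as a textbook fact rather than quoted from the paper) gives<br />
\[\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)) = \frac{1}{2}\left[\text{tr}(K_{Z_1}^{-1}L_{u_1}L_{u_1}^T) + \mu_{u_1}^TK_{Z_1}^{-1}\mu_{u_1} - M_1 + \log\det K_{Z_1} - \log\det(L_{u_1}L_{u_1}^T)\right].\]<br />
Every term depends on <math>\theta</math> only through <math>K_{Z_1}</math>, which is how the regulariser ties the shared features to the stored task summary.<br />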
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularise the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximised is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
As a result, we regularise the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimisation computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularisation term.<br />
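<br />
The sub-sampling idea can be sketched as follows: draw a few earlier tasks uniformly at random and rescale their KL terms so that the estimate is unbiased for the full sum. The numbers in <code>kl_terms</code> are made-up placeholders, and uniform sampling with replacement is an assumption made for illustration; the summary does not specify the authors' sampling scheme.<br />

```python
import random

def subsampled_regulariser(kl_terms, n_samples, rng):
    """Unbiased estimate of sum(kl_terms) using n_samples uniformly drawn terms.

    In a real training loop each KL term would be computed only when it is
    sampled; precomputed values are used here to keep the sketch short.
    """
    k = len(kl_terms)
    picks = [rng.randrange(k) for _ in range(n_samples)]
    return k / n_samples * sum(kl_terms[i] for i in picks)

kl_terms = [0.8, 1.3, 0.4, 2.1, 0.9]   # hypothetical KL values for 5 earlier tasks
rng = random.Random(0)
exact = sum(kl_terms)

# Averaging many two-term estimates approaches the exact five-term sum.
estimates = [subsampled_regulariser(kl_terms, 2, rng) for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
```

Each estimate touches only two KL terms, so the per-step cost no longer grows with the number of earlier tasks, at the price of extra gradient noise.<br />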
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>: performing inference over the current task in the weight space. Due to the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(w_k)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over the function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
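<br />
The mapping from the weight-space posterior to the function-space summary is just a pair of matrix products. The sketch below uses small made-up matrices (<code>Phi_Z</code>, <code>mu_w</code> and <code>L_w</code> are hypothetical values) and checks that <math>L_{u_k}L_{u_k}^T</math> equals <math>\Phi_{Z_k}\Sigma_{w_k}\Phi_{Z_k}^T</math>.<br />

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

Phi_Z = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # M = 3 inducing points, K = 2 features
mu_w = [[0.2], [-1.0]]                        # weight-space mean (column vector)
L_w = [[1.0, 0.0], [0.3, 0.5]]                # Cholesky factor of Sigma_w

mu_u = matmul(Phi_Z, mu_w)                    # mu_u = Phi_Z mu_w
L_u = matmul(Phi_Z, L_w)                      # L_u = Phi_Z L_w  (M x K)
Sigma_u = matmul(L_u, transpose(L_u))

# Cross-check against Phi_Z Sigma_w Phi_Z^T computed directly.
Sigma_w = matmul(L_w, transpose(L_w))
Sigma_u_direct = matmul(matmul(Phi_Z, Sigma_w), transpose(Phi_Z))
```

Note that in this sketch <code>L_u</code> is an <math>M \times K</math> matrix rather than a square triangular factor, so it acts as a factor only in the sense that its outer product gives the covariance.<br />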
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to take a random subset <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that depends only on the training inputs and quantifies how well the full kernel matrix <math>K_{X_k}</math> can be reconstructed from the inducing inputs. This is done by minimising the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrised by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure, where the final selected inducing points are spread out over different clusters of data. On the right side of the image, the round dots represent the data points and each colour corresponds to a different label. The left part of the image shows how the optimised inducing images cover examples from all classes, whereas the randomly chosen inducing points may include a skewed number of points from the same class.<br />
<br />
[[File:inducing-points-extended.png|centre]]<br />
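<br />
The trace criterion <math>\mathcal{T}(Z_k)</math> can be evaluated one training input at a time, since the trace is the sum of the per-point conditional variances. The following is a minimal sketch under stated assumptions: an RBF kernel is used as a stand-in for the feature-based kernel of the paper, a small <code>jitter</code> is added for numerical stability, and the toy data and helper names are hypothetical.<br />

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Stand-in kernel (assumption); the paper's kernel is built from network features."""
    d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-0.5 * d2 / lengthscale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def trace_criterion(X, Z, jitter=1e-8):
    """T(Z) = tr(K_X - K_XZ K_Z^{-1} K_ZX), summed over training inputs."""
    K_Z = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(Z)]
           for i, a in enumerate(Z)]
    total = 0.0
    for x in X:
        k = [rbf(z, x) for z in Z]          # kernel values between Z and x
        alpha = solve(K_Z, k)               # K_Z^{-1} k
        total += rbf(x, x) - sum(ki * ai for ki, ai in zip(k, alpha))
    return total

# Two well-separated clusters of toy training inputs.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
spread = trace_criterion(X, [X[0], X[3]])    # one inducing point per cluster
clumped = trace_criterion(X, [X[0], X[1]])   # both inducing points in one cluster
```

In this toy setting the spread-out choice gives a much smaller criterion value than the clumped one, which matches the intuition that the criterion favours inducing points covering all regions of the data.<br />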
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_ix_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_ix_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_ix_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
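<br />
As a short derivation sketch (standard Gaussian conditioning, spelled out here for convenience rather than taken from the paper), write <math>\boldsymbol{k}_* = K(Z_i, x_{i,*})</math> for the vector of kernel values between the inducing inputs and the test input. The prior conditional is <math>p_\theta(f_{i,*}|\boldsymbol{u}_i) = \mathcal{N}(\boldsymbol{k}_*^TK_{Z_i}^{-1}\boldsymbol{u}_i, \ k(x_{i,*}, x_{i,*}) - \boldsymbol{k}_*^TK_{Z_i}^{-1}\boldsymbol{k}_*)</math>, and averaging over <math>q(\boldsymbol{u}_i) = \mathcal{N}(\mu_{u_i}, L_{u_i}L_{u_i}^T)</math> with the laws of total expectation and total variance gives<br />
\begin{align*}<br />
\mu_{i,*} &= \mathbb{E}_{q(\boldsymbol{u}_i)}[\boldsymbol{k}_*^TK_{Z_i}^{-1}\boldsymbol{u}_i] = \mu_{u_i}^TK_{Z_i}^{-1}\boldsymbol{k}_*,\\<br />
\sigma_{i,*}^2 &= \left[k(x_{i,*}, x_{i,*}) - \boldsymbol{k}_*^TK_{Z_i}^{-1}\boldsymbol{k}_*\right] + \boldsymbol{k}_*^TK_{Z_i}^{-1}L_{u_i}L_{u_i}^TK_{Z_i}^{-1}\boldsymbol{k}_*.<br />
\end{align*}<br />
Collecting the two terms that involve <math>K_{Z_i}^{-1}</math> yields the bracketed form of the variance given above.<br />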
<br />
It is important to emphasise that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution over the weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete predictions.<br />
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in practice. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density of the new task data and the prior GP density using the symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math>: the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameters, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare the means of two groups of data with unequal variances.<br />
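<br />
The test itself is easy to sketch with the Python standard library. The scores below are invented placeholders for <math>\{\ell_i\}</math> and <math>\{\ell_i^{old}\}</math>, and the decision threshold is purely illustrative; the summary does not state the threshold used by the authors.<br />

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

old_scores = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]   # hypothetical scores from the previous batch
new_scores = [1.2, 0.9, 1.6, 1.1, 1.4, 1.0]   # hypothetical scores after a task switch

t_stat, df = welch_t(new_scores, old_scores)
task_switch = t_stat < -4.0                   # illustrative threshold (assumption)
```

A strongly negative statistic means the new batch has much smaller scores than the previous one, which under the interpretation above signals a task switch.<br />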
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts to the prior (purple line) when it encounters the new task. This explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimised using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes increasingly important as the number of inducing points decreases.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularisation-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way of remembering previous tasks by reducing the KL divergence between the variational distribution <math>q(\boldsymbol{u}_1)</math> and the prior <math>p_\theta(\boldsymbol{u}_1)</math>. The ideas in the paper are interesting and the experiments support the effectiveness of this approach. After reading the summary, the following points came to mind:<br />
<br />
The main problem with Gaussian processes is their test-time computational load: a Gaussian process needs the data matrix and a kernel for each prediction. This is natural, since a Gaussian process is non-parametric and has no source of knowledge other than the data, but the resulting computational and memory costs make it difficult to employ in practice. In this paper, the authors propose to employ a subset of the training data, namely "Inducing Points", to mitigate these challenges. They propose to choose the Inducing Points either at random or through an optimisation scheme in which the Inducing Points should approximate the kernel. Although the authors formulate the selection of Inducing Points within their own setting, this is not a new approach in the field; it was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper uses the kernel matrix to find the inducing points, whereas the paper mentioned above uses pairwise dissimilarities to find exemplars.<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice covariance matrices are positive semi-definite in general while to the best of my knowledge Cholesky decomposition can be used for positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.<br />
<br />
Finally, the number of regularisation terms in <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, so I am not sure how well this algorithm works as the number of tasks increases. Apart from the computational cost, having many regularisation terms can make optimisation more difficult.<br />
<br />
The experiments are interesting and reasonably thorough, and they do a good job of highlighting different facets of the paper, but it would be better if two additional results were provided: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.<br />
<br />
[6] Kuo, Hui-Hsiung, 2006, Introduction to Stochastic Integration, Springer, Berlin Heidelberg.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=49508Functional regularisation for continual learning with gaussian processes2020-12-06T20:59:57Z<p>Jlavilez: Examples of Gaussian processes</p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposes a new framework to regularise Continual Learning so that the model does not forget previously learned tasks. This method, referred to as functional regularisation for Continual Learning (FRCL), leverages the Gaussian process to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used in optimisation as a regulariser to prevent the model from deviating too far from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimisation of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularisation-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope of regularising the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularising weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in increased forgetting of earlier tasks when the sequence of tasks is long.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimised using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularisation-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularises the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalisation of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimisation in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterised by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrised by the kernel function.<br />
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers to the distribution over functions, and <math>\mathcal{N}</math> refers to the distribution of finitely many random variables.<br />
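To make the link between <math>\mathcal{GP}</math> and <math>\mathcal{N}</math> concrete, the following is a minimal sketch that draws functions from a zero-mean GP prior at finitely many inputs. The squared-exponential kernel, the input grid, and the jitter value are illustrative assumptions and are not taken from the paper.<br />

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an illustrative choice, not from the paper)."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)      # finitely many input points
K = rbf_kernel(x, x)                # covariance matrix given by the kernel

# A small jitter keeps the Cholesky factorisation numerically stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))

# Each column is one draw of (f(x_1), ..., f(x_50)) ~ N(0, K).
samples = L @ rng.standard_normal((len(x), 3))
print(samples.shape)
```

Each column of <code>samples</code> is a draw from the multivariate normal that the GP induces on the chosen inputs; plotting the columns against <code>x</code> gives smooth random curves.<br />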
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observations taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== A large class of examples of Gaussian processes ==<br />
<br />
The prototypical example of a Gaussian process in continuous time is a Brownian motion. It turns out that Brownian motion is a key ingredient in constructing a large class of Gaussian processes, which can be achieved through the Wiener integral. We write this down as a Proposition and prove it below.<br />
<br />
'''Proposition''' Let <math>B = \{ B_t \}_{t \geq 0}</math> be a Brownian motion on a filtered probability space and <math>f</math> a square integrable deterministic function. Then the process <math>X_t</math> given by the Wiener integral <math>X_t = \int_0^t f(s) dB_s</math> is a Gaussian process.<br />
<br />
For a reference to several elementary constructions of the Wiener integral, we refer the reader to the textbook by Kuo [6]. Intuitively, the stochastic process <math>X_t</math> can be thought of as the gain or loss of the strategy <math>f(s)</math> when playing a fair game induced by tracking a Brownian motion.<br />
<br />
''Proof'' The Wiener integral of a simple function is the sum of independent centred normal random variables, which is in turn a centred Gaussian process. Since the space of Gaussian processes is closed in <math>L^2 (\Omega \times [0,T])</math>, as we take the limit as in the construction of the Wiener integral, the process converges and remains a centred Gaussian process. <math>\blacksquare</math><br />
<br />
Using the above proposition, one simple example of a Gaussian process is a Brownian bridge, which can be constructed using the technique outlined above.<br />
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\]<br />
Instead of having a fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarise information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we are facing the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vectors <math>X_i</math>. To see this, note that <math>\Sigma_i</math> is an <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the Sparse Gaussian process with inducing points, which is introduced as follows.<br />
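As an illustration of the kernel <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math> induced by the Gaussian prior on <math>w_i</math>, the sketch below compares the kernel matrix with a Monte Carlo estimate of the covariance of <math>w_i^T\phi(x)</math>. The random matrix <code>Phi</code> is only a stand-in for the network features, and all sizes are arbitrary assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_feat, sigma_w = 50, 4, 0.5

# Stand-in for the feature vectors phi(x_j; theta), one per row (hypothetical values).
Phi = rng.standard_normal((N, K_feat))

# Kernel matrix implied by w ~ N(0, sigma_w^2 I): an N x N matrix of rank at most K_feat.
K = sigma_w**2 * Phi @ Phi.T

# Monte Carlo check: sample many weight vectors and estimate Cov[f(X)] empirically.
W = sigma_w * rng.standard_normal((K_feat, 50_000))
F = Phi @ W                           # each column is one sampled function on X
emp_cov = F @ F.T / W.shape[1]

print(np.max(np.abs(emp_cov - K)))    # small, and shrinks with more samples
print(np.linalg.matrix_rank(K))       # at most K_feat
```

The full matrix has <math>N_i^2</math> entries even though its rank is bounded by the feature dimension, which is one way to see why a compact summary of each task is attractive.<br />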
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f_i(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrise <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrised by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimising the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximising the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
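Because both <math>q(\boldsymbol{u}_1)</math> and <math>p_\theta(\boldsymbol{u}_1)</math> are Gaussian, the KL term in the ELBO has a closed form. The sketch below evaluates it, assuming a zero-mean prior <math>\mathcal{N}(\boldsymbol{0}, K_{Z_1})</math> and a square, invertible factor <math>L_{u_1}</math>; the function name <code>kl_q_p</code> is hypothetical and this is not the authors' implementation.<br />

```python
import numpy as np

def kl_q_p(mu, L, K):
    """KL( N(mu, L L^T) || N(0, K) ) for a square invertible factor L and PD matrix K."""
    M = mu.shape[0]
    Lk = np.linalg.cholesky(K)
    A = np.linalg.solve(Lk, L)                  # Lk^{-1} L
    trace_term = np.sum(A**2)                   # tr(K^{-1} L L^T)
    b = np.linalg.solve(Lk, mu)
    mahalanobis = b @ b                         # mu^T K^{-1} mu
    logdet_K = 2.0 * np.sum(np.log(np.diag(Lk)))
    logdet_S = 2.0 * np.sum(np.log(np.abs(np.diag(L))))
    return 0.5 * (trace_term + mahalanobis - M + logdet_K - logdet_S)

# One-dimensional sanity check: KL( N(1, 4) || N(0, 1) ) = 2 - log(2).
kl_1d = kl_q_p(np.array([1.0]), np.array([[2.0]]), np.array([[1.0]]))
print(kl_1d)
```

The divergence is zero exactly when <math>q(\boldsymbol{u}_1)</math> coincides with the prior, so the term penalises how far the learned summary moves away from it.<br />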
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularise the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximised is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
As a result, we regularise the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimisation computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularisation term.<br />
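The sub-sampling of the KL terms can be sketched as follows. This is only an illustration of an unbiased stochastic estimate of the sum; the helper name <code>subsampled_regulariser</code> and the callback <code>kl_of_task</code> are hypothetical, and the paper may sub-sample differently.<br />

```python
import random

def subsampled_regulariser(kl_of_task, num_tasks, batch_size, rng):
    """Unbiased estimate of sum_i KL_i using a uniform subsample of earlier tasks.

    kl_of_task(i) returns the KL term of task i; only the sampled tasks are evaluated.
    """
    chosen = rng.sample(range(num_tasks), batch_size)   # without replacement
    scale = num_tasks / batch_size                      # rescale to keep the estimate unbiased
    return scale * sum(kl_of_task(i) for i in chosen)

# Toy KL values for four earlier tasks (made-up numbers).
kl_values = [0.5, 1.5, 2.0, 4.0]
rng = random.Random(0)
estimate = subsampled_regulariser(lambda i: kl_values[i], len(kl_values), 2, rng)
print(estimate, sum(kl_values))
```

A single estimate is noisy, but its expectation equals the full sum, so it can replace the full regulariser inside a stochastic gradient step.<br />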
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>: performing inference over the current task in the weight space. Because the number of inducing points imposes a trade-off between accuracy and scalability, we can instead use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(w_k)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over the function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, <math>L_{w_k}</math> is the Cholesky factor of <math>\Sigma_{w_k}</math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
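The mapping from the weight-space posterior to the function-space summary is a linear transformation of a Gaussian, which the sketch below verifies numerically. All arrays are random stand-ins (hypothetical values), not learned quantities.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
K_feat, M = 5, 3                       # feature dimension and number of inducing points

Phi_Z = rng.standard_normal((M, K_feat))   # stand-in for features evaluated at Z_k (rows)
mu_w = rng.standard_normal(K_feat)         # stand-in for the learned mean of q(w_k)
A = rng.standard_normal((K_feat, K_feat))
Sigma_w = A @ A.T + 1e-3 * np.eye(K_feat)  # a valid (positive definite) covariance
L_w = np.linalg.cholesky(Sigma_w)

# Push q(w_k) through the linear map u_k = Phi_Z w_k.
mu_u = Phi_Z @ mu_w
L_u = Phi_Z @ L_w                      # M x K_feat factor; not triangular in general
Sigma_u = L_u @ L_u.T

print(np.allclose(Sigma_u, Phi_Z @ Sigma_w @ Phi_Z.T))
```

Note that <code>L_u</code> obtained this way is a rectangular factor of the covariance rather than a lower-triangular Cholesky factor, but <math>L_{u_k}L_{u_k}^T</math> is still the covariance of <math>\boldsymbol{u}_k</math>.<br />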
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random set <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix <math>K_{X_k}</math> can be constructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_kX_k} = K_{X_kZ_k}^T, K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrised by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure where the final selected inducing points are spread out in different clusters of data. On the right side of the image, the round dots represent the data points and each colour corresponds to a different label. The left part of the image shows how optimised inducing images cover examples from all classes as opposed to the randomised inducing points where each example could have a skewed number of points from the same class.<br />
<br />
[[File:inducing-points-extended.png|centre]]<br />
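The trace criterion <math>\mathcal{T}(Z_k)</math> can be evaluated directly from kernel matrices. The sketch below uses a squared-exponential kernel on one-dimensional inputs purely for illustration (the paper's kernel is built from network features), and compares inducing points that are spread out with inducing points that are clustered together. The function names and the jitter value are assumptions.<br />

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Illustrative squared-exponential kernel on 1-D inputs (not the paper's kernel)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def trace_criterion(x, z, kernel=rbf, jitter=1e-6):
    """tr(K_X - K_XZ K_Z^{-1} K_ZX), computed through a Cholesky factor of K_Z."""
    k_xx_diag = np.diag(kernel(x, x))
    K_xz = kernel(x, z)
    K_zz = kernel(z, z) + jitter * np.eye(len(z))
    L = np.linalg.cholesky(K_zz)
    A = np.linalg.solve(L, K_xz.T)           # L^{-1} K_ZX
    return float(np.sum(k_xx_diag) - np.sum(A**2))

x = np.linspace(0.0, 10.0, 50)
z_spread = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
z_clustered = np.array([0.0, 0.2, 0.4, 0.6, 0.8])

t_spread = trace_criterion(x, z_spread)
t_clustered = trace_criterion(x, z_clustered)
print(t_spread, t_clustered)
```

The spread-out set gives the smaller value because it leaves less prior variance unexplained across the inputs, which matches the intuition that the criterion favours inducing points covering the whole input space.<br />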
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_ix_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_ix_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_ix_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasise that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete prediction.<br />
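The predictive mean and variance of <math>f_{i,*}</math> given above can be computed as in the following sketch. It assumes the feature-based kernel <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math> from the Methods section; the function name <code>predict_f</code>, the jitter term, and the random inputs are illustrative assumptions.<br />

```python
import numpy as np

def predict_f(phi_star, Phi_Z, mu_u, L_u, sigma_w=1.0, jitter=1e-6):
    """Mean and variance of q(f_*) for a test feature vector phi_star."""
    M = Phi_Z.shape[0]
    K_Z = sigma_w**2 * Phi_Z @ Phi_Z.T + jitter * np.eye(M)
    k_star = sigma_w**2 * Phi_Z @ phi_star          # k_{Z x_*}
    alpha = np.linalg.solve(K_Z, k_star)            # K_Z^{-1} k_{Z x_*}
    mean = mu_u @ alpha
    S_u = L_u @ L_u.T
    var = sigma_w**2 * phi_star @ phi_star + alpha @ (S_u - K_Z) @ alpha
    return mean, var

rng = np.random.default_rng(2)
Phi_Z = rng.standard_normal((3, 5))     # stand-in features at three inducing points
phi_star = rng.standard_normal(5)       # stand-in feature vector of the test input
mu_u = rng.standard_normal(3)
L_u = 0.1 * np.eye(3)                   # a confident (low-variance) task summary

mean, var = predict_f(phi_star, Phi_Z, mu_u, L_u)
print(mean, var)
```

Since <code>Phi_Z</code> and <code>phi_star</code> are recomputed from the current <math>\theta</math>, the prediction for an old task changes as the shared features change, which is the point made in the paragraph above.<br />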
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in the practical setting. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density of a new task data point and the prior GP density using symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math>: the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameters, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
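The batch comparison can be sketched with a hand-written Welch's t-test, shown below using only the standard library. The score values are made up for illustration, and the decision threshold that would turn the statistic into a task-switch signal is left open, since the summary does not specify it.<br />

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    se2_a = statistics.variance(a) / len(a)     # squared standard error of each mean
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a**2 / (len(a) - 1) + se2_b**2 / (len(b) - 1)
    )
    return t, dof

# Hypothetical surprise scores: the current batch has much smaller values than the old one.
scores_old = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3]
scores_new = [1.2, 0.7, 1.9, 1.1, 0.4, 1.5]

t_stat, dof = welch_t(scores_new, scores_old)
print(t_stat, dof)    # a strongly negative statistic suggests the scores have dropped
```

A large negative statistic for the current batch relative to the previous one is the kind of evidence that would indicate a new task has started.<br />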
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimised using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes more and more important when the number of inducing points reduces.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularisation-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way of remembering previous tasks by penalising the KL divergence between the variational distribution <math>q(\boldsymbol{u}_i)</math> and the prior <math>p_\theta(\boldsymbol{u}_i)</math> of each earlier task. The ideas in the paper are interesting, and the experiments support the effectiveness of this approach. After reading the summary, the following points came to mind:<br />
<br />
The main problem with Gaussian processes is their test-time computational load: a Gaussian process needs the data matrix and a kernel evaluation for each prediction. This is natural, since a Gaussian process is non-parametric and has no source of knowledge other than the data, but the resulting computational and memory costs make it difficult to employ in practice. In this paper, the authors propose to use a subset of the training data, called "Inducing Points", to mitigate these challenges. They choose the Inducing Points either at random or through an optimisation scheme in which the Inducing Points should approximate the kernel. Although the authors formulate the selection of Inducing Points in their own setting, this is not a new approach in the field; it was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper uses the kernel matrix to find the inducing points, whereas the paper mentioned above uses pairwise dissimilarities to find exemplars.<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice covariance matrices are positive semi-definite in general while to the best of my knowledge Cholesky decomposition can be used for positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.<br />
<br />
Finally, the number of regularisation terms in <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, so I am not sure how well this algorithm works as the number of tasks increases. Apart from the computational cost, having many regularisation terms can make optimisation more difficult.<br />
<br />
The experiments are interesting and reasonably thorough, and they do a good job of highlighting different facets of the paper, but it would be better if two additional results were provided: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.<br />
<br />
[6] Kuo, Hui-Hsiung, 2006, Introduction to Stochastic Integration, Springer, Berlin Heidelberg.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=49497DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-12-06T20:48:01Z<p>Jlavilez: Markovianity</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning (RL) is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning. It refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalising' the network based on its behaviour over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviours purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. The proposed method is based on model-free RL with latent state representation that is learned via prediction. The authors have changed the belief representations to learn a critic directly on latent state samples which help to enable scaling to more complex tasks. <br />
<br />
The main finding of the paper is that long-horizon behaviours can be learned by latent imagination. This avoids the short-sightedness that comes with using finite imagination horizons. The authors have also managed to demonstrate empirical performance for visual control by evaluating the model on image inputs.<br />
<br />
[[File:Figure1 paper.png|100px|center]]<br />
<br />
=== Preliminaries ===<br />
<br />
This section defines a few key concepts in reinforcement learning. In the typical reinforcement learning problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take are defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally, we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and the action-value function of a given policy <math display="inline">\pi</math>, respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly, <math>Q_{\pi}</math> gives the value of the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>. <br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that, for every state <math>s</math> and action <math>a</math>, <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
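<br />
To make the roles of <math>V_{\pi}</math> and <math>Q_{\pi}</math> concrete, the following is a minimal sketch of iterative policy evaluation on a toy two-state problem. The states, actions, rewards, and the discount factor <code>GAMMA</code> are all illustrative assumptions and do not come from the paper.<br />
<br />
```python
# Toy two-state MDP; every number here is an illustrative assumption.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
GAMMA = 0.9  # discount factor (assumed; the text does not specify one)


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if action == "move":
        next_state = "s1" if state == "s0" else "s0"
    else:
        next_state = state
    reward = 1.0 if next_state == "s1" else 0.0
    return next_state, reward


def evaluate_policy(policy, sweeps=500):
    """Iterative policy evaluation: V(s) <- r + GAMMA * V(s')."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            s_next, r = step(s, policy[s])
            v[s] = r + GAMMA * v[s_next]
    return v


def action_value(v, state, action):
    """Q_pi(s, a): take `action` once, then follow the policy behind `v`."""
    s_next, r = step(state, action)
    return r + GAMMA * v[s_next]


policy = {"s0": "move", "s1": "stay"}
v_pi = evaluate_policy(policy)
q_pi = {(s, a): action_value(v_pi, s, a) for s in STATES for a in ACTIONS}
```
<br />
In this toy problem the chosen policy is greedy with respect to its own action values, which is the fixed-point property that an optimal policy satisfies.<br />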
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents interact with the environment sequentially, producing a sequence of states, actions, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call each tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>step</b> and the full sequence up to time <math display="inline">T</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
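<br />
The feedback loop can be sketched as a short rollout. The environment (a walk on the integers with a goal at 3) and the random policy below are hypothetical stand-ins chosen only to show how the sequence of states, actions, and rewards is produced.<br />
<br />
```python
import random


def toy_env_step(state, action):
    """Hypothetical environment: a walk on the integers with a goal at 3."""
    next_state = state + action
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


def random_policy(state, rng):
    """Placeholder policy that ignores the state and picks -1 or +1."""
    return rng.choice([-1, 1])


def run_episode(horizon, seed=0):
    """Collect (S_t, A_t, R_t) tuples, then return them with the last state."""
    rng = random.Random(seed)
    state = 0
    trajectory = []
    for _ in range(horizon):
        action = random_policy(state, rng)
        next_state, reward = toy_env_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory, state


trajectory, final_state = run_episode(horizon=5)
```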
<br />
== Motivation ==<br />
<br />
In many problems, the number of actions an agent is able to take is limited, which makes it difficult to interact with the environment enough to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that an action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan ahead and perceive a representation of the environment without interacting with it. Once an action is taken, the agent is able to update its representation of the world using the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. At a high level, Dreamer first learns latent dynamics from past experience, then learns action and value models from imagined trajectories to maximise future rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px | center]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of the following models:<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
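<br />
As a rough illustration of how the transition, reward, action, and value models combine during imagination, the sketch below rolls a latent state forward without touching the environment. The four functions are hand-written scalar stand-ins (assumptions), not the learned neural networks, and the undiscounted sum at the end is a simplification rather than the paper's objective.<br />
<br />
```python
def transition_model(state, action):
    """Stand-in for q_theta(s_t | s_{t-1}, a_{t-1}); a point prediction."""
    return 0.9 * state + action


def reward_model(state):
    """Stand-in for q_theta(r_t | s_t): the reward is highest at state 0."""
    return -state * state


def action_model(state):
    """Stand-in for q_phi(a_t | s_t): push the state towards 0."""
    return -0.5 * state


def value_model(state):
    """Stand-in for v_psi(s_t): a crude guess of the reward still to come."""
    return -2.0 * state * state


def imagine(start_state, horizon):
    """Generate imagined (action, reward, next state) tuples in latent space."""
    state = start_state
    imagined = []
    for _ in range(horizon):
        a_hat = action_model(state)
        r_hat = reward_model(state)
        next_state = transition_model(state, a_hat)
        imagined.append((a_hat, r_hat, next_state))
        state = next_state
    # Rewards beyond the horizon are summarised by the value model.
    return imagined, value_model(state)


imagined, bootstrap = imagine(start_state=1.0, horizon=3)
# Simplified score: undiscounted imagined rewards plus the bootstrap value.
estimate = sum(r_hat for _, r_hat, _ in imagined) + bootstrap
```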
<br />
The three main components of agent learning in imagination are dynamics learning, behaviour learning, and environment interaction. In the compact latent space of the world model, the behaviour is learned by predicting hypothetical trajectories. Throughout the agent's lifetime, Dreamer performs the following operations either in parallel or interleaved as shown in Figure 3 and Algorithm 1:<br />
<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behaviour Learning: In the latent space, the agent predicts state values and actions that maximize future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:ashraf98.png|frameless|700px|Dreamer algorithm|center]]<br />
<br />
Notice that three neural networks are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively. The action model tries to solve the imagination environment by predicting various actions. Meanwhile, the value model estimates the expected rewards that the action model will achieve. Hence, these two models are trained cooperatively whereby the action model tries to maximize the estimated value while the value model gives the estimate based on the action model's actions.<br />
<br />
=== The Markovianity Question ===<br />
<br />
The paper formulates visual control as a so-called Partially Observable Markov Decision Process (POMDP) in discrete time. Since the goal is for an agent to maximise its sum of rewards in a Markovian setting, this puts the model squarely in the category of reinforcement learning. In this subsection we provide a lengthier discussion on this Markovian assumption.<br />
<br />
Note that the transition distributions provided in the representation and transition models are Markovian in the states <math>s_t</math> and actions <math>a_t</math>. This mimics the dynamics in non-linear Kalman filters and hidden Markov models. These techniques are described in the papers by Rabiner and Juang [5] as well as Kalman [6]. The difference from these presentations is that the latent dynamics are conditioned on actions and attempt to predict rewards, which allows the agent to imagine, yet not execute, actions in the provided environment.<br />
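<br />
The analogy with hidden Markov models can be illustrated with a discrete, action-conditioned filter. The sketch below is an assumption-laden toy (two latent states, two observations, made-up probabilities): <code>predict</code> plays the role of the transition model, and <code>correct</code> adds the observation in the way the representation model does.<br />
<br />
```python
# Toy action-conditioned hidden Markov model; all probabilities are assumed.
# TRANSITION[action][i][j] = P(s_t = j | s_{t-1} = i, a_{t-1} = action)
TRANSITION = {
    "stay": [[0.9, 0.1], [0.1, 0.9]],
    "flip": [[0.2, 0.8], [0.8, 0.2]],
}
# EMISSION[j][o] = P(o_t = o | s_t = j)
EMISSION = [[0.8, 0.2], [0.3, 0.7]]


def predict(belief, action):
    """Transition-only update: uses s_{t-1} and a_{t-1}, no observation."""
    matrix = TRANSITION[action]
    return [sum(belief[i] * matrix[i][j] for i in range(2)) for j in range(2)]


def correct(predicted, observation):
    """Fold in o_t and renormalise, giving the filtered belief over s_t."""
    unnormalised = [predicted[j] * EMISSION[j][observation] for j in range(2)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


belief = [1.0, 0.0]
prior = predict(belief, "flip")            # imagination: no observation needed
posterior = correct(prior, observation=1)  # filtering: observation available
```
<br />
Because each update depends only on the previous belief and action, no longer history needs to be stored, which is the short-memory property discussed in this subsection.<br />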
<br />
This short memory assumption is useful from a computational perspective, as it makes the problem tractable. It is also realistic, as an intelligent agent does not need the entire history of its environment going back all the way to the Big Bang to understand a situation it has not encountered before. We commend the team at UofT and Google Brain for this insight, as it makes their analysis reasonable and easy to understand.<br />
<br />
<br />
== Related Works ==<br />
<br />
Previous work that exploited latent dynamics can be grouped into three categories:<br />
<br />
* Visual Control with latent dynamics by derivative-free policy learning or online planning.<br />
* Augment model-free agents with multi-step predictions.<br />
* Use analytic gradients of Q-values.<br />
<br />
While the latter approaches are often used for low-dimensional tasks, Dreamer uses analytic gradients to efficiently learn long-horizon behaviours for visual control purely by latent imagination.<br />
<br />
== Results ==<br />
The following figure plots the reward against the number of environment steps. Dreamer outperforms the other baseline algorithms and also converges considerably faster. <br />
[[File:dreamer.paper19.png|center|frameless|500px|Rewards vs environment steps of Dreamer and other baseline algorithms]]<br />
<br />
<br />
The figure below summarises Dreamer's performance compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyperparameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data efficiency, computation time, and final performance, and overall it achieves the most consistent performance among them. Additionally, while other agents rely heavily on prior experience, Dreamer is able to learn behaviours with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|center|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications, as many agents rely on prior experience which may be hard to obtain in the real world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries without enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. As future work on representation learning, the ability to scale latent imagination to environments of higher visual complexity can be investigated.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at https://github.com/google-research/dreamer. <br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
The description of the model components in Appendix A mentions that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" have been used for distributions in latent space. The paper would have benefitted from explaining how the authors arrived at this architecture; in other words, how the choice of latent vector affects the performance of the agent.<br />
<br />
Another fact about Dreamer is that it learns long-horizon behaviours purely by latent imagination, unlike previous approaches. It is also applicable to tasks with discrete actions and early episode termination.<br />
<br />
<br />
Learning a policy from visual inputs is quite an interesting research direction in RL. This paper takes a step in this direction by improving existing model-based methods (the world models and PlaNet) using the actor-critic approach, but in my view the method is an incremental contribution, as back-propagating gradients through values and dynamics has been studied in previous works.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviours by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.<br />
<br />
[5] Rabiner, Lawrence, and B. Juang. "An introduction to hidden Markov models." IEEE ASSP magazine 3.1 (1986): 4-16.<br />
<br />
[6] Kalman, Rudolph Emil. "A new approach to linear filtering and prediction problems." (1960): 35-45.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=49461orthogonal gradient descent for continual learning2020-12-06T20:05:54Z<p>Jlavilez: </p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Presented By == <br />
Parsa Torabian<br />
<br />
== Introduction == <br />
Neural networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN to look at only one label of CIFAR10 at a time results in poor performance for the initially trained labels (catastrophic forgetting). But that same CNN will perform really well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real-world applications of machine learning. For example, a medical imaging classifier might be able to classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases, such as local, rare, or new diseases (like Covid-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD accounts for previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto the subspace orthogonal to that space, minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles that can help the reader get acquainted with the subject ([4], [9], [10] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion-based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations. One of the downsides of this approach is that the model grows with the number of tasks. There are also regularization-based methods, which constrain weight updates according to some importance measure for previous tasks. Finally, there are the repetition-based methods. These models attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning. This can be done by using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight of OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly computed gradients onto the subspace orthogonal to the gradients stored for previous tasks. Such a subspace is non-trivial because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for a parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
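<br />
The projection idea can be sketched in a few lines. The example below assumes an orthonormal basis for the stored task A gradients is already available; the vectors, the learning rate, and the function names are illustrative and are not taken from the authors' implementation.<br />
<br />
```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(gradient, basis):
    """Remove from `gradient` its components along each orthonormal vector."""
    projected = list(gradient)
    for b in basis:
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected


# Orthonormal basis for the span of stored task A gradients (toy values).
basis_task_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad_task_b = [0.5, -2.0, 3.0]

update_direction = project_out(grad_task_b, basis_task_a)
learning_rate = 0.1  # illustrative value
weights = [1.0, 1.0, 1.0]
weights = [w - learning_rate * g for w, g in zip(weights, update_direction)]
```
<br />
Only the third weight changes here, because the first two coordinates are exactly the directions that task A's gradients occupy.<br />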
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, the authors have also done experiments where they only keep track of the gradient with respect to the ground truth logit (OGD-GTL) and with the logits averaged (OGD-AVE). OGD-ALL stores N*C gradient vectors, where N is the size of the previous task and C is the number of classes. OGD-AVE and OGD-GTL store only N gradient vectors, since the class logits are either averaged or ignored, respectively. To further manage memory, the authors sample from all the gradients of the old task, and they find that 200 is sufficient, with diminishing returns when using more.<br />
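<br />
To see why the three variants differ in memory, consider a hypothetical linear model whose logits are <code>W @ x</code>; for such a model the gradient of logit <code>j</code> with respect to the flattened weights is simply <code>x</code> placed in row <code>j</code>. The sketch below, under that assumption, shows which vectors each variant would store for a single sample.<br />
<br />
```python
def logit_gradient(x, num_classes, j):
    """Gradient of logit j of a linear model W @ x w.r.t. the flattened W."""
    grad = []
    for row in range(num_classes):
        grad.extend(x if row == j else [0.0] * len(x))
    return grad


def stored_gradients(x, label, num_classes, variant):
    """Vectors kept for one training sample under each OGD variant."""
    per_logit = [logit_gradient(x, num_classes, j) for j in range(num_classes)]
    if variant == "ALL":  # one vector per logit
        return per_logit
    if variant == "GTL":  # only the ground-truth logit
        return [per_logit[label]]
    if variant == "AVE":  # gradient of the mean logit
        size = len(per_logit[0])
        mean = [sum(g[k] for g in per_logit) / num_classes for k in range(size)]
        return [mean]
    raise ValueError(variant)


sample = [1.0, 2.0]
counts = {
    variant: len(stored_gradients(sample, label=1, num_classes=3, variant=variant))
    for variant in ("ALL", "GTL", "AVE")
}
```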
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. One such algorithm which can be utilized to improve numerical stability is the modified Gram-Schmidt Orthogonalisation. The issue with the simpler Gram-Schmidt algorithm can be seen in the following:<br />
<br />
Let <math>A</math> be a real square matrix. This matrix admits a QR decomposition, namely <math>A=\hat{Q}\hat{R}</math>, where <math>\hat{Q}</math> is orthogonal and <math>\hat{R}</math> is upper triangular. The existence of a QR decomposition can be proved constructively using the Gram-Schmidt algorithm. During the algorithm, the columns of <math>\hat{Q}</math> are computed sequentially, where <math>\hat{\vec{q_j}}</math> is the <math>j^{th}</math> column of <math>\hat{Q}</math>, and the entries <math>\hat{r_{ij}}</math> in the <math>i^{th}</math> row and <math>j^{th}</math> column of <math>\hat{R}</math> are computed from left to right and top to bottom, only for the entries on or above the diagonal, so that <math>\hat{R}</math> is upper triangular. Consider the computation of the third column of <math>\hat{Q}</math>, which (before normalisation to unit length) is <math>\hat{\vec{q_{3}}}=\vec{a_3} - (\hat{\vec{q_1}}^T\vec{a_3})\hat{\vec{q_1}} - (\hat{\vec{q_2}}^T\vec{a_3})\hat{\vec{q_2}}</math>. In exact arithmetic, the partial result <math> \vec{z_3}=\vec{a_3} - (\hat{\vec{q_1}}^T\vec{a_3})\hat{\vec{q_1}} </math> has no component in the direction <math> \hat{\vec{q_1}}</math>; however, due to rounding errors and catastrophic cancellation [11], this is not always true in floating-point arithmetic. When <math>\vec{z_3}</math> retains a component in this direction, the columns of <math>\hat{Q}</math> lose orthogonality. To remedy this problem, the modified Gram-Schmidt algorithm replaces <math>\vec{a_3}</math> with <math>\vec{z_3}</math> in <math>(\hat{\vec{q_2}}^T\vec{a_3})\hat{\vec{q_2}}</math>. This helps preserve the orthogonality of the columns of <math>\hat{Q}</math> despite the loss of significance, because each projection is computed from the vector that already carries the rounding error.<br />
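<br />
The difference between the classical and modified Gram-Schmidt variants can be checked numerically. The sketch below uses a nearly rank-deficient test matrix of our own choosing (three columns that differ only by entries of size <code>1e-8</code>); it is an illustration and is not taken from the OGD paper.<br />
<br />
```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def gram_schmidt(columns, modified):
    """Orthonormalise `columns`; `modified` selects the modified variant."""
    basis = []
    for a in columns:
        v = list(a)
        for q in basis:
            # Classical: project the original column a.
            # Modified: project the partially orthogonalised vector v.
            coeff = dot(q, v) if modified else dot(q, a)
            v = [vi - coeff * qi for vi, qi in zip(v, q)]
        basis.append(normalise(v))
    return basis


EPS = 1e-8  # small enough that 1 + EPS**2 rounds to 1 in double precision
columns = [[1.0, EPS, 0.0, 0.0], [1.0, 0.0, EPS, 0.0], [1.0, 0.0, 0.0, EPS]]

q_classical = gram_schmidt(columns, modified=False)
q_modified = gram_schmidt(columns, modified=True)

# Inner product of the 2nd and 3rd vectors; 0 means perfectly orthogonal.
loss_classical = abs(dot(q_classical[1], q_classical[2]))
loss_modified = abs(dot(q_modified[1], q_modified[2]))
```
<br />
On this input the classical variant returns second and third vectors that are far from orthogonal, while the modified variant keeps their inner product close to zero.<br />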
<br />
Note that this procedure can be trivially extended to complex square matrices, but in this case the matrix <math>Q</math> becomes unitary, i.e. <math>Q^* Q = QQ^* = I</math>; this yields an easy extension of the orthogonal gradient descent algorithm for complex neural networks.<br />
<br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png|centre]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px|centre]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, OGD-GTL, and OGD-ALL are compared to SGD, EWC [2] (a regularization method using Fisher information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground truth "cheat" model which has access to all data throughout training). The experiments were performed for the following three continual learning benchmarks: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task is a fixed permutation that gets applied to each MNIST digit. The figure below compares the performance of the different methods on permuted MNIST. The comparison is made based on accuracy across 3 different tasks. Training is done for 15 epochs (5 for each of the three permutations). The switch in permutations is indicated in the graph with vertical lines.<br />
<br />
[[File:PMNIST_perf.PNG|centre]]<br />
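<br />
For readers unfamiliar with the benchmark, the following minimal sketch shows how permuted MNIST tasks might be constructed: each task fixes one pixel permutation and applies it to every flattened 28x28 image. The seeds and the stand-in image are assumptions made for illustration.<br />
<br />
```python
import random


def make_permutation(num_pixels, seed):
    """A fixed pixel permutation identifying one task (seeded for reuse)."""
    order = list(range(num_pixels))
    random.Random(seed).shuffle(order)
    return order


def permute_image(flat_image, permutation):
    """Apply the task's permutation to one flattened image."""
    return [flat_image[i] for i in permutation]


NUM_PIXELS = 28 * 28
task_permutations = [make_permutation(NUM_PIXELS, seed) for seed in range(5)]

image = list(range(NUM_PIXELS))  # stand-in for a flattened MNIST digit
permuted = permute_image(image, task_permutations[1])
```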
<br />
The following tables show classification performance for each task after sequentially training on all the tasks. Thus, if catastrophic forgetting has been solved, the accuracies should be constant across tasks. If not, then there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG|centre]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. The following figure shows the accuracies of different methods when trained on Rotated MNIST with different degrees. Each method is trained for 10 epochs (5 on standard MNIST and 5 on rotated MNIST) and predictions are made over the original MNIST. Each accuracy bar is a mean over 10 runs.<br />
<br />
[[File:RMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:RMNIST.PNG|centre]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. The following figure shows the accuracies of different methods when trained on Split MNIST.<br />
<br />
[[File:SMNIST_perf.PNG|centre]]<br />
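<br />
A minimal sketch of how the split MNIST tasks could be formed is given below. Grouping the ten digits into consecutive pairs is an assumption made for illustration; the text above only states that the five tasks have mutually disjoint labels.<br />
<br />
```python
def split_labels(num_classes=10, num_tasks=5):
    """Partition class labels into consecutive, disjoint groups."""
    per_task = num_classes // num_tasks
    return [list(range(t * per_task, (t + 1) * per_task)) for t in range(num_tasks)]


def task_subset(dataset, labels):
    """Keep only the (image, label) pairs whose label belongs to this task."""
    allowed = set(labels)
    return [(x, y) for x, y in dataset if y in allowed]


tasks = split_labels()
toy_dataset = [("img%d" % i, i % 10) for i in range(20)]  # placeholder data
first_task_data = task_subset(toy_dataset, tasks[0])
```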
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:SMNIST.PNG|centre]]<br />
<br />
Also, the below table corresponds to the performance of Rotated MNIST and Permuted MNIST as a function of the number of gradients stored.<br />
<br />
[[File:ogd.png|centre]]<br />
<br />
Overall, OGD performs much better than EWC, A-GEM, and SGD. The primary metric to look for is decreasing performance in the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario, shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe as the decrease seen when naively doing SGD. OGD performs much better than EWC and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small, in order to respect the assumption that orthogonal gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future studies based on this technique. First, finding a way to store more gradients or to prioritize the important directions could improve results. Secondly, all the compared methods, including this one, fail when the tasks are dissimilar. Finding ways to maintain performance under task dissimilarity can be an interesting research direction. Thirdly, solving for learning rate sensitivity will make this method more appealing when a large learning rate is desired. Finally, another interesting direction for future work is extending the current method to other types of optimizers such as Adam and Adagrad, or even second-order or quasi-Newton methods.<br />
<br />
One interesting way of increasing the learning rate is to consider the gradient magnitude of the parameters on data from the former task. If, for some specific parameters, the gradient magnitude on data from task A is low, then intuitively these parameters have not captured much information from task A. With this in mind, we can at least increase the learning rate for updating these weights so that they can be used for task B.<br />
<br />
A valuable resource for continual learning is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Critique == <br />
The authors proposed an interesting idea for mitigating the catastrophic forgetting that is likely to happen in the online learning setting. Although Orthogonal Gradient Descent achieves state-of-the-art results in practice for continual learning, the authors have not provided a theoretical guarantee. [12] have derived the first generalization guarantees for the algorithm OGD for continual learning, for overparameterized neural networks. [12] also showed that OGD is only robust to catastrophic forgetting across a single task, while for an arbitrary number of tasks they proposed OGD+.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.<br />
<br />
[11] Wikipedia: https://en.wikipedia.org/wiki/Loss_of_significance<br />
<br />
[12] Bennani, Mehdi Abbana, and Masashi Sugiyama. "Generalisation guarantees for continual learning with orthogonal gradient descent." arXiv preprint arXiv:2006.11942 (2020).</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=49457Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-12-06T19:51:04Z<p>Jlavilez: Commented on related work</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularisation techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavour to analyse, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimisation of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee the robustness of convergence in neural network training. In essence, the accompanying PDE model can be used as a regularisation agent, constraining the space of acceptable solutions to help the optimisation converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime: if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data, which makes it necessary to include information from the PDE in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimisation is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimisation, information from both the known data and the known physics (from the PDE) can be incorporated into the neural network. This effectively regularises the optimisation, allowing the network to learn from a smaller number of data points than would otherwise be necessary. This method is demonstrated in the continuous-time example below, which includes figures 1 and 2.<br />
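<br />
The two-part loss can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy PDE <math display="inline"> u_t + u_x = 0 </math>, the function <code>u_candidate</code> standing in for the neural network, and the use of central finite differences in place of automatic differentiation are all choices made here, not taken from the paper.<br />
<br />
```python
import math
import random

def u_candidate(t, x, speed=1.0):
    """Stand-in for the network u(t, x); exact for u_t + u_x = 0 when speed == 1."""
    return math.sin(x - speed * t)

def pde_residual(u, t, x, h=1e-4):
    """f = u_t + N[u] with N[u] = u_x, via central differences (not autodiff)."""
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_x = (u(t, x + h) - u(t, x - h)) / (2 * h)
    return u_t + u_x

def pinn_loss(u, data):
    """MSE_u + MSE_f evaluated on (t, x, u_observed) triples."""
    n = len(data)
    mse_u = sum((u(t, x) - u_obs) ** 2 for t, x, u_obs in data) / n
    mse_f = sum(pde_residual(u, t, x) ** 2 for t, x, _ in data) / n
    return mse_u + mse_f

random.seed(0)
points = [(random.uniform(0, 1), random.uniform(-1, 1)) for _ in range(50)]
data = [(t, x, math.sin(x - t)) for t, x in points]  # noiseless synthetic data

# A candidate that satisfies both data and PDE versus one that satisfies neither
loss_good = pinn_loss(lambda t, x: u_candidate(t, x, 1.0), data)
loss_bad = pinn_loss(lambda t, x: u_candidate(t, x, 2.0), data)
print(loss_good, loss_bad)
```
<br />
The candidate that matches both the data and the PDE yields a loss close to zero, while the mismatched candidate is penalised by both terms.<br />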
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
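<br />
As a rough sketch of the general form above, the following Python function performs one step of an explicit scheme written in the same stage-value notation. The toy problem <math display="inline"> u_t + u = 0 </math> (an ODE with no spatial dependence) and the classical four-stage tableau are assumptions chosen for illustration; the paper itself relies on implicit schemes with many more stages.<br />
<br />
```python
import math

def runge_kutta_step(N, u_n, dt, a, b):
    """One explicit q-stage step for u_t + N[u] = 0 in stage-value form.

    a is a strictly lower-triangular q x q list of lists, b a list of q weights.
    """
    q = len(b)
    stages = []
    for i in range(q):
        # u^{n+c_i} = u^n - dt * sum_j a_ij N[u^{n+c_j}]; only j < i contributes
        stages.append(u_n - dt * sum(a[i][j] * N(stages[j]) for j in range(i)))
    # u^{n+1} = u^n - dt * sum_j b_j N[u^{n+c_j}]
    return u_n - dt * sum(b[j] * N(stages[j]) for j in range(q))

# Classical four-stage tableau (RK4)
A = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]]
B = [1 / 6, 1 / 3, 1 / 3, 1 / 6]

# Toy problem u_t + u = 0, so N[u] = u and the exact solution is exp(-t)
u_next = runge_kutta_step(lambda u: u, 1.0, 0.1, A, B)
print(u_next, math.exp(-0.1))
```
<br />
For an implicit scheme the matrix of coefficients is no longer strictly lower triangular, so the stages would have to be solved for jointly rather than computed one after another.<br />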
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme will need to be inverted and solved for the initial and final states as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
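<br />
The inversion of the Runge-Kutta scheme mentioned above can be written out by rearranging the general form given earlier in this section. Here <math display="inline"> j </math> indexes the stage producing the prediction and <math display="inline"> k </math> is the summation index; this index naming is chosen here to match the loss terms and is not taken from the paper:<br />
<br />
\begin{align*}<br />
u^n_j &= u^{n+c_j} + \Delta t \sum^q_{k=1} a_{jk} N[u^{n+c_k}], ~ j = 1,...,q \\<br />
u^{n+1}_j &= u^{n+c_j} + \Delta t \sum^q_{k=1} (a_{jk} - b_k) N[u^{n+c_k}], ~ j = 1,...,q<br />
\end{align*}<br />
<br />
The first line follows directly from the stage equation, and the second from substituting the first into the update for <math display="inline"> u^{n+1} </math>.<br />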
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
Having answered the first question, we can turn our focus to the second. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, relying on an assumed form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with the fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regulariser. The neural network training procedure is, in essence, unchanged except that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
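<br />
The idea of treating a PDE coefficient as a trainable parameter can be sketched as follows. This is a deliberately reduced illustration: the toy PDE <math display="inline"> u_t + \lambda u_x = 0 </math>, the value <code>TRUE_SPEED</code>, and the closed-form derivatives are assumptions made here, and the solution is held fixed rather than trained jointly as a network.<br />
<br />
```python
import math
import random

TRUE_SPEED = 2.0  # hypothetical "unknown" parameter used to generate the data

def residual(lam, t, x):
    """f = u_t + lam * u_x for u(t, x) = sin(x - TRUE_SPEED * t), in closed form."""
    u_t = -TRUE_SPEED * math.cos(x - TRUE_SPEED * t)
    u_x = math.cos(x - TRUE_SPEED * t)
    return u_t + lam * u_x

def fit_parameter(points, lam=0.0, lr=0.5, steps=200):
    """Gradient descent on MSE_f with respect to lam only."""
    n = len(points)
    for _ in range(steps):
        # d/dlam mean(f^2) = mean(2 * f * u_x)
        grad = sum(2 * residual(lam, t, x) * math.cos(x - TRUE_SPEED * t)
                   for t, x in points) / n
        lam -= lr * grad
    return lam

random.seed(1)
points = [(random.uniform(0, 1), random.uniform(-1, 1)) for _ in range(100)]
lam_hat = fit_parameter(points)
print(lam_hat)
```
<br />
In a full PINN the same gradient step would also update the network weights, with the data term of the loss keeping the solution anchored to the measurements.<br />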
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimising the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimiser.<br />
<br />
==== L-BFGS optimiser [5] ====<br />
<br />
L-BFGS is an optimisation algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimise f(x) over unconstrained values of the real vector x, where f is a differentiable scalar function. Whereas the original BFGS method stores a dense approximation to the inverse Hessian, L-BFGS stores only a few vectors that represent this approximation implicitly.<br />
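<br />
A minimal sketch of the limited-memory idea is given below, using the standard two-loop recursion with a simple backtracking line search. The function names, the memory size, and the quadratic test problem are assumptions made for illustration; production implementations use more careful line searches and safeguards.<br />
<br />
```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: approximates (inverse Hessian) times grad."""
    q = list(grad)
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):  # newest to oldest
        alpha = dot(s, q) / dot(y, s)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if s_hist:
        # Initial scaling from the curvature along the most recent step
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    for s, y, alpha in zip(s_hist, y_hist, reversed(alphas)):  # oldest to newest
        beta = dot(y, q) / dot(y, s)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return q

def backtrack(f, x, g, d, step=1.0, c=1e-4):
    """Halve the step until the Armijo sufficient-decrease condition holds."""
    fx, slope = f(x), dot(g, d)
    while step > 1e-12:
        trial = [xi + step * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c * step * slope:
            break
        step *= 0.5
    return [xi + step * di for xi, di in zip(x, d)]

def lbfgs_minimise(f, grad_f, x, memory=5, max_iter=100, tol=1e-8):
    s_hist, y_hist = [], []
    g = grad_f(x)
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        d = [-di for di in lbfgs_direction(g, s_hist, y_hist)]
        x_new = backtrack(f, x, g, d)
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(y, s) > 0.0:  # keep only pairs satisfying the curvature condition
            s_hist, y_hist = (s_hist + [s])[-memory:], (y_hist + [y])[-memory:]
        x, g = x_new, g_new
    return x

f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad_f = lambda x: [2 * (x[0] - 1), 20 * (x[1] + 2)]
x_min = lbfgs_minimise(f, grad_f, [0.0, 0.0])
print(x_min)
```
<br />
Only the most recent <code>memory</code> pairs of step and gradient differences are kept, so the storage cost grows linearly with the number of parameters rather than quadratically.<br />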
<br />
=== Results ===<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training marked. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times, showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) &= -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) &= -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
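<br />
Two consequences of this setup can be written out explicitly. First, defining the velocity through <math display="inline"> \psi </math> makes the field divergence-free by construction. Second, moving all terms of the two momentum equations to one side gives the residuals that enter the PDE part of the loss, following the same convention as <math display="inline"> f </math> in the continuous-time section (the name <math display="inline"> g </math> for the second residual is notation assumed here):<br />
<br />
\begin{align*}<br />
u_x + v_y &= \psi_{yx} - \psi_{xy} = 0, \\<br />
f &= u_t + \lambda_1 (uu_x + vu_y) + p_x - \lambda_2 (u_{xx} + u_{yy}), \\<br />
g &= v_t + \lambda_1 (uv_x + vv_y) + p_y - \lambda_2 (v_{xx} + v_{yy}).<br />
\end{align*}<br />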
<br />
We allow ourselves 1% of the total data and optimise the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimisation can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and builds a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why does the optimiser not get trapped in local optima for the parameters of the differential operators? Can weight initialisation and data normalisation be improved? Why do these methods seem to be so robust to noise in the data? How can uncertainty in the predictions be interpreted, a question that leads to the concept of interpretable AI? Answering these questions would be a natural next step for this research direction.<br />
<br />
In this paper, a quasi-Newton optimiser has been used to update the parameters. Although quasi-Newton methods are more powerful than first-order optimisers, their computational load means they are not the common choice in today's deep learning packages. Considering this, can first-order optimisers handle updating the weights in such a problem, or would they get stuck in local minima? The paper reports no such experiment.<br />
<br />
==Related Work==<br />
<br />
Finding (weak) solutions to differential equations using neural networks which take the differential operator as a loss function is not a new idea; the novelty arises from combining classical numerical methods from DE theory with the previous work in solving these differential equations.<br />
<br />
A natural question to ask is what happens in higher dimensions with regards to the curse of dimensionality. It is known that this becomes a big issue with numerical schemes such as finite element methods. A recent paper tackles this question in relation to high-dimensional versions of the Black-Scholes, Hamilton-Jacobi-Bellman, and Allen-Cahn equations [6], which can be found here: https://arxiv.org/abs/1707.02568.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that uses existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs<br />
<br />
[5] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimisation." Mathematical programming 45.1-3 (1989): 503-528.<br />
<br />
[6] Han, J., Jentzen, A., & Weinan, E. (2018). Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505-8510.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=49060Functional regularisation for continual learning with gaussian processes2020-12-04T01:21:40Z<p>Jlavilez: Corrected spelling and improved grammar</p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposes a new framework to regularise Continual Learning so that the model does not forget previously learned tasks. This method, referred to as functional regularisation for Continual Learning, leverages the Gaussian process to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used in optimisation as a regulariser to prevent the model from completely deviating from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimisation of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularisation-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope of regularising the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularising weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in increased forgetting of earlier tasks when the sequence of tasks is long.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimised using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularisation-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularises the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalisation of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution of continuous functions - this is why we make use of GP to perform optimisation in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterised by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrised by the kernel function.<br />
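<br />
The connection between a GP and a multivariate normal can be sketched by drawing a prior sample at a handful of inputs. The squared-exponential kernel, the input locations, and the small diagonal jitter are illustrative assumptions made here, not choices taken from the paper.<br />
<br />
```python
import math
import random

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel; an illustrative choice of covariance."""
    return math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def sample_gp_prior(xs, rng, jitter=1e-8):
    """Draw f(xs) ~ N(0, K) with K_ij = k(x_i, x_j); jitter aids stability."""
    K = [[rbf_kernel(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    # f = L z has covariance L L^T = K
    f = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]
    return f, K, L

xs = [0.0, 0.5, 1.0, 1.5]
sample, K, L = sample_gp_prior(xs, random.Random(0))
print(sample)
```
<br />
Nearby inputs receive strongly correlated function values, which is how the kernel encodes smoothness of the sampled functions.<br />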
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers to a distribution over functions, and <math>\mathcal{N}</math> refers to a distribution over finitely many random variables.<br />
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observations taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i),\]<br />
Instead of having a fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. We can then summarise the information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, this poses the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vectors <math>X_i</math>. To see this, note that <math>\Sigma_i</math> is an <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the sparse Gaussian process with inducing points, which is introduced as follows.<br />
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
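<br />
The storage saving from inducing points can be illustrated with the kernel <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math> defined earlier. In the sketch below, the function <code>features</code> is a hypothetical fixed map standing in for the final hidden layer of the network, and the sizes of <code>X</code> and <code>Z</code> are arbitrary illustrative choices.<br />
<br />
```python
import math

def features(x):
    """Hypothetical stand-in for the last hidden layer phi(x; theta), with K = 3."""
    return [1.0, math.tanh(x), math.tanh(2.0 * x)]

def feature_kernel(xa, xb, sigma_w=1.0):
    """k(x, x') = sigma_w^2 * phi(x)^T phi(x')."""
    return sigma_w ** 2 * sum(p * q for p, q in zip(features(xa), features(xb)))

def gram(points, sigma_w=1.0):
    """Kernel matrix over a list of scalar inputs."""
    return [[feature_kernel(a, b, sigma_w) for b in points] for a in points]

X = [i / 50.0 - 1.0 for i in range(101)]  # N = 101 training inputs
Z = X[::25]                               # M = 5 inducing inputs, a subset of X

K_xx, K_zz = gram(X), gram(Z)
# Number of covariance entries needed with and without inducing points
print(len(K_xx) ** 2, len(K_zz) ** 2)
```
<br />
Keeping only the inducing inputs and a distribution over their function values reduces the covariance to an <math>M_i \times M_i</math> matrix instead of an <math>N_i \times N_i</math> one.<br />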
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrise <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrised by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimising the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximising the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularise the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximised is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
As a result, we regularise the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimisation computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularisation term.<br />
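The sub-sampling step can be sketched as follows. This is only an illustration of one standard unbiased estimator (draw a few tasks without replacement and rescale); the paper's exact sampling scheme is not stated in this summary, and <code>kl_of_task</code> is a hypothetical callback that would compute <math>\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> for task <math>i</math>.<br />

```python
import random

def subsampled_regulariser(kl_of_task, num_tasks, num_samples, rng):
    """Unbiased estimate of sum_i KL_i using only num_samples of the num_tasks terms."""
    chosen = rng.sample(range(num_tasks), num_samples)   # without replacement
    scale = num_tasks / num_samples                      # rescale to the full sum
    return scale * sum(kl_of_task(i) for i in chosen)

# Hypothetical KL values for four earlier tasks.
kl_values = [0.8, 1.5, 0.3, 2.4]
rng = random.Random(0)
estimate = subsampled_regulariser(lambda i: kl_values[i], len(kl_values), 2, rng)
```

Only the sampled KL terms are evaluated at each step, so the per-iteration cost no longer grows with the number of previous tasks, at the price of a noisier regulariser.<br />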
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>: performing inference over the current task in the weight space. Because the number of inducing points imposes a trade-off between accuracy and scalability, we can use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(w_k)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularisation from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over the function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, <math>L_{w_k}</math> is the Cholesky factor of <math>\Sigma_{w_k}</math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
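The conversion from the weight-space posterior to the task summary is a pair of matrix products. A small sketch, assuming NumPy arrays and hypothetical sizes (3 inducing points, 4 features); the function name is illustrative:<br />

```python
import numpy as np

def weights_to_inducing_posterior(Phi_Z, mu_w, L_w):
    """Map q(w) = N(mu_w, L_w L_w^T) to q(u) = N(mu_u, L_u L_u^T) with u = Phi_Z w."""
    mu_u = Phi_Z @ mu_w          # mean of the function values at Z
    L_u = Phi_Z @ L_w            # factor of the covariance Phi_Z Sigma_w Phi_Z^T
    return mu_u, L_u

rng = np.random.default_rng(0)
Phi_Z = rng.normal(size=(3, 4))                  # rows: features of inducing inputs
mu_w = rng.normal(size=4)
L_w = np.tril(rng.normal(size=(4, 4)))           # lower-triangular factor of Sigma_w
mu_u, L_u = weights_to_inducing_posterior(Phi_Z, mu_w, L_w)
Sigma_u = L_u @ L_u.T                            # covariance of q(u)
```

Note that <math>L_{u_k}</math> obtained this way is a valid factor of the covariance but is in general rectangular rather than lower triangular.<br />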
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random subset <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix <math>K_{X_k}</math> can be reconstructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrised by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure where the final selected inducing points are spread out in different clusters of data. On the right side of the image, the round dots represent the data points and each colour corresponds to a different label. The left part of the image shows how optimised inducing images cover examples from all classes as opposed to the randomised inducing points where each example could have a skewed number of points from the same class.<br />
<br />
[[File:inducing-points-extended.png|centre]]<br />
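The trace criterion can be evaluated directly from kernel matrices. The sketch below is an illustration only: it assumes an RBF kernel on raw inputs as a stand-in for the paper's kernel, adds a small jitter term for numerical stability, and uses a hypothetical one-dimensional data set with two clusters.<br />

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """RBF kernel matrix between the rows of A and B (stand-in for K(., .))."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)

def trace_criterion(X, Z, kernel=rbf_kernel, jitter=1e-8):
    """T(Z) = tr(K_X - K_XZ K_Z^{-1} K_ZX); smaller means Z explains X better."""
    K_Z = kernel(Z, Z) + jitter * np.eye(len(Z))
    K_XZ = kernel(X, Z)
    prior_var = np.sum(np.diag(kernel(X, X)))                      # tr(K_X)
    explained = np.sum(K_XZ * np.linalg.solve(K_Z, K_XZ.T).T)      # tr(K_XZ K_Z^{-1} K_ZX)
    return float(prior_var - explained)

# Two clusters of inputs; compare one inducing point per cluster with two in one cluster.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
t_spread = trace_criterion(X, X[[0, 2]])
t_clustered = trace_criterion(X, X[[0, 1]])
```

In this toy case the spread-out selection yields a much smaller value than the clustered one, which matches the intuition that minimising <math>\mathcal{T}(Z_k)</math> favours inducing points that cover the input space.<br />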
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^TK_{Z_i}^{-1} k_{Z_ix_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_ix_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_ix_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
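As a sanity check on these formulas, the following sketch evaluates the predictive mean and variance of <math>f_{i,*}</math> with NumPy. The function and variable names are illustrative, and the kernel values are hypothetical numbers rather than outputs of a trained model.<br />

```python
import numpy as np

def predict_f(k_star_star, k_Zx, K_Z, mu_u, L_u):
    """Mean and variance of q(f_*) for a single test input.

    k_star_star: k(x_*, x_*), k_Zx: kernel vector between Z and x_*,
    K_Z: kernel matrix of the inducing inputs, (mu_u, L_u): parameters of q(u).
    """
    a = np.linalg.solve(K_Z, k_Zx)                       # K_Z^{-1} k_{Z x_*}
    mean = mu_u @ a
    var = k_star_star + a @ (L_u @ L_u.T - K_Z) @ a
    return float(mean), float(var)

K_Z = np.array([[1.0, 0.5], [0.5, 1.0]])
k_Zx = np.array([0.7, 0.2])

# If q(u) equals the prior, the prediction falls back to the prior variance.
prior_mean, prior_var = predict_f(1.0, k_Zx, K_Z, np.zeros(2), np.linalg.cholesky(K_Z))
```

The bracketed term <math>L_{u_i}L_{u_i}^T - K_{Z_i}</math> therefore measures how much the learned posterior over <math>\boldsymbol{u}_i</math> has moved away from the prior.<br />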
<br />
It is important to emphasise that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete prediction.<br />
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in the practical setting. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density of a new task data and the prior GP density using symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math>: the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameter, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
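The detection procedure can be sketched in two steps: compute a per-point symmetric KL score, then compare the scores of two batches with Welch's t-test. The code below is a minimal illustration that assumes univariate Gaussian predictive marginals; the batch values are hypothetical and the decision threshold is left to the reader.<br />

```python
import math
from statistics import mean, variance

def symmetric_kl(mu_q, var_q, mu_p, var_p):
    """KL(q||p) + KL(p||q) for two univariate Gaussians."""
    d2 = (mu_q - mu_p) ** 2
    return 0.5 * (var_q / var_p + var_p / var_q - 2.0 + d2 * (1.0 / var_p + 1.0 / var_q))

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for samples a and b."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical (mean, variance) pairs of the posterior; the prior is N(0, 1).
old_batch = [(1.8, 0.20), (2.1, 0.25), (1.6, 0.30), (2.0, 0.22)]   # familiar task
new_batch = [(0.1, 0.95), (-0.2, 0.90), (0.0, 1.00), (0.2, 0.97)]  # posterior near prior
scores_old = [symmetric_kl(m, v, 0.0, 1.0) for m, v in old_batch]
scores_new = [symmetric_kl(m, v, 0.0, 1.0) for m, v in new_batch]
t_stat, dof = welch_t(scores_new, scores_old)
```

A strongly negative statistic indicates that the scores of the current batch have dropped relative to the previous batch, which is the signal associated with a task switch.<br />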
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimised using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes more and more important when the number of inducing points reduces.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularisation-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way of remembering previous tasks by reducing the KL divergence between the variational distribution <math>q(\boldsymbol{u}_1)</math> and the prior <math>p_\theta(u_1)</math>. The ideas in the paper are interesting and the experiments support the effectiveness of this approach. After reading the summary, the following points came to mind:<br />
<br />
The main problem with Gaussian Processes is their test-time computational load: a Gaussian Process needs a data matrix and a kernel for each prediction. This is natural, since a Gaussian Process is non-parametric and has no source of knowledge other than the data, but it comes with computational and memory costs that make it difficult to employ in practice. In this paper, the authors propose to employ a subset of the training data, namely "Inducing Points", to mitigate these challenges. They propose to choose Inducing Points either at random or based on an optimisation scheme in which the Inducing Points should approximate the kernel. Although the authors formulate the problem of Inducing Points in their own setting, this is not a new approach in the field; it was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper employs the kernel matrix, whereas the mentioned paper employs dissimilarities, to find exemplars or inducing points.<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice, covariance matrices are in general only positive semi-definite, while, to the best of my knowledge, the Cholesky decomposition applies to positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to the covariance matrix.<br />
<br />
Finally, the number of regularisation terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, so I am not sure how this algorithm behaves when the number of tasks increases. Clearly, apart from the computational cost, having many regularisation terms can make optimisation more difficult.<br />
<br />
The provided experiments are interesting and sufficient, and do a good job of highlighting different facets of the paper, but it would be better if two additional results were provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning&diff=49059Adversarial Fisher Vectors for Unsupervised Representation Learning2020-12-04T01:08:34Z<p>Jlavilez: Corrected spelling and improved grammar</p>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models, where discriminators and generators compete with each other to solve a minimax game. Based on the original GAN paper, when the training is finished and Nash Equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting the discriminator is nothing more than a tool to train the generator. Furthermore, the generator in a traditional GAN models the data density in an implicit manner, while in some applications we need to have an explicit generative model of data. Recently, it has been shown that training an energy-based model (EBM) with a parameterised variational distribution is also a minimax game similar to the one in GAN. Although they are similar, an advantage of this EBM view is that unlike the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Building on these observations, the authors show that an energy-based model can be trained using a minimax formulation similar to that of GANs. After training the energy-based model, they use the Fisher Score and Fisher Information (which are calculated from the derivative of the generative model w.r.t. its parameters) to evaluate the power of the discriminator in modelling the data distribution. More precisely, they calculate normalised Fisher Vectors and a Fisher Distance measure using the discriminator's derivative to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). The Fisher Vector is a powerful representation that can be calculated from EBMs thanks to the fact that, in this EBM model, the discriminator itself is an explicit density model of the data. Fisher Vectors can also be used for set representation problems, which are challenging. In fact, as we will see, we can use the Fisher kernel to calculate the distance between two sets of images, which is not a trivial task. The authors find several applications and attractive characteristics for AFVs as pre-trained features, such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimisation problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[-\log D(\mathbf{x})]- E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
where <math> p_{data}(\mathbf{x}) </math>, <math> D(\mathbf{x}) </math>, and <math> G(\mathbf{z}) </math> are the data distribution, the discriminator, and the generator, respectively. To optimise the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math> is updated one step given <math> D </math>.<br />
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model (GM) that learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) of the <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution which is called the variational distribution and <math>H(q) </math> defines its entropy. Here Jensen’s inequality was used to obtain the variational lower bound on the NLL given <math>H(q) </math>. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we put <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and also <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns to the following problem:<br />
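The bound in Eq. 2 and its tightness condition can be checked numerically on a small discrete state space, where the integral becomes a sum. This is a toy sketch with hypothetical energies and data probabilities, not an experiment from the paper.<br />

```python
import numpy as np

energy = np.array([0.5, 1.0, 2.0, 0.1])      # hypothetical energies E(x) of 4 states
p_data = np.array([0.4, 0.3, 0.1, 0.2])      # hypothetical data distribution

def nll(energy, p_data):
    """Exact NLL: E_data[E(x)] + log sum_x exp(-E(x))."""
    return float(p_data @ energy + np.log(np.sum(np.exp(-energy))))

def variational_bound(energy, p_data, q):
    """Lower bound: E_data[E(x)] - E_q[E(x)] + H(q)."""
    entropy = -np.sum(q * np.log(q))
    return float(p_data @ energy - q @ energy + entropy)

p_E = np.exp(-energy) / np.sum(np.exp(-energy))   # model density
q_uniform = np.full(4, 0.25)                      # an arbitrary variational distribution

exact = nll(energy, p_data)
loose = variational_bound(energy, p_data, q_uniform)
tight = variational_bound(energy, p_data, p_E)
```

With a uniform <math>q</math> the bound is strictly below the exact NLL, and it coincides with the NLL when <math>q = p_E</math>, as stated above.<br />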
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[-D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[D(G(\mathbf{z}))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where, in the inner loop, the variational lower bound is maximised w.r.t. <math> p_{G}</math>; the energy model is then updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure 1). [[File:Fig1.png|centre]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both take the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularisation term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularisation methods such as batch normalisation are used).<br />
* The order of optimising <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As mentioned above, one of the most important advantages of an EBM GAN over a traditional one is that the discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way to evaluate the quality of the generator: inspect the quality of the produced samples. However, when it comes to the discriminator, it is not clear how to evaluate or use a discriminator trained in a minimax scheme. To evaluate and also employ the discriminator of the GAN, the authors propose to use the theory of Fisher Information. This theory was proposed with the motivation of making connections between two different types of models used in machine learning, i.e., generative and discriminative models. Given a density model <math> p_{\theta}(\mathbf{x})</math>, where <math> \mathbf{x} \in R^d </math> is the input and <math> \theta </math> are the model parameters, the Fisher Score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, the gradient <math> U_\mathbf{x} </math> defines the direction of steepest ascent in <math> \log p(\mathbf{x}|\theta) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, the Fisher Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The authors define the Fisher Information as <math> I=E_{\mathbf{x} \sim p_{\theta}(\mathbf{x})} [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having the Fisher Information and Fisher Score, one can then map an example <math> \mathbf{x} </math> from feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math> and <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric distance is called the Fisher Distance and can easily be generalised to measure the distance between two sets. Finally, the Adversarial Fisher Vector (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, the Fisher Distance is equivalent to the Euclidean distance between AFVs. Fisher Vector theory has traditionally been used with simple generative models such as Gaussian mixture models (GMMs).<br />
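The equivalence can be verified in one step. Assuming <math> I </math> is symmetric positive definite, so that <math> I^{-\frac{1}{2}} </math> exists and is symmetric, the quadratic form defined above is the squared Euclidean distance between the corresponding AFVs:<br />
\begin{align*}<br />
(U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y}) &= \left(I^{-\frac{1}{2}}(U_\mathbf{x}-U_\mathbf{y})\right)^T \left(I^{-\frac{1}{2}}(U_\mathbf{x}-U_\mathbf{y})\right)\\<br />
&= \|V_\mathbf{x}-V_\mathbf{y}\|_2^2<br />
\end{align*}<br />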
In the domain of EBMs, where the density model is parameterised as <math> p_\theta(\mathbf{x})= \frac{e^{D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are the parameters of <math> D</math>, the Fisher Score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN, the generator is updated during the training to match the distribution of <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Information and Fisher Score in an EBM GAN as follows:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using Fisher Score, we transform data from feature space to the parameter space which means that the dimensionality of the vectors can easily be up to millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> is an attempt to reduce the computational load of calculating final AFV.<br />
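Equations 5 and 6 can be sketched on a toy model. The code below assumes a hypothetical discriminator that is linear in its parameters, <math> D(x;\theta)=\theta^T[x, x^2] </math>, so that its parameter gradient is available in closed form, and it uses random draws as a stand-in for generator samples <math> G(\mathbf{z}) </math>. A real implementation would obtain the gradients from a neural network by automatic differentiation.<br />

```python
import numpy as np

def grad_D(x):
    """Gradient of the toy discriminator D(x; theta) = theta . [x, x^2] w.r.t. theta."""
    return np.stack([x, x ** 2], axis=-1)

def fisher_scores(x, generated):
    """Eq. 5: U_x = grad D(x) - mean over generated samples of grad D(G(z))."""
    return grad_D(x) - grad_D(generated).mean(axis=0)

def adversarial_fisher_vectors(x, generated, eps=1e-12):
    """Eq. 6: V_x = diag(I)^{-1/2} U_x, with I estimated from generated samples."""
    U_gen = fisher_scores(generated, generated)
    diag_I = np.mean(U_gen ** 2, axis=0)          # diagonal of E[U U^T]
    return fisher_scores(x, generated) / np.sqrt(diag_I + eps)

rng = np.random.default_rng(0)
generated = rng.normal(size=1000)     # stand-in for samples G(z)
real = rng.normal(size=5)             # stand-in for data examples
V_real = adversarial_fisher_vectors(real, generated)
V_gen = adversarial_fisher_vectors(generated, generated)
```

By construction, the AFVs of the generated samples have zero mean and unit second moment in every dimension, which is what the diagonal normalisation is meant to achieve. Only the diagonal of <math> I </math> is stored, so the memory cost is linear in the number of parameters.<br />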
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they can be used to sample from an unnormalised density and approximate the partition function. Stochastic gradient MCMC is of particular interest, as it uses the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients). This technique has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
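To make the update rule concrete, the sketch below runs unadjusted Langevin dynamics, a common instance of gradient-based MCMC, on a one-dimensional Gaussian target. The target, step size and chain count are hypothetical choices for illustration; the paper's generator update is a learned approximation of such a rule, not this loop.<br />

```python
import numpy as np

def langevin_step(x, grad_log_p, step_size, rng):
    """One noisy gradient-ascent step on log p(x)."""
    noise = rng.normal(size=x.shape)
    return x + 0.5 * step_size * grad_log_p(x) + np.sqrt(step_size) * noise

def grad_log_p(x, target_mean=3.0):
    """Gradient of log N(x | target_mean, 1) w.r.t. the input x."""
    return -(x - target_mean)

rng = np.random.default_rng(0)
samples = np.zeros(2000)                 # 2000 parallel chains, all started at 0
for _ in range(500):
    samples = langevin_step(samples, grad_log_p, 0.05, rng)
```

After enough steps the chains forget their starting point and are distributed approximately according to the target, using only gradients of the unnormalised log density.<br />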
<br />
== Related Work ==<br />
There are many variants of the GAN method that use a discriminator as a critic to differentiate given distributions. Examples of such variants are Wasserstein GAN, f-GAN and MMD-GAN. There is a resemblance between the training procedures of GANs and deep EBMs (with variational inference), but the work presented in the paper is different in that its discriminator directly learns the target distribution. The implementation of EBM presented in the paper directly learns the parametrised sampler. In some works, regularisation (by noise addition, penalising gradients, or spectral normalisation) has been introduced to make GANs more stable, but these additions do not have any formal justification. This paper connects the MCMC-based G update rule with the gradient penalty line of work. The following equations show how this method does not always sample from the generator; instead, a small proportion (with probability p) of the samples come from real examples.<br />
<br />
<div align="center">[[File:related_work_equations.png]]</div><br />
<br />
Early works showed incorporation of Fisher Information to measure similarity and this was extended to use Fisher Vector representations in case of images. Recently, Fisher Information has been used for meta learning as well. This paper explores the possibility of using Fisher Information in deep learning generative models. By using the generator as a sampler, Fisher Information can be computed even from an unnormalised density model.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As pointed out above, the main advantage of EBM GANs is their powerful discriminator, which can learn a density function that characterises the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors proposed to use Fisher Information theory. To do this, they trained several models, employed the discriminator to extract AFVs, and then used these vectors in unsupervised pretraining classification tasks.<br />
The results in Table 1 suggest that AFVs achieve state-of-the-art performance in unsupervised pretraining classification tasks and are comparable with supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure the distance between sets of data points. The authors took advantage of this and calculated the semantic distance between classes (all data points of every class) in CIFAR 10. As shown in Figure 2, although the training has been unsupervised, the semantic relation between classes is well estimated. For example, in Figure 2 cars are similar to trucks, and dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
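A set-level distance of this kind can be sketched as follows. The summary does not spell out the exact set distance, so this example assumes one simple choice: the Euclidean distance between the mean AFVs of two sets. The class names and vectors are hypothetical.<br />

```python
import numpy as np

def set_fisher_distance(afv_a, afv_b):
    """Euclidean distance between the mean AFVs of two sets (rows are examples)."""
    return float(np.linalg.norm(afv_a.mean(axis=0) - afv_b.mean(axis=0)))

def pairwise_class_distances(afv_by_class):
    """Distance for every unordered pair of classes."""
    names = sorted(afv_by_class)
    return {(a, b): set_fisher_distance(afv_by_class[a], afv_by_class[b])
            for a in names for b in names if a < b}

# Hypothetical 2-dimensional AFVs for three classes.
afv_by_class = {
    "car":   np.array([[0.0, 0.0], [2.0, 0.0]]),
    "truck": np.array([[1.0, 1.0], [1.0, 1.0]]),
    "cat":   np.array([[1.0, 3.0], [1.0, 5.0]]),
}
distances = pairwise_class_distances(afv_by_class)
```

In this toy data the car and truck sets end up closer to each other than either is to the cat set, mirroring the kind of class similarity structure shown in Figure 2.<br />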
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, authors ran experiments to examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They use the AFV representation to calculate distances between image patches and compare with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on 2AFC and Just Noticeable Difference (JND) metrics. They trained a GAN on ImageNet and then calculate AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. The dimensionality of AFVs is 3 orders of magnitude higher than that of the existing methods, which would typically bring a higher propensity to overfitting. However, AFVs still show great generalisation ability, demonstrating that they indeed encode a meaningful low-dimensional subspace of the original data. Figure 6 shows the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet, and thus have limited applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between the sets of real and generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over the set {1000; 5000; 25000; 50000}. Figure 3 shows batch-wise estimates of the Inception Score and the "Fisher Similarity". It is clear that for larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to that of the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
<br />
<br />
===Interpreting G update as parameterised MCMC===<br />
AFVs can only be applied if the generator approximates the EBM during the training process. The model is trained on ImageNet at 64x64 resolution, with the default architecture modified by adding residual blocks to the discriminator and generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. The authors also showed that in an EBM GAN, the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with a low error rate. This is especially different from regular GANs, where the discriminator is reduced to a constant function once the Nash Equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10.<br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantisation.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the defined formulas and of why this representation performs well in classification tasks. An "Analysis" section would therefore make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=49057Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-12-04T01:02:05Z<p>Jlavilez: Editorial modifications</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularisation techniques or methods, which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in the important real-world scenarios that we endeavour to analyse, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimisation of a deep neural network. This technique is most useful in situations where established PDE models exist, but where the amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularisation agent, constraining the space of acceptable solutions to help the optimisation converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data, which makes it necessary to include information from the PDE in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimisation is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimisation, information from both the known data and the known physics (from the PDE) can be incorporated into the neural network. This effectively regularises the optimisation, allowing the network to learn from a smaller number of data points than would otherwise be necessary. An illustration of this method is given in the continuous-time example below, including figures 1 and 2.<br />
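<br />
The two-part loss can be illustrated with a minimal sketch. It is not the authors' implementation: a closed-form candidate function stands in for the neural network, central finite differences stand in for automatic differentiation, and the PDE (a heat equation with an assumed coefficient <code>NU</code>) as well as the names <code>residual</code>, <code>pinn_loss</code>, <code>exact</code> and <code>frozen</code> are chosen purely for illustration.<br />

```python
import math

NU = 0.1  # assumed diffusion coefficient of the illustrative PDE u_t - NU*u_xx = 0


def residual(u, t, x, h=1e-3):
    """f = u_t + N[u] with N[u] = -NU*u_xx, via central finite differences."""
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h**2
    return u_t - NU * u_xx


def pinn_loss(u, data):
    """Return (MSE_u, MSE_f, MSE) for a candidate u on (t, x, u_observed) triples."""
    n = len(data)
    mse_u = sum((u(t, x) - obs) ** 2 for t, x, obs in data) / n
    mse_f = sum(residual(u, t, x) ** 2 for t, x, _ in data) / n
    return mse_u, mse_f, mse_u + mse_f


def exact(t, x):
    """Exact solution of the illustrative PDE."""
    return math.exp(-NU * math.pi**2 * t) * math.sin(math.pi * x)


def frozen(t, x):
    """A poor candidate that ignores the decay in time."""
    return math.sin(math.pi * x)


# Noise-free observations at a handful of (t, x) locations.
data = [(t, x, exact(t, x)) for t in (0.25, 0.5, 0.75) for x in (-0.5, 0.25, 0.5)]
good = pinn_loss(exact, data)
bad = pinn_loss(frozen, data)
print("exact candidate :", good)
print("frozen candidate:", bad)
```

The candidate that satisfies both the data and the PDE drives both terms towards zero, whereas the time-independent candidate is penalised by both the data term and the physics term.<br />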
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify agreement with the data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme needs to be inverted and solved for the initial and final states as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of data points at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of data points at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
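<br />
To make the stage equations and their inversion concrete, here is a small sketch under stated assumptions: the toy problem <code>u_t + LAM*u = 0</code> (so that <code>N[u] = LAM*u</code>), the two-stage Gauss-Legendre coefficients, and all function names are illustrative choices rather than anything taken from the paper's code. The inversion used in <code>snapshot_predictions</code> is obtained by rearranging the two Runge-Kutta equations above so that each stage gives its own estimate of the two snapshots.<br />

```python
import math

S3 = math.sqrt(3)
# Two-stage Gauss-Legendre coefficients (an implicit Runge-Kutta scheme, q = 2).
A = [[1 / 4, 1 / 4 - S3 / 6], [1 / 4 + S3 / 6, 1 / 4]]
B = [1 / 2, 1 / 2]
LAM = 1.0  # assumed rate in the toy problem u_t + LAM*u = 0, i.e. N[u] = LAM*u


def N(u):
    return LAM * u


def stages(u_n, dt):
    """Solve u^{n+c_i} = u^n - dt * sum_j a_ij N[u^{n+c_j}] for the two stages."""
    # For linear N this is the 2x2 system (I + dt*LAM*A) U = u_n * [1, 1],
    # solved here with Cramer's rule.
    m00, m01 = 1 + dt * LAM * A[0][0], dt * LAM * A[0][1]
    m10, m11 = dt * LAM * A[1][0], 1 + dt * LAM * A[1][1]
    det = m00 * m11 - m01 * m10
    return [u_n * (m11 - m01) / det, u_n * (m00 - m10) / det]


def step(u_n, dt):
    """u^{n+1} = u^n - dt * sum_j b_j N[u^{n+c_j}]."""
    U = stages(u_n, dt)
    return u_n - dt * sum(b * N(u) for b, u in zip(B, U))


def snapshot_predictions(U, dt):
    """Invert the scheme: each stage i yields its own estimate of u^n and u^{n+1}."""
    q = len(U)
    u_n = [U[i] + dt * sum(A[i][j] * N(U[j]) for j in range(q)) for i in range(q)]
    u_np1 = [
        U[i] + dt * sum((A[i][j] - B[j]) * N(U[j]) for j in range(q))
        for i in range(q)
    ]
    return u_n, u_np1


dt = 0.1
U = stages(1.0, dt)
pred_n, pred_np1 = snapshot_predictions(U, dt)
print("stage values:", U)
print("per-stage estimates of u^n    :", pred_n)
print("per-stage estimates of u^{n+1}:", pred_np1)
```

In a discrete-time PINN the stage values would come from the network outputs rather than from a linear solve, and the per-stage estimates would be compared with the data at the two snapshots through <math display="inline"> SSE_n </math> and <math display="inline"> SSE_{n+1} </math>.<br />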
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small number of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, which would rely on assuming the form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regulariser. The neural network training procedure is, in essence, unchanged except that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
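<br />
The idea of training the PDE parameters alongside the model weights can be sketched on a deliberately tiny problem. Everything here is an assumption made for illustration: the governing equation is the ODE <code>u_t + lam*u = 0</code>, the "network" is the one-weight model <code>u(t) = exp(theta*t)</code>, gradients are written out by hand, and plain gradient descent replaces the optimiser used in the paper.<br />

```python
import math

TRUE_LAM = 2.0  # parameter used to generate the synthetic observations
ts = [0.1 * k for k in range(11)]
obs = [math.exp(-TRUE_LAM * t) for t in ts]


def loss_and_grads(theta, lam):
    """MSE_u + MSE_f and its gradient for the one-weight model u(t) = exp(theta*t)."""
    n = len(ts)
    loss = g_theta = g_lam = 0.0
    for t, y in zip(ts, obs):
        u = math.exp(theta * t)
        f = (theta + lam) * u  # residual u_t + lam*u of the assumed ODE
        loss += (u - y) ** 2 + f**2
        # d/dtheta of (u - y)^2 is 2*(u - y)*t*u; d/dtheta of f is u + (theta + lam)*t*u.
        g_theta += 2 * (u - y) * t * u + 2 * f * (u + (theta + lam) * t * u)
        g_lam += 2 * f * u  # d/dlam of f is u
    return loss / n, g_theta / n, g_lam / n


def fit(steps=20000, lr=0.2):
    """Update the model weight and the equation parameter in the same loop."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        _, g_theta, g_lam = loss_and_grads(theta, lam)
        theta -= lr * g_theta
        lam -= lr * g_lam
    return theta, lam


theta, lam = fit()
print("learned weight theta:", theta)
print("learned parameter lam:", lam)
```

The data term pulls the model towards the observations, while the residual term ties the equation parameter to whatever the model has learned, so both are recovered together.<br />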
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after a sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also, assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimising the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
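<br />
As a rough check of how the residual <math display="inline"> f </math> singles out the correct parameters, the sketch below evaluates it on a known smooth solution of the viscous Burgers equation. The travelling-wave solution <code>u = C - A*tanh(k*(x - C*t))</code> with <code>k = A/(2*NU)</code>, the amplitude <code>A</code> and speed <code>C</code>, and the function names are assumptions introduced for illustration; this is not the dataset used in the paper, and the derivatives are written analytically rather than obtained by automatic differentiation.<br />

```python
import math

NU = 0.01 / math.pi  # true value of lambda_2 in the example
A, C = 0.05, 0.2     # hypothetical wave amplitude and speed


def wave(t, x):
    """Travelling-wave solution u = C - A*tanh(k*(x - C*t)) and its derivatives."""
    k = A / (2 * NU)
    T = math.tanh(k * (x - C * t))
    sech2 = 1 - T * T
    u = C - A * T
    u_t = C * A * k * sech2
    u_x = -A * k * sech2
    u_xx = 2 * A * k * k * T * sech2
    return u, u_t, u_x, u_xx


def mse_f(lam1, lam2, points):
    """Mean squared residual f = u_t + lam1*u*u_x - lam2*u_xx over the points."""
    total = 0.0
    for t, x in points:
        u, u_t, u_x, u_xx = wave(t, x)
        total += (u_t + lam1 * u * u_x - lam2 * u_xx) ** 2
    return total / len(points)


points = [(0.1 * i, -1 + 0.25 * j) for i in range(11) for j in range(9)]
at_truth = mse_f(1.0, NU, points)
print("MSE_f at the true parameters:", at_truth)
print("MSE_f with lambda_1 = 1.1   :", mse_f(1.1, NU, points))
print("MSE_f with lambda_2 doubled :", mse_f(1.0, 2 * NU, points))
```

The residual vanishes (up to rounding) only at the true parameter values, which is the signal the optimiser exploits when <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are treated as trainable.<br />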
<br />
For this example, assume that we have 2000 data points across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the data points are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also, assume that the value of the solution for each of the known data points is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the Limited-memory BFGS (L-BFGS) optimiser.<br />
<br />
==== L-BFGS optimiser [5] ====<br />
<br />
L-BFGS is an optimisation algorithm in the family of quasi-Newton methods and a popular algorithm for parameter estimation in machine learning. The aim in L-BFGS is to minimise f(x) over unconstrained values of the real vector x, where f is a differentiable scalar function. Unlike the original BFGS, which stores a dense approximation to the inverse Hessian, L-BFGS stores only a few vectors that represent this approximation implicitly.<br />
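<br />
The implicit representation can be sketched with the standard two-loop recursion, which applies the inverse-Hessian approximation to a gradient using only the stored pairs of parameter and gradient differences. This is a minimal pure-Python sketch: the backtracking line search, the memory size, the tolerances, and the toy quadratic objective are assumptions for illustration, and it is not the optimiser implementation used by the authors.<br />

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_loop(grad, pairs):
    """Apply the implicit inverse-Hessian approximation to grad.

    pairs is a list of (s, y) with s = x_new - x and y = g_new - g, oldest first.
    """
    q = list(grad)
    alphas = []
    for s, y in reversed(pairs):  # newest to oldest
        a = dot(s, q) / dot(y, s)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Scale by gamma = s.y / y.y from the newest pair (identity if no history).
    gamma = dot(pairs[-1][0], pairs[-1][1]) / dot(pairs[-1][1], pairs[-1][1]) if pairs else 1.0
    r = [gamma * qi for qi in q]
    for (s, y), a in zip(pairs, reversed(alphas)):  # oldest to newest
        b = dot(y, r) / dot(y, s)
        r = [ri + si * (a - b) for ri, si in zip(r, s)]
    return r


def minimise(f, grad_f, x0, memory=5, max_iter=100):
    x, g, pairs = list(x0), grad_f(x0), []
    for _ in range(max_iter):
        if dot(g, g) < 1e-16:
            break
        d = [-r for r in two_loop(g, pairs)]
        step, fx, slope = 1.0, f(x), dot(g, d)
        # Backtracking (Armijo) line search, an assumed simple choice.
        while step > 1e-12 and f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(y, s) > 1e-10 * dot(y, y):  # keep only pairs with positive curvature
            pairs = (pairs + [(s, y)])[-memory:]
        x, g = x_new, g_new
    return x


def f(x):
    return 0.5 * (x[0] ** 2 + 10 * x[1] ** 2) - x[0] - x[1]


def grad_f(x):
    return [x[0] - 1, 10 * x[1] - 1]


x_min = minimise(f, grad_f, [0.0, 0.0])
print("minimiser found:", x_min)
```

Only the last <code>memory</code> pairs are kept, so the cost per iteration grows linearly with the number of parameters rather than quadratically as it would with a dense inverse-Hessian matrix.<br />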
<br />
=== Results ===<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the data points selected for training marked. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times, showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One fascinating example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
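<br />
One reason for working with <math display="inline"> \psi </math> is that velocities derived from it are divergence-free by construction, so mass conservation never has to be enforced separately. The sketch below checks this numerically; the stream function <code>sin(x)*cos(y)</code> is a hypothetical stand-in for the network output, and central finite differences replace automatic differentiation.<br />

```python
import math

H = 1e-3  # finite-difference step


def psi(x, y):
    """Hypothetical stream function standing in for the network output."""
    return math.sin(x) * math.cos(y)


def velocity(x, y):
    """u = psi_y and v = -psi_x, by central differences."""
    u = (psi(x, y + H) - psi(x, y - H)) / (2 * H)
    v = -(psi(x + H, y) - psi(x - H, y)) / (2 * H)
    return u, v


def divergence(x, y):
    """u_x + v_y, again by central differences."""
    u_x = (velocity(x + H, y)[0] - velocity(x - H, y)[0]) / (2 * H)
    v_y = (velocity(x, y + H)[1] - velocity(x, y - H)[1]) / (2 * H)
    return u_x + v_y


u, v = velocity(0.3, 0.7)
print("velocity at (0.3, 0.7):", (u, v))
print("divergence at (0.3, 0.7):", divergence(0.3, 0.7))
```

The mixed derivatives of <math display="inline"> \psi </math> cancel in <math display="inline"> u_x + v_y </math>, so the divergence is zero up to rounding error regardless of which stream function is used.<br />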
<br />
We allow ourselves 1% of the total data and optimise the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimisation can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
==Critiques==<br />
<br />
Although this paper has presented very interesting results and builds a bridge between machine learning and classical computational physics, some questions are still unanswered. For example, how deep should the neural network be? How much data is needed? Why does the optimiser not get trapped in local optima for the parameters of the differential operators? Can weight initialisation and data normalisation be improved? Why do these methods seem to be so robust to noise in the data? How can uncertainty in the predictions be interpreted, a question that points towards the concept of interpretable AI? Answering these questions would be a natural next step for this research direction.<br />
<br />
In this paper, a quasi-Newton optimiser has been used to update parameters. Although such optimisers are more powerful than first-order optimisers, their computational load means they are not the common choice in today's deep learning packages. Considering this, can first-order optimisers handle updating the weights in such a problem, or might they get stuck in local minima? There is no such experiment in the paper.<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that uses existing information on physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for the prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs<br />
<br />
[5] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimisation." Mathematical programming 45.1-3 (1989): 503-528.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=49050orthogonal gradient descent for continual learning2020-12-04T00:36:13Z<p>Jlavilez: </p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Presented By == <br />
Parsa Torabian<br />
<br />
== Introduction == <br />
Neural networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN to look at only one label of CIFAR10 at a time results in poor performance for the initially trained labels (catastrophic forgetting). But that same CNN will perform really well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real-world applications of machine learning. For example, a medical imaging classifier might be able to classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases, such as local, rare, or new diseases (like COVID-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles that can help the reader get acquainted with the subject ([4], [9], [10] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion-based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations. One of the downsides of this method is the growing size of the model with an increasing number of tasks. There are also regularization-based methods, which constrain weight updates according to some importance measure for previous tasks. Finally, there are the repetition-based methods. These models attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning. This can be done by using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight behind OGD is to leverage the overparameterization of neural networks, meaning that they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto the subspace orthogonal to the span of the gradients stored for previous tasks. A non-trivial orthogonal subspace will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change in the network output for a given step size, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions on the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, the authors have also done experiments where they only keep track of the gradient with respect to the ground truth logit (OGD-GTL) and with the logits averaged (OGD-AVE). OGD-ALL keeps track of gradients of dimension N*C where N is the size of the previous task and C is the number of classes. OGD-GTL and OGD-AVE only store gradients of dimension N since the remaining class logits are either ignored or averaged, respectively. To further manage memory, the authors sample from all the gradients of the old task, and they find that 200 is sufficient, with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. One such algorithm which can be utilized to improve numerical stability is the modified Gram-Schmidt Orthogonalisation. The issue with the simpler Gram-Schmidt algorithm can be seen in the following:<br />
<br />
Suppose we have a matrix <math>A</math> which is to be decomposed into <math>A=\hat{Q}\hat{R}</math> using the Gram-Schmidt algorithm. During the algorithm, the columns of <math>\hat{Q}</math> are computed sequentially, where <math>\hat{\vec{q_j}}</math> is the <math>j^{th}</math> column of <math>\hat{Q}</math>, and the entries <math>\hat{r_{ij}}</math> in the <math>i^{th}</math> row and <math>j^{th}</math> column of <math>\hat{R}</math> are computed from left to right and top to bottom, only for the entries on or above the diagonal, so that <math>\hat{R}</math> is an upper triangular matrix. Consider the calculation of the third column of <math>\hat{Q}</math>, which before normalization is <math>\hat{\vec{q_{3}}}=\vec{a_3} - (\hat{\vec{q_1}}^T\vec{a_3})\hat{\vec{q_1}} - (\hat{\vec{q_2}}^T\vec{a_3})\hat{\vec{q_2}}</math>. The partial result <math> \vec{z_3}=\vec{a_3} - (\hat{\vec{q_1}}^T\vec{a_3})\hat{\vec{q_1}} </math> should not have a component in the direction <math> \hat{\vec{q_1}}</math>; however, due to numerical instability and catastrophic cancellation [11], this is not always true. The partial result <math>\vec{z_3}</math> ends up having a component in this direction, which leads to a loss of orthogonality in the columns of <math>\hat{Q}</math>. To remedy this problem, the modified Gram-Schmidt algorithm replaces <math>\vec{a_3}</math> with <math>\vec{z_3}</math> in <math>(\hat{\vec{q_2}}^T\vec{a_3})\hat{\vec{q_2}}</math>. This helps preserve the orthogonality of the columns of <math>\hat{Q}</math> despite any loss of numerical significance, since we orthogonalize against the vector which already carries the loss of significance.<br />
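<br />
The difference between the two variants can be seen in a small numerical experiment. The following is a minimal sketch, not code from the paper: the function names and the test matrix (a nearly rank-deficient matrix with a small parameter <code>eps</code>) are assumptions chosen for illustration.<br />
<br />
```python
import numpy as np


def gram_schmidt(A, modified):
    """Thin QR factorisation A = Q R by classical or modified Gram-Schmidt."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            # Classical: coefficient taken from the original column a_j.
            # Modified: coefficient taken from the running residual v
            # (the partial result z_3 in the text above).
            target = v if modified else A[:, j]
            R[i, j] = Q[:, i] @ target
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R


def orthogonality_error(Q):
    """Largest entry of |Q^T Q - I|; zero for exactly orthonormal columns."""
    return np.max(np.abs(Q.T @ Q - np.eye(Q.shape[1])))


eps = 1e-8  # small enough that 1 + eps**2 rounds to 1 in double precision
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])

Q_cgs, R_cgs = gram_schmidt(A, modified=False)
Q_mgs, R_mgs = gram_schmidt(A, modified=True)
cgs_error = orthogonality_error(Q_cgs)
mgs_error = orthogonality_error(Q_mgs)
print("classical:", cgs_error, "modified:", mgs_error)
```
<br />
On this matrix the classical variant produces second and third columns that are far from orthogonal, while the modified variant keeps the orthogonality error on the order of <code>eps</code>.<br />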
<br />
<br />
<br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png|centre]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px|centre]]<br />
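<br />
The projection step can also be sketched in a few lines of code. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the stored task A gradients and the task B gradient are random stand-ins, and the helper names <code>extend_basis</code> and <code>project_orthogonal</code> are hypothetical.<br />
<br />
```python
import numpy as np


def extend_basis(basis, new_gradients, tol=1e-10):
    """Add gradient directions to an orthonormal basis (modified Gram-Schmidt)."""
    basis = list(basis)
    for g in new_gradients:
        v = np.asarray(g, dtype=float).copy()
        for u in basis:
            v -= (u @ v) * u
        norm = np.linalg.norm(v)
        if norm > tol:  # skip directions that are already in the span
            basis.append(v / norm)
    return basis


def project_orthogonal(gradient, basis):
    """Remove from `gradient` every component lying in span(basis)."""
    g = np.asarray(gradient, dtype=float).copy()
    for u in basis:
        g -= (u @ g) * u
    return g


rng = np.random.default_rng(0)
n_params = 50                                   # overparameterized: 50 >> 5
task_a_grads = rng.normal(size=(5, n_params))   # stand-in for stored task A gradients
basis = extend_basis([], task_a_grads)

g_b = rng.normal(size=n_params)                 # stand-in for a task B loss gradient
g_tilde = project_orthogonal(g_b, basis)

w = rng.normal(size=n_params)                   # stand-in for the network parameters
learning_rate = 0.01                            # assumed value; kept small, see Review
w_new = w - learning_rate * g_tilde
print(np.max(np.abs(task_a_grads @ g_tilde)))   # close to zero
```
<br />
To first order, the update <code>w_new - w</code> leaves the stored task A directions unchanged because it has no component along any of them.<br />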
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, OGD-GTL, and OGD-ALL are compared to SGD, EWC [2] (a regularization method using Fisher information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground truth "cheat" model which has access to all data throughout training). The experiments were performed for the following three continual learning benchmarks: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task is a fixed permutation that gets applied to each MNIST digit. The figure below compares the performance of the different methods on permuted MNIST. The comparison is based on accuracy across 3 different tasks. Training is done for 15 epochs (5 for each of the three permutations). The switch in permutations is indicated in the graph with vertical lines.<br />
<br />
[[File:PMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each task after sequentially training on all the tasks. Thus, if catastrophic forgetting has been solved, the accuracies should be constant across tasks. If not, then there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG|centre]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. The following figure shows the accuracies of different methods when trained on Rotated MNIST with different degrees. Each method is trained for 10 epochs (5 on standard MNIST and 5 on rotated MNIST) and predictions are made over the original MNIST. Each accuracy bar is a mean over 10 runs.<br />
<br />
[[File:RMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:RMNIST.PNG|centre]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. The following figure shows the accuracies of different methods when trained on Split MNIST.<br />
<br />
[[File:SMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:SMNIST.PNG|centre]]<br />
<br />
Also, the table below shows the performance on Rotated MNIST and Permuted MNIST as a function of the number of gradients stored.<br />
<br />
[[File:ogd.png|centre]]<br />
<br />
Overall, OGD performs better than EWC, A-GEM, and SGD. The primary metric to look for is decreasing performance in the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario, shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe as when naively doing SGD. OGD performs much better than EWC and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small, in order to respect the assumption that orthogonal gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future studies based on this technique. Firstly, finding a way to store more gradients or to prioritize the important directions could improve the results. Secondly, all the proposed methods, including this one, fail when the tasks are dissimilar. Finding ways to maintain performance under task dissimilarity can be an interesting research direction. Thirdly, solving the learning rate sensitivity would make this method more appealing when a large learning rate is desired. Finally, another interesting direction for future work is extending the current method to other types of optimizers such as Adam and Adagrad, or even to second-order or quasi-Newton methods.<br />
<br />
One interesting way to increase the learning rate could be to consider the gradient magnitude of the parameters on data from the former task. If, for some specific parameters, the gradient magnitude on data from task A is low, then intuitively they have not captured much information from task A. With this in mind, we could at least increase the learning rate for updating these weights so that we can use them for task B.<br />
<br />
A valuable resource for continual learning is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Critique == <br />
The authors proposed an interesting idea for mitigating the catastrophic forgetting that is likely to happen in the online learning setting. Although Orthogonal Gradient Descent achieves state-of-the-art results in practice for continual learning, the authors have not provided a theoretical guarantee. [12] derived the first generalization guarantees for OGD in continual learning with overparameterized neural networks. [12] also showed that OGD is only robust to catastrophic forgetting across a single task, and proposed OGD+ for an arbitrary number of tasks.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.<br />
<br />
[11] Wikipedia: https://en.wikipedia.org/wiki/Loss_of_significance<br />
<br />
[12] Bennani, Mehdi Abbana, and Masashi Sugiyama. "Generalisation guarantees for continual learning with orthogonal gradient descent." arXiv preprint arXiv:2006.11942 (2020).</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=49049DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-12-04T00:30:05Z<p>Jlavilez: Corrected spelling and improved grammar</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning (RL) is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning. It refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalising' the network based on its behaviour over time. Intelligent agents are able to accomplish tasks that may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviours purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance. The proposed method is based on model-based RL with a latent state representation that is learned via prediction. The authors have changed the belief representations to learn a critic directly on latent state samples, which helps to enable scaling to more complex tasks. <br />
<br />
The main finding of the paper is that long-horizon behaviours can be learned by latent imagination. This avoids the shortsightedness that comes with using finite imagination horizons. The authors have also managed to demonstrate empirical performance for visual control by evaluating the model on image inputs.<br />
<br />
[[File:Figure1 paper.png|100px|center]]<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take are defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally, we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value function of a given policy <math display="inline">\pi</math> respectively. Informally, <math>V_{\pi}</math> tells one how good a state is in terms of the expected return when starting in the state <math>s</math> and then following the policy <math>\pi</math>. Similarly, <math>Q_{\pi}</math> gives the value of the expected return starting from the state <math>s</math>, taking the action <math>a</math>, and subsequently following the policy <math>\pi</math>. <br />
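<br />
These informal descriptions can be written out explicitly. The following uses the common discounted-return convention, where the discount factor <math display="inline">\gamma \in [0, 1)</math> is an assumption introduced here for illustration, and <math display="inline">S_t, A_t, R_t</math> denote the state, action, and reward at time <math display="inline">t</math>:<br />
\[<br />
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s \right], \qquad Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\middle|\, S_t = s, A_t = a \right]<br />
\]<br />
For a deterministic policy <math display="inline">\pi : \mathcal{S} \to \mathcal{A}</math> as defined above, the two are related by <math display="inline">V_{\pi}(s) = Q_{\pi}(s, \pi(s))</math>.<br />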
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that, for all states <math display="inline">s</math> and actions <math display="inline">a</math>, <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, producing a sequence of states, actions, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>step</b>, and the whole sequence up to the terminal time <math display="inline">T</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
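<br />
The feedback loop can be made concrete with a toy example. This is a minimal sketch under stated assumptions: <code>ChainEnvironment</code>, <code>random_policy</code>, and <code>run_episode</code> are hypothetical names for a small chain-walk problem and are not taken from the paper.<br />
<br />
```python
import random


class ChainEnvironment:
    """Toy environment: walk on states 0..goal, reward 1 on reaching the goal."""

    def __init__(self, goal=3):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 or +1; the state is clipped to stay inside [0, goal]
        self.state = min(max(self.state + action, 0), self.goal)
        done = self.state == self.goal
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, rng):
    """A policy that ignores the state and moves left or right at random."""
    return rng.choice([-1, 1])


def run_episode(env, policy, max_steps=100, seed=0):
    """Collect (S_t, A_t, R_t) tuples until the episode ends."""
    rng = random.Random(seed)
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state, rng)                  # agent chooses A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns R_t and S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, state


trajectory, final_state = run_episode(ChainEnvironment(), random_policy)
print(len(trajectory), final_state)
```
<br />
Each element of <code>trajectory</code> is one state, action, and reward tuple, and the returned final state plays the role of <math display="inline">S_T</math>.<br />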
<br />
== Motivation ==<br />
<br />
In many problems, the number of actions an agent is able to take is limited, which makes it difficult to interact with the environment enough to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update its representation of the world using the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. From a high-level perspective, Dreamer first learns latent dynamics from past experience, then it learns actions and state values from imagined trajectories to maximise future rewards. Finally, it predicts the next action and executes it. This whole process is illustrated below. <br />
<br />
[[File: dreamer_overview.png | 600px | center]]<br />
<br />
<br />
Let's look at Dreamer in detail. It consists of:<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math>o_{t}</math> is the observation at time <math>t</math> and <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
The three main components of agent learning in imagination are dynamics learning, behaviour learning, and environment interaction. In the compact latent space of the world model, the behaviour is learned by predicting hypothetical trajectories. Throughout the agent's lifetime, Dreamer performs the following operations either in parallel or interleaved as shown in Figure 3 and Algorithm 1:<br />
<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behaviour Learning: In the latent space, the agent predicts state values and actions that maximize future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:ashraf98.png|frameless|700px|Dreamer algorithm|center]]<br />
<br />
Notice that three neural networks are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively. The action model tries to solve the imagination environment by predicting various actions. Meanwhile, the value model estimates the expected rewards that the action model will achieve. Hence, these two models are trained cooperatively whereby the action model tries to maximize the estimated value while the value model gives the estimate based on the action model's actions.<br />
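<br />
The interplay between the models during imagination can be sketched as follows. This is a rough illustration under stated assumptions and not Dreamer's actual training code: the learned models are replaced by untrained random maps, the transition is taken to be deterministic, and the return estimate is a plain discounted sum with a bootstrapped value at the end of the horizon. The discount <code>gamma</code> and all function names are hypothetical.<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2

# Random, untrained stand-ins for the learned networks.
W_s = 0.5 * rng.normal(size=(state_dim, state_dim))
W_a = 0.5 * rng.normal(size=(state_dim, action_dim))
w_r = rng.normal(size=state_dim)
W_pi = rng.normal(size=(action_dim, state_dim))
w_v = rng.normal(size=state_dim)


def transition(s, a):
    """Stand-in for the transition model q_theta(s_t | s_{t-1}, a_{t-1})."""
    return np.tanh(W_s @ s + W_a @ a)


def reward(s):
    """Stand-in for the mean of the reward model q_theta(r_t | s_t)."""
    return float(w_r @ s)


def action(s):
    """Stand-in for the action model q_phi(a_t | s_t)."""
    return np.tanh(W_pi @ s)


def value(s):
    """Stand-in for the value model v_psi(s_t)."""
    return float(w_v @ s)


def imagined_return(s0, horizon, gamma=0.99):
    """Roll the latent model forward without touching the real environment."""
    s, total = s0, 0.0
    for k in range(horizon):
        a = action(s)                  # action model proposes an action
        s = transition(s, a)           # transition model imagines the next state
        total += gamma ** k * reward(s)
    # the value model accounts for rewards beyond the imagination horizon
    return total + gamma ** horizon * value(s)


s0 = rng.normal(size=state_dim)
estimate = imagined_return(s0, horizon=15)
print(estimate)
```
<br />
In this picture, the action model would be adjusted to increase <code>imagined_return</code>, while the value model would be regressed towards such estimates, which is the cooperative training described above.<br />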
<br />
== Related Works ==<br />
<br />
Previous works that exploited latent dynamics can be grouped into three categories:<br />
<br />
* Visual Control with latent dynamics by derivative-free policy learning or online planning.<br />
* Augment model-free agents with multi-step predictions.<br />
* Use analytic gradients of Q-values.<br />
<br />
While the latter approaches are often used for low-dimensional tasks, Dreamer uses analytic gradients to efficiently learn long-horizon behaviours for visual control purely by latent imagination.<br />
<br />
== Results ==<br />
The following figure plots the reward against the number of environment steps. Dreamer outperforms the other baseline algorithms and also converges much faster. <br />
[[File:dreamer.paper19.png|center|frameless|500px|Rewards vs environment steps of Dreamer and other baseline algorithms]]<br />
<br />
<br />
The figure below summarises Dreamer's performance compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Using the same hyperparameters for all tasks, Dreamer exceeds previous model-based and model-free agents in terms of data efficiency, computation time, and final performance, and overall it achieves the most consistent performance among them. Additionally, while other agents rely heavily on prior experience, Dreamer is able to learn behaviours with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|center|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications, as many agents rely on prior experience which may be hard to obtain in the real world. Although it may be an extreme example, consider a reinforcement learning agent that must learn how to perform rare surgeries without enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment. Also, as future work on representation learning, the ability to scale latent imagination to environments of higher visual complexity can be investigated.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at https://github.com/google-research/dreamer. <br />
<br />
== Critique ==<br />
This paper presents an approach that involves learning a latent dynamics model to learn 20 visual control tasks.<br />
<br />
Appendix A of the paper states that "three dense layers of size 300 with ELU activations" and "30-dimensional diagonal Gaussians" are used for the distributions in latent space. The paper would have benefitted from explaining how the authors arrived at this architecture, in other words, how the choice of latent vector affects the performance of the agent.<br />
<br />
Another fact about Dreamer is that it learns long-horizon behaviours purely by latent imagination, unlike previous approaches. It is also applicable to tasks with discrete actions and early episode termination.<br />
<br />
<br />
Learning a policy from visual inputs is quite an interesting research direction in RL. This paper takes a step in this direction by improving existing model-based methods (the world models and PlaNet) using the actor-critic approach, but in my view the method is an incremental contribution, as back-propagating gradients through values and dynamics has been studied in previous works.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviours by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=45219GradientLess Descent2020-11-17T23:28:18Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider smooth, convex, noiseless objective functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, then this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">\textbf{H}(f)</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
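<br />
As a quick sanity check of Definition 1, the two inequalities can be verified numerically for a diagonal quadratic whose Hessian eigenvalues lie between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. This is only an illustrative sketch; the specific values of <code>alpha</code>, <code>beta</code>, and <code>n</code> are assumptions.<br />
<br />
```python
import numpy as np

alpha, beta, n = 1.0, 10.0, 6
eigenvalues = np.linspace(alpha, beta, n)  # Hessian spectrum inside [alpha, beta]
H = np.diag(eigenvalues)


def f(x):
    return 0.5 * x @ H @ x


def grad_f(x):
    return H @ x


rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x, y = rng.normal(size=n), rng.normal(size=n)
    linear = f(x) + grad_f(x) @ (y - x)          # first-order expansion at x
    sq_dist = np.sum((y - x) ** 2)
    lower = linear + 0.5 * alpha * sq_dist       # strong convexity bound
    upper = linear + 0.5 * beta * sq_dist        # smoothness bound
    if not (lower - 1e-9 <= f(y) <= upper + 1e-9):
        violations += 1

condition_number = beta / alpha
print(violations, condition_number)
```
<br />
No pair of points violates either bound, so this function is <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth with condition number <math display="inline">\beta / \alpha</math>.<br />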
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available, or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, a large part of the work in this area focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound. Here the latent dimension is <math display="inline"> k </math> and <math display="inline"> k < n </math>, where <math display="inline"> n </math> is the input dimension.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform a binary search over several concentric circles and randomly sample points, in the hopes that if we take a small step in a random direction this will reduce the value of the objective function.<br />
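<br />
The search loop can be sketched in code. This is a simplified sketch under stated assumptions rather than a transcription of the algorithm in the figure: candidates are drawn uniformly from balls whose radii halve from <code>r_max</code> down to roughly <code>r_min</code>, and the best point among the current iterate and the candidates is kept. The function names, the radii, and the quadratic test objective are all chosen for illustration.<br />
<br />
```python
import math

import numpy as np


def sample_ball(rng, n, radius):
    """Uniform sample from the n-dimensional ball of the given radius."""
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)
    return radius * rng.uniform() ** (1.0 / n) * direction


def gld_search(f, x0, r_min, r_max, iterations, seed=0):
    """Gradientless descent using only function evaluations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    K = int(math.ceil(math.log2(r_max / r_min)))  # number of radius halvings
    for _ in range(iterations):
        best_x, best_f = x, fx
        for k in range(K + 1):
            y = x + sample_ball(rng, x.size, 2.0 ** (-k) * r_max)
            fy = f(y)
            if fy < best_f:           # keep a candidate only if it improves
                best_x, best_f = y, fy
        x, fx = best_x, best_f        # the objective never increases
    return x, fx


n = 5
H = np.diag(np.linspace(1.0, 4.0, n))


def objective(x):
    return 0.5 * x @ H @ x


x0 = np.ones(n)
x_final, f_final = gld_search(objective, x0, r_min=1e-3, r_max=4.0, iterations=300)
print(objective(x0), f_final)
```
<br />
Trying every radius scale in each iteration avoids having to know the right step size in advance, at the cost of a logarithmic number of extra function evaluations per step.<br />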
<br />
===Proof of correctness===<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
<math display="inline"> \forall x \in K</math> s.t. <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice how the second theorem implies that, with probability at least <math display="inline">\frac{1}{4}</math>, <math display="inline">f(y)</math> is closer to the optimum <math display="inline">f(x^*)</math> than <math display="inline">f(x)</math> is.<br />
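<br />
To see roughly how this per-step guarantee leads to a convergence rate, here is a simplified sketch (not the paper's full argument, which must also account for the success probability and the search over radii). Assume that unsuccessful iterations never increase the objective, and let <math display="inline">x_0</math> be the starting point and <math display="inline">x</math> the iterate after <math display="inline">m</math> successful iterations. Using <math display="inline">1 - u \leq e^{-u}</math>,<br />
\[<br />
f(x) - f(x^*) \leq \left( 1 - \frac{1}{5nQ} \right)^{m} (f(x_0) - f(x^*)) \leq \exp\left( -\frac{m}{5nQ} \right) (f(x_0) - f(x^*))<br />
\]<br />
so <math display="inline">m \geq 5nQ \log \frac{1}{\epsilon}</math> successful iterations suffice to shrink the initial optimality gap by a factor of <math display="inline">\epsilon</math>.<br />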
<br />
For proof of these theorems, please watch my talk.<br />
<br />
[[File: GLD2.PNG|frame| Gradientless Descent with Fast Binary Search.]]<br />
<br />
In the current form of GradientLess Descent Algorithm presented here, the lower and upper limits of the search radius i.e. <math display="inline">[r, R]</math> remain unchanged for the entire run of the algorithm. As proven by the correctness of this algorithm, this does ensure convergence but this version of the algorithm does not take advantage of the upper bound of the condition number <math display="inline">Q</math> and therefore, has an extra factor of <math display="inline">\log \frac{1}{\epsilon}</math> in its overall cost.<br />
<br />
A variation of this algorithm termed '''Gradientless Descent with Fast Binary Search (GLD-Fast)''', eliminates this additional factor from the overall cost through reduction in the range of the binary search by shrinking <math display="inline">R</math> in half after every <math display="inline">H</math> iterations (where <math display="inline">H</math> is determined by <math display="inline">Q</math>).<br />
<br />
<br />
<br />
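The parameters <math display="inline">K</math> and <math display="inline">H</math> are determined by the following equations:<br />
<br />
<math display="inline">K = \log(4\sqrt{Q})</math><br />
<br />
<math display="inline">H = nQ \log(Q)</math><br />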
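<br />
As a rough illustration, the two formulas and the halving schedule for <math display="inline">R</math> can be computed as in the sketch below. The function names are hypothetical, and the base-2 logarithm and the rounding up to integers are assumptions made here; the text above does not specify either.<br />
<br />
```python
import math


def gld_fast_parameters(n, Q):
    """Return (K, H) with K = log(4*sqrt(Q)) and H = n*Q*log(Q).

    Assumptions (not stated in the text): logarithms are base 2 and
    both values are rounded up to integers, with H at least 1.
    """
    if n < 1 or Q < 1:
        raise ValueError("need n >= 1 and Q >= 1")
    K = math.ceil(math.log2(4.0 * math.sqrt(Q)))
    H = max(1, math.ceil(n * Q * math.log2(Q)))
    return K, H


def max_radius_at(R, H, t):
    """Upper search radius at iteration t when R is halved every H iterations."""
    return R * 2.0 ** (-(t // H))


K, H = gld_fast_parameters(n=10, Q=16)
print("K =", K, "H =", H)
print("radius at t = 0:", max_radius_at(8.0, H, 0))
print("radius at t = H:", max_radius_at(8.0, H, H))
```
<br />
The schedule makes the dependence on <math display="inline">Q</math> explicit: a larger upper bound on the condition number means more iterations are spent at each radius before it is halved.<br />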
<br />
==Results==<br />
<br />
We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Randomised Search algorithm proposed in 2011.<br />
<br />
[[File:GLDBeatsARS.PNG|1000px|]]<br />
<br />
For this comparison, we defined the function <math display="inline">f(x) = \frac{1}{2} x^T H x </math> where <math display="inline">H</math> is a diagonal matrix with eigenvalues linearly interpolating the interval <math display="inline">[\alpha , \beta]</math>. We observe that in most scenarios, GradientLess Descent beats the benchmark.<br />
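<br />
The test objective described above can be built in a few lines. This is only a sketch: the helper name <code>make_test_objective</code> is hypothetical, and the values of <math display="inline">\alpha</math>, <math display="inline">\beta</math> and the dimension are arbitrary choices for illustration, not the ones used in the experiments.<br />
<br />
```python
def make_test_objective(alpha, beta, n):
    """Build f(x) = 0.5 * x^T H x for diagonal H.

    The n diagonal entries (the eigenvalues of H) are spaced
    linearly from alpha to beta, so the condition number is beta / alpha.
    """
    if n < 2:
        raise ValueError("need n >= 2 to interpolate between alpha and beta")
    eigs = [alpha + (beta - alpha) * i / (n - 1) for i in range(n)]

    def f(x):
        return 0.5 * sum(h * xi * xi for h, xi in zip(eigs, x))

    return f, eigs


f, eigs = make_test_objective(alpha=1.0, beta=9.0, n=5)
print("eigenvalues:", eigs)
print("f at the all-ones vector:", f([1.0] * 5))
```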
<br />
==Conclusion==<br />
This research paper has analysed a randomised algorithm where a search direction is sampled from the standard Gaussian. This is a direct search-based algorithm, whose convergence rate depends only polylogarithmically on the dimensionality for any monotone transform of a smooth and strongly convex objective with a low-dimensional structure. In this algorithm, the step size is chosen by an approximate line search over a grid of values spanning an interval with uniform spacing on a log-scale. The authors show a geometric decrease of the function value regret, up to a constant defined by the minimum step size, on strongly convex functions with Lipschitz smooth gradient.<br />
<br />
==Critiques==<br />
<br />
1- Although the theoretical guarantees presented in the paper are interesting, it is not clear how this algorithm is applicable in practice. The paper assumes that we do not have explicit access to the objective function and can only use function evaluations. Besides, there is a strong assumption that the function is smooth and strongly convex. Given this, my main concern is how we can make sure that the objective function is smooth and strongly convex when we do not have explicit access to it. (If we do have explicit access to the function and it is smooth and strongly convex, why shouldn't we use gradient-based methods?) Further, what happens if the objective function violates the smoothness and strong convexity conditions? Can we still employ this algorithm?<br />
<br />
In response to the above comments:<br />
<br />
This algorithm has many practical applications especially in the field of reinforcement learning. A major concept in reinforcement learning is the concept of a reward function (which we either wish to minimise or maximise). In particular, the reward function may be hidden behind a black box. For example, consider a "theoretical" slot machine where we only see how much money we get if we win, but do not know how the amount is determined. It is true that in general, these objective functions may not be smooth or strongly convex, but one is able to either make certain assumptions about the reward function or relax certain conditions about the state of the world in order to create a reward function that is smooth or convex. Additionally, certain gradients may not have an analytical form, in which case numerical calculation for gradients may be computationally expensive. This method allows a way to bypass the gradient computations altogether!<br />
<br />
To back the response: <br />
They have demonstrated that their algorithm can be successfully applied to '''MuJoCo''' benchmarks, where the objective function is '''not''' strongly convex and smooth.<br />
- Providing more graphical illustrations alongside the proofs of the lemmas would make the paper easier to follow.<br />
<br />
==Bibliography==<br />
<br />
1. Daniel Golovin et al. "Gradientless descent: High-dimensional zeroth-order optimization". In: arXiv preprint arXiv:1911.06317 (2019).<br />
<br />
2. Shengqiao Li. "Concise formulas for the area and volume of a hyperspherical cap". In: Asian Journal of Mathematics and Statistics 4.1 (2011), pp. 66-70.<br />
<br />
3. Yurii Nesterov and Vladimir Spokoiny. "Random gradient-free minimization of convex functions". In: Foundations of Computational Mathematics 17.2 (2017), pp. 527-566.<br />
<br />
4. R Tyrrell Rockafellar. Convex analysis. 28. Princeton university press, 1970.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=43489GradientLess Descent2020-11-08T21:02:28Z<p>Jlavilez: Do not butcher the Queen's, please. Reverted grammar to proper English.</p>
<hr />
<div>==Introduction==<br />
<br />
In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, then this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">\textbf{H}(f)</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
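<br />
To make Definition 1 and the condition number concrete, the sketch below spot-checks both inequalities at random pairs of points for a diagonal quadratic. The function names and the eigenvalues are hypothetical choices made for illustration, and a random spot-check is of course not a proof that the inequalities hold everywhere.<br />
<br />
```python
import random


def quadratic(eigs):
    """Return f(x) = 0.5 * sum(h_i * x_i^2) and its gradient, i.e. H = diag(eigs)."""
    def f(x):
        return 0.5 * sum(h * xi * xi for h, xi in zip(eigs, x))

    def grad(x):
        return [h * xi for h, xi in zip(eigs, x)]

    return f, grad


def satisfies_definition_1(f, grad, alpha, beta, dim, trials=1000, seed=0):
    """Spot-check the strong-convexity and smoothness inequalities at random pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        d = [yi - xi for xi, yi in zip(x, y)]
        # First-order approximation f(x) + <grad f(x), y - x>.
        linear = f(x) + sum(g * di for g, di in zip(grad(x), d))
        sq = sum(di * di for di in d)
        tol = 1e-9 * (1.0 + abs(f(y)))  # allowance for rounding error
        if f(y) < linear + 0.5 * alpha * sq - tol:
            return False  # strong-convexity lower bound violated
        if f(y) > linear + 0.5 * beta * sq + tol:
            return False  # smoothness upper bound violated
    return True


eigs = [1.0, 2.5, 4.0, 10.0]
alpha, beta = min(eigs), max(eigs)
f, grad = quadratic(eigs)
print(satisfies_definition_1(f, grad, alpha, beta, dim=len(eigs)))
print("beta / alpha =", beta / alpha)
```
<br />
For a quadratic, the smallest and largest eigenvalues of the Hessian serve as <math display="inline">\alpha</math> and <math display="inline">\beta</math>, so any <math display="inline">Q \geq \frac{\beta}{\alpha}</math> is a valid condition number bound.<br />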
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available, or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, much of the work in this area focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound. Here the latent dimension is <math display="inline"> k </math> and <math display="inline"> k < n </math>, where <math display="inline"> n </math> is the input dimension.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform a binary search over several concentric balls, whose radii differ by powers of two, and randomly sample points from them, in the hope that a step of suitable size in a random direction will reduce the value of the objective function.<br />
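<br />
The search described above can be sketched in code as follows. This is a minimal illustration based only on the description in this summary, not the authors' implementation: the names <code>sample_in_ball</code> and <code>gld_search</code>, the choice of radii <math display="inline">R, R/2, R/4, \ldots</math> down to roughly <math display="inline">r</math>, and the rule of keeping the best candidate (including the current point) are assumptions.<br />
<br />
```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw a point uniformly from the Euclidean ball B(center, radius)."""
    n = len(center)
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(d * d for d in direction))
    # A uniform point in the ball has its distance distributed as radius * U^(1/n).
    scale = radius * rng.random() ** (1.0 / n) / norm
    return [c + scale * d for c, d in zip(center, direction)]


def gld_search(f, x0, R, r, steps, seed=0):
    """Gradientless descent sketch using only evaluations of f.

    At each step one candidate is sampled from each ball of radius
    R * 2^-k, k = 0..K, and the best point seen (or the current point)
    is kept, so the objective value never increases.
    """
    if not (R >= r > 0):
        raise ValueError("need R >= r > 0")
    rng = random.Random(seed)
    K = math.ceil(math.log2(R / r))
    x, fx = list(x0), f(x0)
    for _ in range(steps):
        best, fbest = x, fx
        for k in range(K + 1):
            y = sample_in_ball(x, R * 2.0 ** (-k), rng)
            fy = f(y)
            if fy < fbest:
                best, fbest = y, fy
        x, fx = best, fbest
    return x, fx


def objective(x):
    # A simple smooth, strongly convex test function chosen for illustration.
    return sum((i + 1) * xi * xi for i, xi in enumerate(x))


x_final, f_final = gld_search(objective, [1.0] * 5, R=1.0, r=1e-3, steps=200)
print("initial value:", objective([1.0] * 5))
print("final value:", f_final)
```
<br />
Trying a whole range of radii at every step is what removes the need to know the right step size in advance: whichever radius happens to be of the right scale relative to the distance from the optimum gives a reasonable chance of improvement.<br />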
<br />
===Proof of correctness===<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
<math display="inline"> \forall x \in K</math> s.t. <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice that the second theorem implies that, with probability at least <math display="inline">\frac{1}{4}</math>, <math display="inline">f(y)</math> is closer to the optimal value <math display="inline">f(x^*)</math> than <math display="inline">f(x)</math> is.<br />
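<br />
To see how this per-step guarantee leads to a convergence rate, the following informal calculation may help. It is only a sketch and not the paper's proof: the iterates <math display="inline">x_t</math> and the count <math display="inline">T</math> of successful steps are notation introduced here for illustration, and we assume that unsuccessful steps never increase <math display="inline">f</math>. If the inequality of Theorem 2 holds at <math display="inline">T</math> of the steps, then using <math display="inline">1 - t \leq e^{-t}</math> we get<br />
\begin{align*}<br />
f(x_T) - f(x^*) \leq \left( 1 - \frac{1}{5nQ} \right)^T (f(x_0) - f(x^*)) \leq e^{-T/(5nQ)} (f(x_0) - f(x^*)),<br />
\end{align*}<br />
so <math display="inline">T \geq 5nQ \ln \frac{1}{\epsilon}</math> successful steps are enough to shrink the initial gap by a factor of <math display="inline">\epsilon</math>.<br />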
<br />
For proof of these theorems, please watch my talk.<br />
<br />
[[File: GLD2.PNG|frame| Gradientless Descent with Fast Binary Search.]]<br />
<br />
In the form of the GradientLess Descent algorithm presented here, the lower and upper limits of the search radius, i.e. <math display="inline">[r, R]</math>, remain unchanged for the entire run of the algorithm. As the proof of correctness shows, this does ensure convergence, but this version of the algorithm does not take advantage of the upper bound on the condition number <math display="inline">Q</math> and therefore has an extra factor of <math display="inline">\log \frac{1}{\epsilon}</math> in its overall cost.<br />
<br />
A variation of this algorithm, termed '''Gradientless Descent with Fast Binary Search (GLD-Fast)''', eliminates this additional factor from the overall cost by reducing the range of the binary search: <math display="inline">R</math> is halved after every <math display="inline">H</math> iterations (where <math display="inline">H</math> is determined by <math display="inline">n</math> and <math display="inline">Q</math>).<br />
<br />
<br />
<br />
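The parameters <math display="inline">K</math> and <math display="inline">H</math> are determined by the following equations:<br />
<br />
<math display="inline">K = \log(4\sqrt{Q})</math><br />
<br />
<math display="inline">H = nQ \log(Q)</math><br />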
<br />
==Results==<br />
<br />
We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Randomised Search algorithm proposed in 2011.<br />
<br />
[[File:GLDBeatsARS.PNG|1000px|]]<br />
<br />
For this comparison, we defined the function <math display="inline">f(x) = \frac{1}{2} x^T H x </math> where <math display="inline">H</math> is a diagonal matrix with eigenvalues linearly interpolating the interval <math display="inline">[\alpha , \beta]</math>. We observe that in most scenarios, GradientLess Descent beats the benchmark.<br />
<br />
==Conclusion==<br />
This research paper has analysed a randomised algorithm where a search direction is sampled from the standard Gaussian. This is a direct search-based algorithm, whose convergence rate depends only polylogarithmically on the dimensionality for any monotone transform of a smooth and strongly convex objective with a low-dimensional structure. In this algorithm, the step size is chosen by an approximate line search over a grid of values spanning an interval with uniform spacing on a log-scale. The authors show a geometric decrease of the function value regret, up to a constant defined by the minimum step size, on strongly convex functions with Lipschitz smooth gradient.<br />
<br />
==Critiques==<br />
<br />
Although the theoretical guarantees presented in the paper are interesting, it is not clear how this algorithm is applicable in practice. The paper assumes that we do not have explicit access to the objective function and can only use function evaluations. Besides, there is a strong assumption that the function is smooth and strongly convex. Given this, my main concern is how we can make sure that the objective function is smooth and strongly convex when we do not have explicit access to it. (If we do have explicit access to the function and it is smooth and strongly convex, why shouldn't we use gradient-based methods?) Further, what happens if the objective function violates the smoothness and strong convexity conditions? Can we still employ this algorithm?<br />
<br />
==Bibliography==<br />
<br />
1. Daniel Golovin et al. "Gradientless descent: High-dimensional zeroth-order optimization". In: arXiv preprint arXiv:1911.06317 (2019).<br />
<br />
2. Shengqiao Li. "Concise formulas for the area and volume of a hyperspherical cap". In: Asian Journal of Mathematics and Statistics 4.1 (2011), pp. 66-70.<br />
<br />
3. Yurii Nesterov and Vladimir Spokoiny. "Random gradient-free minimization of convex functions". In: Foundations of Computational Mathematics 17.2 (2017), pp. 527-566.<br />
<br />
4. R Tyrrell Rockafellar. Convex analysis. 28. Princeton university press, 1970.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=43273GradientLess Descent2020-11-03T18:48:13Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">\textbf{H}f</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, much of the work in this area focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound. Here the latent dimension is <math display="inline"> k </math> and <math display="inline"> k < n </math>, where <math display="inline"> n </math> is the input dimension.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform a binary search over several concentric balls, whose radii differ by powers of two, and randomly sample points from them, in the hope that a step of suitable size in a random direction will reduce the value of the objective function.<br />
<br />
===Proof of correctness===<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
<math display="inline"> \forall x \in K</math> s.t. <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice that the second theorem implies that, with probability at least <math display="inline">\frac{1}{4}</math>, <math display="inline">f(y)</math> is closer to the optimal value <math display="inline">f(x^*)</math> than <math display="inline">f(x)</math> is.<br />
<br />
For proofs of these theorems, please watch my talk.<br />
<br />
==Results==<br />
<br />
We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Randomised Search algorithm proposed in 2011.<br />
<br />
[[File:GLDBeatsARS.PNG|1000px|]]<br />
<br />
For this comparison, we defined the function <math display="inline">f(x) = \frac{1}{2} x^T H x </math> where <math display="inline">H</math> is a diagonal matrix with eigenvalues linearly interpolating the interval <math display="inline">[\alpha , \beta]</math>. We observe that in most scenarios, GradientLess Descent beats the benchmark.<br />
<br />
==Conclusion==<br />
This research paper has analyzed a randomized algorithm where a search direction is sampled from the standard Gaussian. This is a direct search-based algorithm, whose convergence rate depends only polylogarithmically on the dimensionality for any monotone transform of a smooth and strongly convex objective with a low-dimensional structure. In this algorithm, the step size is chosen by an approximate line search over a grid of values spanning an interval with uniform spacing on a log-scale. The authors show a geometric decrease of the function value regret, up to a constant defined by the minimum step size, on strongly convex functions with Lipschitz smooth gradient.<br />
<br />
<br />
==Bibliography==<br />
<br />
1. Daniel Golovin et al. "Gradientless descent: High-dimensional zeroth-order optimization". In: arXiv preprint arXiv:1911.06317 (2019).<br />
<br />
2. Shengqiao Li. "Concise formulas for the area and volume of a hyperspherical cap". In: Asian Journal of Mathematics and Statistics 4.1 (2011), pp. 66-70.<br />
<br />
3. Yurii Nesterov and Vladimir Spokoiny. "Random gradient-free minimization of convex functions". In: Foundations of Computational Mathematics 17.2 (2017), pp. 527-566.<br />
<br />
4. R Tyrrell Rockafellar. Convex analysis. 28. Princeton university press, 1970.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=43272GradientLess Descent2020-11-03T18:47:46Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
<math display="inline"> \forall x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">\textbf{H}f</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, much of the work in this area focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound. Here the latent dimension is <math display="inline"> k </math> and <math display="inline"> k < n </math>, where <math display="inline"> n </math> is the input dimension.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform a binary search over several concentric balls, whose radii differ by powers of two, and randomly sample points from them, in the hope that a step of suitable size in a random direction will reduce the value of the objective function.<br />
<br />
===Proof of correctness===<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
<math display="inline"> \forall x \in K</math> s.t. <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice that the second theorem implies that, with probability at least <math display="inline">\frac{1}{4}</math>, <math display="inline">f(y)</math> is closer to the optimal value <math display="inline">f(x^*)</math> than <math display="inline">f(x)</math> is.<br />
<br />
For proofs of these theorems, please watch my talk.<br />
<br />
==Results==<br />
<br />
We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Randomised Search algorithm proposed in 2011.<br />
<br />
[[File:GLDBeatsARS.PNG|1000px|]]<br />
<br />
For this comparison, we defined the function <math display="inline">f(x) = \frac{1}{2} x^T H x </math> where <math display="inline">H</math> is a diagonal matrix with eigenvalues linearly interpolating the interval <math display="inline">[\alpha , \beta]</math>. We observe that in most scenarios, GradientLess Descent beats the benchmark.<br />
<br />
==Conclusion==<br />
This research paper has analyzed a randomized algorithm where a search direction is sampled from the standard Gaussian. This is a direct search-based algorithm, whose convergence rate depends only polylogarithmically on the dimensionality for any monotone transform of a smooth and strongly convex objective with a low-dimensional structure. In this algorithm, the step size is chosen by an approximate line search over a grid of values spanning an interval with uniform spacing on a log-scale. The authors show a geometric decrease of the function value regret, up to a constant defined by the minimum step size, on strongly convex functions with Lipschitz smooth gradient.<br />
<br />
<br />
==Bibliography==<br />
<br />
[1] Daniel Golovin et al. "Gradientless descent: High-dimensional zeroth-order optimization". In: arXiv preprint arXiv:1911.06317 (2019).<br />
[2] Shengqiao Li. "Concise formulas for the area and volume of a hyperspherical cap". In: Asian Journal of Mathematics and Statistics 4.1 (2011), pp. 66-70.<br />
[3] Yurii Nesterov and Vladimir Spokoiny. "Random gradient-free minimization of convex functions". In: Foundations of Computational Mathematics 17.2 (2017), pp. 527-566.<br />
[4] R Tyrrell Rockafellar. Convex analysis. 28. Princeton university press, 1970.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=43145stat940F212020-11-02T17:46:39Z<p>Jlavilez: </p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || https://arxiv.org/pdf/1904.04232.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:GradientLessDescent.pdf&diff=43142File:GradientLessDescent.pdf2020-11-02T17:43:04Z<p>Jlavilez: </p>
<hr />
<div></div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=43141GradientLess Descent2020-11-02T17:41:29Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider noiseless, smooth, convex objective functions for which we can evaluate the function but not its gradient. This class of functions is quite common in practice; for instance, such functions appear in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
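<br />
The two inequalities of Definition 1 can also be checked numerically. The following is a minimal sketch, assuming a diagonal quadratic <math display="inline">f(x) = \frac{1}{2} \sum_i h_i x_i^2</math> whose Hessian eigenvalues <math display="inline">h_i</math> play the roles of <math display="inline">\alpha</math> (smallest) and <math display="inline">\beta</math> (largest); the function name <code>satisfies_definition_1</code> and the sampling range are hypothetical choices made for illustration.<br />

```python
import random


def satisfies_definition_1(eigenvalues, trials=1000, seed=0):
    """Empirically check both inequalities of Definition 1 for the quadratic
    f(x) = 0.5 * sum(h_i * x_i**2), with alpha = min(h) and beta = max(h)."""
    alpha, beta = min(eigenvalues), max(eigenvalues)
    rng = random.Random(seed)
    n = len(eigenvalues)

    def f(x):
        return 0.5 * sum(h * v * v for h, v in zip(eigenvalues, x))

    def grad(x):
        return [h * v for h, v in zip(eigenvalues, x)]

    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        diff = [b - a for a, b in zip(x, y)]
        sq_dist = sum(d * d for d in diff)
        # Gap between f(y) and its first-order approximation around x.
        gap = f(y) - f(x) - sum(g * d for g, d in zip(grad(x), diff))
        tol = 1e-9  # slack for floating-point rounding
        if gap < 0.5 * alpha * sq_dist - tol or gap > 0.5 * beta * sq_dist + tol:
            return False
    return True
```

For such a quadratic the gap equals <math display="inline">\frac{1}{2} \sum_i h_i (y_i - x_i)^2</math>, so it is sandwiched between the two quadratic bounds, which is the picture the figure conveys.<br />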
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but we note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, a large part of the literature on this approach focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform binary search over several concentric circles and randomly sample points, in the hopes that if we take a small step in a random direction this will reduce the value of the objective function.<br />
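<br />
The search described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' exact pseudocode (which is only given in the figure): it assumes uniform sampling from balls whose radii halve from a maximum to a minimum radius, and the names <code>sample_in_ball</code>, <code>gradientless_descent</code>, <code>r_min</code>, <code>r_max</code> and <code>steps</code> are hypothetical.<br />

```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw a point uniformly at random from the ball B(center, radius)."""
    n = len(center)
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling by U^(1/n) makes the point uniform in the ball,
    # not just uniform on the sphere.
    scale = radius * rng.random() ** (1.0 / n) / norm
    return [c + scale * d for c, d in zip(center, direction)]


def gradientless_descent(f, x0, r_min, r_max, steps, seed=0):
    """Sketch of a GLD-style search using only function evaluations.

    At every step, one random candidate is drawn per radius on a log-scale
    grid r_max, r_max/2, ..., down to about r_min, and the best point seen
    (including the current one) is kept.
    """
    rng = random.Random(seed)
    n_radii = int(math.ceil(math.log2(r_max / r_min))) + 1
    x, fx = list(x0), f(x0)
    for _ in range(steps):
        best, f_best = x, fx
        for k in range(n_radii):
            y = sample_in_ball(x, r_max * 2.0 ** (-k), rng)
            fy = f(y)
            if fy < f_best:
                best, f_best = y, fy
        x, fx = best, f_best
    return x, fx
```

Keeping the current point as a candidate means the objective value never increases, and trying every radius on the grid avoids having to know the right step size in advance.<br />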
<br />
===Proof of correctness===<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
For any <math display="inline">x \in K</math> such that <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice how the second theorem implies that, with probability at least a quarter, <math display="inline">f(y)</math> is closer to the optimum than <math display="inline">f(x)</math> is.<br />
<br />
For proofs of these theorems, please watch my talk.<br />
<br />
==Results==<br />
<br />
We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Random Search (ARS) algorithm proposed in 2018.<br />
<br />
[[File:GLDBeatsARS.PNG|1000px|]]<br />
<br />
For this comparison, we defined the function <math display="inline">f(x) = \frac{1}{2} x^T H x </math> where <math display="inline">H</math> is a diagonal matrix with eigenvalues linearly interpolating the interval <math display="inline">[\alpha , \beta]</math>. We observe that in most scenarios, GradientLess Descent beats the benchmark.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:GLDBeatsARS.PNG&diff=43140File:GLDBeatsARS.PNG2020-11-02T17:36:16Z<p>Jlavilez: </p>
<hr />
<div></div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=43139GradientLess Descent2020-11-02T17:36:05Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
In this presentation, we are interested in minimising a smooth convex function without ever computing its derivatives.<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider noiseless, smooth, convex objective functions for which we can evaluate the function but not its gradient. This class of functions is quite common in practice; for instance, such functions appear in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but we note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, a large part of the literature on this approach focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform binary search over several concentric circles and randomly sample points, in the hopes that if we take a small step in a random direction this will reduce the value of the objective function.<br />
<br />
===Proof of correctness===<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
For any <math display="inline">x \in K</math> such that <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice how the second theorem implies that, with probability at least a quarter, <math display="inline">f(y)</math> is closer to the optimum than <math display="inline">f(x)</math> is.<br />
<br />
For proofs of these theorems, please watch my talk.<br />
<br />
==Results==<br />
<br />
We compare the GradientLess Descent algorithm to a benchmark established by the Augmented Random Search (ARS) algorithm proposed in 2018.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=43138GradientLess Descent2020-11-02T17:34:21Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider noiseless, smooth, convex objective functions for which we can evaluate the function but not its gradient. This class of functions is quite common in practice; for instance, such functions appear in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but we note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, a large part of the literature on this approach focuses on approximating gradients and then using first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves a <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform binary search over several concentric circles and randomly sample points, in the hopes that if we take a small step in a random direction this will reduce the value of the objective function.<br />
<br />
==Proof of correctness==<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.<br />
<br />
<br />
Using this theorem about random searches in high dimensions, we can prove the correctness of our algorithm.<br />
<br />
'''Theorem 2'''<br />
<br />
For any <math display="inline">x \in K</math> such that <math display="inline">\frac{3}{5Q} ||x - x^*|| \in [C_1, C_2]</math>, we can find integers <math display="inline">0 \leq k_1, k_2 < \log \frac{C_2}{C_1}</math> such that if <math display="inline">r = 2^{k_1}C_1</math> or <math display="inline">r = 2^{-k_2}C_2</math>, then a sample <math display="inline">y</math> from the uniform distribution on <math display="inline">B_x = B\left( x, \frac{r}{\sqrt{n}} \right) </math> satisfies<br />
\begin{align*}<br />
f(y) - f(x^*) \leq (f(x) - f(x^*)) \left( 1- \frac{1}{5nQ} \right)<br />
\end{align*}<br />
with probability at least <math display="inline">\frac{1}{4}</math>.<br />
<br />
<br />
Notice how the second theorem implies that, with probability at least a quarter, <math display="inline">f(y)</math> is closer to the optimum than <math display="inline">f(x)</math> is.<br />
<br />
For proofs of these theorems, please watch my talk.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=43134GradientLess Descent2020-11-02T17:25:57Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider noiseless, smooth, convex objective functions for which we can evaluate the function but not its gradient. This class of functions is quite common in practice; for instance, such functions appear in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
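<br />
As a worked example (an illustration added here, not taken from the paper), consider the quadratic <math display="inline">f(x) = \frac{1}{2} x^\top A x</math>, where <math display="inline">A</math> is assumed to be a symmetric positive definite matrix. Since <math display="inline">\nabla f(x) = Ax</math>, expanding <math display="inline">f(y)</math> around <math display="inline">x</math> gives the exact identity<br />
\begin{align*}
f(y) = f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{1}{2} (y-x)^\top A (y-x).
\end{align*}
Because <math display="inline">\lambda_{\min}(A) ||v||^2 \leq v^\top A v \leq \lambda_{\max}(A) ||v||^2</math> for every <math display="inline">v</math>, both inequalities of Definition 1 hold with <math display="inline">\alpha = \lambda_{\min}(A)</math> and <math display="inline">\beta = \lambda_{\max}(A)</math>, which are exactly the extreme eigenvalues of the Hessian <math display="inline">Hf = A</math>.<br />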
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth, with <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but note that their proofs are quite elementary.<br />
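
The definitions above can be checked numerically on a toy problem. The following is a minimal sketch, not code from the paper: it assumes a hypothetical diagonal quadratic <math display="inline">f(x) = \frac{1}{2} \sum_i d_i x_i^2</math>, whose Hessian is the diagonal matrix with entries <math display="inline">d_i</math>, so that <math display="inline">\alpha = \min_i d_i</math>, <math display="inline">\beta = \max_i d_i</math> and <math display="inline">Q = \beta / \alpha</math>. The names <code>DIAG</code>, <code>quadratic_bound</code> and <code>bounds_hold</code> are illustrative only.

```python
import random

# Hypothetical example: f(x) = 0.5 * sum(d_i * x_i^2), whose Hessian is diag(DIAG).
DIAG = [1.0, 4.0, 10.0]


def f(x):
    return 0.5 * sum(d * xi * xi for d, xi in zip(DIAG, x))


def grad_f(x):
    # The gradient is used only to verify Definition 1, not to optimise.
    return [d * xi for d, xi in zip(DIAG, x)]


def quadratic_bound(x, y, curvature):
    """Return f(x) + <grad f(x), y - x> + (curvature / 2) * ||y - x||^2."""
    diff = [yi - xi for xi, yi in zip(x, y)]
    linear = sum(gi * di for gi, di in zip(grad_f(x), diff))
    return f(x) + linear + 0.5 * curvature * sum(di * di for di in diff)


# For a diagonal Hessian the eigenvalues are the diagonal entries.
alpha = min(DIAG)  # strong convexity parameter
beta = max(DIAG)   # smoothness parameter
Q = beta / alpha   # condition number


def bounds_hold(trials=1000, seed=0, tol=1e-9):
    """Check both inequalities of Definition 1 at random pairs (x, y)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5, 5) for _ in DIAG]
        y = [rng.uniform(-5, 5) for _ in DIAG]
        lower = quadratic_bound(x, y, alpha)
        upper = quadratic_bound(x, y, beta)
        if not (lower - tol <= f(y) <= upper + tol):
            return False
    return True


print(alpha, beta, Q, bounds_hold())
```

A random check of this kind can only fail to find a counterexample; it does not prove the inequalities, which for this quadratic follow from the eigenvalue bounds on the Hessian.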
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, much of the work in this area focuses on approximating gradients and then applying first-order optimisation algorithms.<br />
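
To make the gradient-approximation approach concrete, here is a minimal sketch of one standard technique, central finite differences followed by plain gradient descent. It is an illustration of the general idea only, not a method from the paper; the function names, the step size <code>h</code> and the toy objective are assumptions made for this example.

```python
def fd_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient, using 2n evaluations of f."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def gradient_descent_fd(f, x0, step, iterations):
    """First-order descent driven only by function evaluations."""
    x = list(x0)
    for _ in range(iterations):
        g = fd_gradient(f, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def toy_objective(x):
    # Hypothetical smooth, strongly convex objective with minimiser (1, -3).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 3.0) ** 2


print(fd_gradient(toy_objective, [0.0, 0.0]))
print(gradient_descent_fd(toy_objective, [0.0, 0.0], step=0.1, iterations=200))
```

Each gradient estimate costs <math display="inline">2n</math> function evaluations in dimension <math display="inline">n</math>, which is part of the motivation for methods that avoid estimating the gradient altogether.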
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves an <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithm is given in the picture below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
Observe that at each step, we perform a binary search over the radii of several concentric spheres around the current point and randomly sample points at those radii, in the hope that a step of a suitable size in a random direction will reduce the value of the objective function.<br />
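
The step described above can be sketched in code. This is a minimal sketch based only on the prose description, since the algorithm figure is not reproduced in text; it is not the authors' implementation. It assumes that the candidate radii are halved from a hypothetical <code>max_radius</code> down to <code>min_radius</code>, that one point is sampled uniformly from the ball of each radius, and that the iterate moves to the best point found (or stays put). It also ignores the constraint set <math display="inline">K</math>, so no projection is performed. All names are illustrative.

```python
import math
import random


def sample_unit_ball(n, rng):
    """Sample a point uniformly from the unit ball in R^n."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    # A Gaussian direction is uniform on the sphere; U^(1/n) fixes the radius law.
    scale = rng.random() ** (1.0 / n) / norm
    return [gi * scale for gi in g]


def gld_search(f, x0, max_radius, min_radius, iterations, seed=0):
    """Gradient-free descent: try geometrically shrinking radii at every step."""
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    # Number of radii max_radius, max_radius / 2, ... needed to reach min_radius.
    num_radii = int(math.ceil(math.log2(max_radius / min_radius))) + 1
    for _ in range(iterations):
        best, f_best = x, fx
        for k in range(num_radii):
            radius = max_radius * 2.0 ** (-k)
            v = sample_unit_ball(len(x), rng)
            y = [xi + radius * vi for xi, vi in zip(x, v)]
            fy = f(y)
            if fy < f_best:
                best, f_best = y, fy
        # Keep the current point unless a sampled point is strictly better.
        x, fx = best, f_best
    return x, fx


def ill_conditioned(x):
    # Hypothetical test objective with condition number 10.
    return x[0] ** 2 + 10.0 * x[1] ** 2


x_final, f_final = gld_search(ill_conditioned, [3.0, -2.0], 4.0, 1e-3, 200)
print(x_final, f_final)
```

Because the current point is always among the candidates, the objective value never increases from one step to the next in this sketch; only comparisons between function values are used, never gradients.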
<br />
==Proof of correctness==<br />
<br />
The correctness of this algorithm hinges on two observations. The first one is about the volume of the intersection of high-dimensional balls; we call this intersection a hyperspherical cap.<br />
<br />
'''Theorem 1'''<br />
<br />
Let <math display="inline">B_1, B_2 \subseteq \mathbb{R}^n</math> be balls of radii <math display="inline">r_1, r_2</math>. Let <math display="inline">\ell</math> be the distance between the centres. If <math display="inline">r_1 \in \left[ \frac{\ell}{2 \sqrt{n}} , \frac{\ell}{\sqrt{n}} \right]</math> and <math display="inline">r_2 \geq \ell - \frac{\ell}{4n}</math>, then <math display="inline">\lambda (B_1 \cap B_2) \geq c_n \lambda (B_1)</math>, where <math display="inline">c_n \geq \frac{1}{4}</math>.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42994GradientLess Descent2020-10-31T23:56:07Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider smooth convex objective functions with noiseless evaluations, where we have access to function values but not to gradients. This setting is quite common in practice; for instance, it arises in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>.<br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth, with <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of an <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, much of the work in this area focuses on approximating gradients and then applying first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves an <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
The proposed algorithms are given in the pictures below.<br />
<br />
[[File:GLD1.PNG|frame|Gradientless Descent with Binary Search.]]<br />
<br />
[[File:GLD2.PNG|frame|Gradientless Descent with Fast Binary Search.]]<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:GLD2.PNG&diff=42993File:GLD2.PNG2020-10-31T23:52:58Z<p>Jlavilez: </p>
<hr />
<div></div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:GLD1.PNG&diff=42992File:GLD1.PNG2020-10-31T23:52:46Z<p>Jlavilez: </p>
<hr />
<div></div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42991GradientLess Descent2020-10-31T23:47:26Z<p>Jlavilez: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of a <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, much of the work in this area focuses on approximating gradients and then applying first-order optimisation algorithms.<br />
<br />
This paper presents a purely gradientless algorithm, proposes a geometric approach to analyse the algorithm, and proves an <math display="inline">O( k Q \log (n / \epsilon ))</math> convergence bound.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42990GradientLess Descent2020-10-31T23:44:32Z<p>Jlavilez: /* Zeroth-Order Optimisation */</p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of a <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
In zeroth-order optimisation, we are interested in minimising a function without computing its derivatives. This is important in many practical applications in which derivatives may not be available or they may be difficult to compute, such as:<br />
<br />
* Combinatorial (i.e. discrete) optimisation<br />
* Instances of non-analytic loss functions (e.g. hyperparameter tuning)<br />
* Adversarial attacks<br />
* Reinforcement learning<br />
<br />
Curiously, much of the work in this area focuses on approximating gradients and then applying first-order optimisation algorithms.<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42985GradientLess Descent2020-10-31T23:00:35Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds.<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
In convex analysis, one usually says that a function has condition number <math display="inline">Q</math> if it is both <math display="inline">\alpha</math>-strongly convex, and <math display="inline">\beta</math>-smooth, and <math display="inline">\frac{\beta}{\alpha} \leq Q</math>.<br />
The authors of this paper consider the more general case where <math display="inline">f</math> is a monotone transformation of a <math display="inline">\alpha</math>-strongly convex and <math display="inline">\beta</math>-smooth function; for simplicity and transparency, we shall not consider these extensions here, but shall note that their proofs are quite elementary.<br />
<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42977GradientLess Descent2020-10-31T22:21:33Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds:<br />
<br />
[[File:ConvexSmooth.PNG|frame|Relationship between convexity and smoothness.]]<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42976GradientLess Descent2020-10-31T22:20:54Z<p>Jlavilez: /* Introduction */</p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>. Further intuition can be gained from the image below, showing how such a function can be contained within quadratic bounds:<br />
<br />
[[File:ConvexSmooth.PNG|thumb|Relationship between convexity and smoothness.]]<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=User:Jlavilez&diff=42975User:Jlavilez2020-10-31T22:17:38Z<p>Jlavilez: Created page with "Visit my UWaterloo Scholar website: [https://uwaterloo.ca/scholar/jlavilez/]"</p>
<hr />
<div>Visit my UWaterloo Scholar website: [https://uwaterloo.ca/scholar/jlavilez/]</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:ConvexSmooth.PNG&diff=42974File:ConvexSmooth.PNG2020-10-31T22:16:41Z<p>Jlavilez: Relationship between convexity and smoothness.</p>
<hr />
<div>Relationship between convexity and smoothness.</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42973GradientLess Descent2020-10-31T22:08:41Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this is simply equivalent to the eigenvalues of the Hessian matrix <math display="inline">Hf</math> being bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>.<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42972GradientLess Descent2020-10-31T22:07:15Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider convex smooth objective noiseless functions, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, they make special appearances in the reinforcement learning literature.<br />
<br />
To be even more precise, in our context we let <math display="inline">K \subseteq \mathbb{R}^n</math> be compact and let <math display="inline">f : K \to \mathbb{R}</math> be <math display="inline">\beta</math>-smooth and <math display="inline">\alpha</math>-strongly convex.<br />
<br />
'''Definition 1'''<br />
<br />
A convex continuously differentiable function <math display="inline">f : K \to \mathbb{R}</math> is <math display="inline">\alpha</math>-strongly convex for <math display="inline">\alpha > 0</math> if <br />
\begin{align*}<br />
f(y) \geq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\alpha}{2} ||y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math>. It is called <math display="inline">\beta</math>-smooth for <math display="inline">\beta > 0</math> if <br />
\begin{align*}<br />
f(y) \leq f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\beta}{2} || y - x||^2<br />
\end{align*}<br />
for all <math display="inline"> x,y \in K </math><br />
<br />
<br />
We remark that if <math display="inline">f</math> is twice continuously differentiable, this simply means that the eigenvalues of the Hessian matrix <math display="inline">Hf</math> are bounded between <math display="inline">\alpha</math> and <math display="inline">\beta</math>.<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42971GradientLess Descent2020-10-31T21:24:53Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider smooth convex objective functions evaluated without noise, where we have access to function computations but not gradient computations. This class of functions is quite common in practice; for instance, it appears in the reinforcement learning literature.<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42970GradientLess Descent2020-10-31T21:23:51Z<p>Jlavilez: </p>
<hr />
<div>==Introduction==<br />
<br />
===Motivation and Set-up===<br />
<br />
A general optimisation question can be formulated by asking to minimise an objective function <math display="inline">f : \mathbb{R}^n \to \mathbb{R}</math>, which means finding:<br />
\begin{align*}<br />
x^* = \mathrm{argmin}_{x \in \mathbb{R}^n} f(x)<br />
\end{align*} <br />
<br />
Depending on the nature of <math display="inline">f</math>, different settings may be considered:<br />
<br />
* Convex vs non-convex objective functions;<br />
* Differentiable vs non-differentiable objective functions;<br />
* Allowed function or gradient computations;<br />
* Noisy/Stochastic oracle access.<br />
<br />
For the purpose of this paper, we consider smooth convex objective functions evaluated without noise, where we have access to function computations but not gradient computations.<br />
<br />
===Zeroth-Order Optimisation===<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=42965stat940F212020-10-31T20:21:59Z<p>Jlavilez: </p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness And Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| STRUCTBERT: INCORPORATING LANGUAGE STRUCTURES INTO PRETRAINING FOR DEEP LANGUAGE UNDERSTANDING || [https://openreview.net/pdf?id=BJgQ4lSFPH] || ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Probabilistic Model-Agnostic Meta-Learning || [http://papers.nips.cc/paper/8161-probabilistic-model-agnostic-meta-learning.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Boosting Few-Shot Visual Learning with Self-Supervision || https://openaccess.thecvf.com/content_ICCV_2019/papers/Gidaris_Boosting_Few-Shot_Visual_Learning_With_Self-Supervision_ICCV_2019_paper.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=GradientLess_Descent&diff=42964GradientLess Descent2020-10-31T20:21:41Z<p>Jlavilez: Created page with "==Introduction== ==Motivation and Set-up== ==Zeroth-Order Optimisation== ==GradientLess Descent Algorithm== ==Proof of correctness=="</p>
<hr />
<div>==Introduction==<br />
<br />
==Motivation and Set-up==<br />
<br />
==Zeroth-Order Optimisation==<br />
<br />
==GradientLess Descent Algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946F20/GradientLess_Descent&diff=42963stat946F20/GradientLess Descent2020-10-31T20:17:43Z<p>Jlavilez: Created page with "== Introduction == ==Motivation and Setup== ==Zeroth-Order Optimisation== ==GradientLess Descent algorithm== ==Proof of correctness=="</p>
<hr />
<div>== Introduction ==<br />
<br />
==Motivation and Setup==<br />
<br />
==Zeroth-Order Optimisation==<br />
<br />
==GradientLess Descent algorithm==<br />
<br />
==Proof of correctness==</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=42836stat940F212020-10-26T18:42:24Z<p>Jlavilez: </p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || ||<br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness And Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || ||<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || https://openreview.net/pdf?id=H1eA7AEtvS || ||<br />
|-<br />
|Week of Nov 2 ||John Edwards || 4||Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge ||[https://papers.nips.cc/paper/8375-learn-imagine-and-create-text-to-image-generation-from-prior-knowledge.pdf Paper] || ||<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| STRUCTBERT: INCORPORATING LANGUAGE STRUCTURES INTO PRETRAINING FOR DEEP LANGUAGE UNDERSTANDING || [https://openreview.net/pdf?id=BJgQ4lSFPH] || ||<br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || ||<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html || ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || ||<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Probabilistic Model-Agnostic Meta-Learning || [http://papers.nips.cc/paper/8161-probabilistic-model-agnostic-meta-learning.pdf Paper] || ||<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || ||<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| EMPIRICAL STUDIES ON THE PROPERTIES OF LINEAR REGIONS IN DEEP NEURAL NETWORKS || [https://openreview.net/pdf?id=SkeFl1HKwr Paper] || ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || ||<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Incorporating BERT into Neural Machine Translation || [https://iclr.cc/virtual_2020/poster_Hyl7ygStwB.html Paper] || ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| Sparse Convolutional Neural Networks || [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Liu_Sparse_Convolutional_Neural_2015_CVPR_paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Boosting Few-Shot Visual Learning with Self-Supervision || https://openaccess.thecvf.com/content_ICCV_2019/papers/Gidaris_Boosting_Few-Shot_Visual_Learning_With_Self-Supervision_ICCV_2019_paper.pdf || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||<br />
|-</div>Jlavilezhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=F21-STAT_940-Proposal&diff=42698F21-STAT 940-Proposal2020-10-09T18:47:39Z<p>Jlavilez: </p>
<hr />
<div>Use this format (Don’t remove Project 0)<br />
<br />
Project # 0 Group members:<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Last name, First name<br />
<br />
Title: Making a String Telephone<br />
<br />
Description: We use paper cups to make a string phone and talk with friends while learning about sound waves with this science project. (Explain your project in one or two paragraphs).<br />
<br />
<br />
<br />
<br />
Project # 1 Group members:<br />
<br />
McWhannel, Pierre<br />
<br />
Yan, Nicole<br />
<br />
Hussein Salamah, Ahmed <br />
<br />
Title: Dense Retrieval for Conversational Information Seeking <br />
<br />
Description:<br />
Conversational search is a recognized problem in Information Retrieval (IR) that has attracted much attention through conversational assistants such as Alexa, Siri and Cortana. The ultimate goal of a conversational search system is to satisfy the user's information need; in this setting, questions are asked sequentially, which gives the Conversational Information Seeking (CIS) task a multi-turn format. The TREC Conversational Assistance Track (CAsT) [3] is a multi-turn conversational search task that provides a large-scale reusable test collection of conversational query sequences. The response of the conversational model is not a list of relevant documents; it is limited to brief response passages of 1 to 3 sentences.<br />
<br />
[[File:Screen Shot 2020-10-09 at 1.33.00 PM.png | 300px | Example Queries in CAsT]]<br />
<br />
In [4], the authors focus on improving open-domain question answering by using dense representations for retrieval instead of traditional methods. They adopt a simple dual-encoder framework to construct a learnable retriever over large collections. We want to adopt this dense representation for the conversational model in the CAsT task and compare its performance with that of other approaches in the literature. Performance will be measured using graded relevance on a five-point scale: Fails to meet, Slightly meets, Moderately meets, Highly meets, and Fully meets.<br />
<br />
We aim to further improve our system performance by integrating the following techniques:<br />
<br />
• Paragraph-level pre-training tasks: ICT, BFS, and WLP [1]<br />
<br />
• ANCE training: periodically using checkpoints to encode documents, from which the strong negatives close to the relevant document would be used as next training negatives [5]<br />
<br />
In summary, this project is exploratory in nature: we will try to use state-of-the-art Dense Passage Retrieval (DPR) techniques (based on BERT) [4, 6] in a question answering (QA) problem. Current first-stage retrieval approaches mainly rely on bag-of-words models. In this project, we hope to explore the feasibility of using state-of-the-art methods such as BERT. We will first compare how these perform on the TREC CAsT datasets [3] against the results retrieved using BM25. After these first points of comparison, we will explore one or more techniques designed to improve the performance of DPR [1, 5].<br />
<br />
References<br />
<br />
[1] Wei-Cheng Chang et al. Pre-training Tasks for Embedding-based Large-scale Retrieval. 2020. arXiv: 2002.03932 [cs.LG].<br />
<br />
[2] Zhuyun Dai and Jamie Callan. Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval. 2019. arXiv: 1910.10687 [cs.IR].<br />
<br />
[3] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. TREC CAsT 2019: The Conversational Assistance Track Overview. 2020. arXiv: 2003.13624 [cs.IR].<br />
<br />
[4] Vladimir Karpukhin et al. Dense Passage Retrieval for Open-Domain Question Answering. 2020. arXiv: 2004.04906 [cs.CL].<br />
<br />
[5] Lee Xiong et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. 2020. arXiv: 2007.00808 [cs.IR].<br />
<br />
[6] Jingtao Zhan et al. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. 2020. arXiv: 2006.15498 [cs.IR].<br />
<br />
<br />
<br />
Project # 2 Group members:<br />
<br />
Singh, Gursimran<br />
<br />
Sharma, Govind<br />
<br />
Chanana, Abhinav<br />
<br />
Title: Quick Text Description using Headline Generation and Text To Image Conversion<br />
<br />
Description: An automatic tool that generates a short description from long textual data is a useful way to share information quickly. Most current approaches summarize the text using deep learning methods ranging from Transformers to various RNNs. For this project, instead of building a standard text summarizer, we aim to provide two separate utilities for generating a quick description of the text. First, we plan to develop a model that produces a headline for the long textual data, and second, we intend to generate an image describing the text. <br />
<br />
Headline Generation - Headline generation is a specific case of text summarization where the output is generally a few words that convey the gist of the text. In most cases, text summarization is an unsupervised learning problem. For headline generation, however, the original headlines are available in our training dataset, which makes it a supervised learning task. We plan to experiment with different Recurrent Neural Networks such as LSTMs and GRUs with varied architectures. For model evaluation, we are considering BERTScore, which lets us compare the reference headline with the headline generated by the model. We also aim to explore attention models for the text (headline) generation. We will make use of the currently available techniques described in various research papers, but will also try to develop our own architecture if the previous methods do not give reliable results on our dataset. Therefore, this task would primarily fit under the category of application of deep learning to a particular domain, but could also include some components of new algorithm design.<br />
<br />
Text to Image Conversion - Generation or synthesis of images from a short text description is another very interesting application domain in deep learning. One approach for image generation is based on mapping image pixels to specific features as described by the discriminative feature representation of the text. Recurrent Neural Networks have been successfully used in learning such feature representations of text. This approach is difficult to generalize because the recognition of discriminative features for texts in different domains is not an easy task and it requires domain expertise. Different generative methods have been used including Variational Recurrent Auto-Encoders and its extension in Deep Recurrent Attention Writer (DRAW). We plan to experiment with Generative Adversarial Networks (GAN). Application of GANs on domain-specific datasets has been done but we aim to apply different variants of GANs on the Microsoft COCO dataset which has been used in other architectures. The analysis will be focusing on how well GANs are able to generalize when compared to other alternatives on the given dataset.<br />
<br />
Scope - The above models will be trained independently on different datasets. Therefore, for a particular text, only one of the two functionalities will be available.<br />
<br />
<br />
<br />
Project # 3 Group members:<br />
<br />
Sikri, Gaurav<br />
<br />
Bhatia, Jaskirat<br />
<br />
Title: Not decided yet (Placeholder)<br />
<br />
Description: Not decided yet :)<br />
<br />
<br />
Project # 4 Group members:<br />
<br />
Maleki, Danial<br />
<br />
Rasoolijaberi, Maral<br />
<br />
Title: Binary Deep Neural Network for the domain of Pathology<br />
<br />
Description: Binary neural networks, which greatly reduce storage and computation, are a promising technique for deploying deep models on resource-limited devices. However, binarization inevitably causes severe information loss and, even worse, its discontinuity makes the deep network difficult to optimize. We want to investigate the possibility of using these networks in histopathology, where gigapixel images make such savings especially valuable.<br />
<br />
<br />
Project # 5 Group members:<br />
<br />
Jain, Abhinav<br />
<br />
Bathla, Gautam<br />
<br />
Title: lyft-motion-prediction-autonomous-vehicles(Kaggle)(Tentative)<br />
<br />
Description: Autonomous vehicles (AVs) are expected to dramatically redefine the future of transportation. However, there are still significant engineering challenges to be solved before one can fully realize the benefits of self-driving cars. One such challenge is building models that reliably predict the movement of traffic agents around the AV, such as cars, cyclists, and pedestrians.<br />
<br />
Comments: We are more inclined towards a 3-D object detection project. We are in the process of finding the right problem statement for it and if we are not successful, we will continue with the above Kaggle competition.<br />
<br />
<br />
Project # 6 Group members:<br />
<br />
You, Bowen<br />
<br />
Avilez, Jose<br />
<br />
Mahmoud, Mohammad<br />
<br />
Wu, Mohan<br />
<br />
Title: Deep Learning Models in Volatility Forecasting<br />
<br />
Description: Price forecasting has become a very hot topic in the financial industry in recent years. We are however very interested in the volatility of such financial instruments. We propose a new deep learning architecture or model to predict volatility and apply our model to real life datasets of various financial products. We will analyze our results and compare them to more traditional methods.<br />
<br />
<br />
Project # 7 Group members:<br />
<br />
Chen, Meixi<br />
<br />
Shen, Wenyu<br />
<br />
Title: Through the Lens of Probability Theory: A Comparison Study of Bayesian Deep Learning Methods<br />
<br />
Description: Deep neural networks are known as black box models, but they can be made less mysterious by adopting a Bayesian approach. From a Bayesian perspective, one can assign uncertainty to the weights instead of having single point estimates, which allows for better interpretability of deep learning models. However, Bayesian deep learning (BDL) methods are often intractable due to an increased number of parameters and often do not perform as well. In this project, we will study different BDL methods such as Bayesian CNNs using variational inference and Laplace approximation, with applications to image classification, and we will try to propose improvements where possible.<br />
<br />
<br />
Project # 8 Group members:<br />
<br />
Avilez, Jose<br />
<br />
Title: A functional universal approximation theorem<br />
<br />
Description: In the seminal paper "Approximation by superpositions of a sigmoidal function", Cybenko gave a simple proof using elementary functional analysis that a certain class of functions, called discriminatory functions, serve as valid activation functions for universal neural approximators. The objective of our project is three-fold:<br />
<br />
1) Prove a converse of Cybenko's Universal Approximation Theorem by means of the Stone-Weierstrass theorem<br />
<br />
2) Provide examples and non-examples of Cybenko's discriminatory functions<br />
<br />
3) Construct a neural network for functional data (i.e. data arising in function spaces) and prove a universal approximation theorem for Lp spaces.<br />
<br />
References:<br />
<br />
[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303-314.<br />
<br />
[2] Folland, Gerald B. Real analysis: modern techniques and their applications. Vol. 40. John Wiley & Sons, 1999.<br />
<br />
[3] Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences, 4.</div>Jlavilez