CRITICAL ANALYSIS OF SELF-SUPERVISION (2020-12-01, revision by Cfmeaney)
<hr />
<div>== Presented by == <br />
Maral Rasoolijaberi<br />
<br />
== Introduction ==<br />
<br />
This paper evaluates the performance of state-of-the-art self-supervised methods for learning the weights of convolutional neural networks (CNNs), analyzed on a per-layer basis. The authors were motivated by the observation that the low-level features in a network's first layers may not require the high-level semantic information captured by manual labels. The paper also aims to determine whether current self-supervision techniques can learn deep features from only one image. <br />
<br />
The main goal of self-supervised learning is to take advantage of a vast amount of unlabeled data to train CNNs and find a generalized image representation. <br />
In self-supervised learning, unlabeled data generate their own ground-truth labels through pretext tasks such as the jigsaw puzzle task [6] and rotation estimation [3]. For example, in the rotation task, we have a picture of a bird without the label "bird". We rotate the bird image by, say, 90 degrees clockwise, and the CNN is trained to predict which rotation was applied, as shown in the figure below.<br />
<br />
[[File:self-sup-rotation.png|700px|center]]<br />
<br />
[[File:intro.png|500px|center]]<br />
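As a concrete sketch of this pretext task (hypothetical toy code, not the authors' implementation), rotating a tiny image and using the rotation index as the free label might look like:<br />

```python
import random

def rotate90(img, k):
    """Rotate a 2D image (list of rows) by k * 90 degrees clockwise."""
    for _ in range(k):
        img = [list(row) for row in zip(*img[::-1])]
    return img

def make_rotation_example(img):
    """Pick a random rotation; the rotation index itself is the label."""
    k = random.randrange(4)          # 0, 90, 180 or 270 degrees
    return rotate90(img, k), k       # (input, self-generated label)

img = [[1, 2],
       [3, 4]]
x, y = make_rotation_example(img)    # no manual annotation needed
```

The supervision signal (the rotation index) is generated automatically, which is exactly what makes the task "self-supervised".<br />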
<br />
== Previous Work ==<br />
<br />
In recent literature, several papers addressed self-supervised learning methods and learning from a single sample.<br />
<br />
A BiGAN [2], or Bidirectional GAN, is a generative adversarial network augmented with an encoder. The generator maps latent samples to generated data, and the encoder learns the inverse of the generator. After training a BiGAN, the encoder has learned to produce a rich image representation. In the RotNet method [3], images are rotated and the CNN learns to predict the rotation applied. DeepCluster [4] alternates between k-means clustering of the learned features and using the resulting cluster assignments as pseudo-labels to train the CNN, yielding feature representations that are stable under several image transformations.<br />
<br />
== Method & Experiment ==<br />
<br />
In this paper, BiGAN, RotNet, and DeepCluster are employed to train AlexNet in a self-supervised manner.<br />
To evaluate the impact of the size of the training set, the authors compared results obtained with a million images from the ImageNet dataset against a million augmented images generated from a single image. Various data augmentation methods, including cropping, rotation, scaling, contrast changes, and added noise, were used to generate this artificial dataset from one image. <br />
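The augmentation pipeline can be sketched as follows (a hypothetical toy version with made-up pixel values; the paper uses far richer transformations):<br />

```python
import random

def random_crop(img, size):
    """Take a random size x size crop from a 2D image (list of rows)."""
    h, w = len(img), len(img[0])
    top = random.randrange(h - size + 1)
    left = random.randrange(w - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def augment(img, size, noise=1):
    """One augmented sample: random crop, maybe flip, add pixel noise."""
    patch = random_crop(img, size)
    if random.random() < 0.5:                      # horizontal flip
        patch = [row[::-1] for row in patch]
    return [[p + random.randint(-noise, noise) for p in row] for row in patch]

source = [[r * 4 + c for c in range(4)] for r in range(4)]  # the single image
dataset = [augment(source, 2) for _ in range(1000)]  # the paper scales this to ~1M
```

One image thereby yields an arbitrarily large, if highly correlated, training set.<br />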
<br />
To measure the quality of deep features on a per-layer basis, a linear classifier is trained on top of each convolutional layer of AlexNet. Linear classifier probes are commonly used to monitor the features at every layer of a CNN and are trained entirely independently of the CNN itself [5]. Note that a main purpose of a CNN is to reach a linearly separable representation of images. Accordingly, the linear probing technique evaluates the training of each layer of a CNN and inspects how much information each layer has learned.<br />
The same experiment was repeated on the CIFAR-10/100 datasets.<br />
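A linear probe can be illustrated with a toy example (assumed setup, not the paper's code): the features stay frozen, and only a logistic-regression classifier on top of them is trained:<br />

```python
import math

def probe_accuracy(features, labels, epochs=200, lr=0.1):
    """Train a logistic-regression probe on frozen features; the probe
    never updates the features themselves."""
    d = len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - y    # dLoss/dz for log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in features]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# Toy frozen "conv features": class 0 near the origin, class 1 near (3, 3).
feats = [[0.1, 0.2], [0.0, 0.3], [3.1, 2.9], [2.8, 3.2]]
labs = [0, 0, 1, 1]
acc = probe_accuracy(feats, labs)
```

The probe's accuracy measures how linearly separable the frozen features are, which is exactly the per-layer quantity the paper reports.<br />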
<br />
== Results ==<br />
<br />
<br />
Figure 2 shows how linearly separable the representations at each layer are. The results table reports the classification accuracy of the linear classifier trained on top of each convolutional layer.<br />
According to the results, training the CNN with self-supervised methods can match the performance of fully supervised learning in the first two convolutional layers. It must be pointed out that only a single image with massive augmentation is utilized in this experiment.<br />
[[File:histo.png|500px|center]]<br />
[[File:table_results_imageNet_SSL_2.png|500px|center]]<br />
<br />
== Source Code ==<br />
<br />
The source code for the paper can be found here: https://github.com/yukimasano/linear-probes<br />
<br />
== Conclusion ==<br />
<br />
In this paper, the authors conduct interesting experiments showing that the first few layers of CNNs contain only limited information for analyzing natural images. They demonstrated this by examining the weights of the early layers when training used only a single image with heavy data augmentation: sufficient augmentation was enough to make up for the lack of data in the early CNN layers. However, this technique could not elicit proper learning in the deeper layers; in fact, even millions of images were not enough to elicit proper learning there without supervision. Accordingly, current unsupervised learning in early layers amounts largely to augmentation, and we probably do not yet exploit the full capacity of a million images.<br />
<br />
== References ==<br />
<br />
<br />
[1] Y. Asano, C. Rupprecht, and A. Vedaldi, “A critical analysis of self-supervision, or what we can learn from a single image,” in International Conference on Learning Representations, 2019.<br />
<br />
[2] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.<br />
<br />
[3] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018.<br />
<br />
[4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.<br />
<br />
[5] G. Alain and Y. Bengio, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016.<br />
<br />
[6] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in ECCV, 2016.</div>

Model Agnostic Learning of Semantic Features (2020-11-21, revision by Cfmeaney)
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning that focuses on storing knowledge gained from one domain (the source domain) and applying it to a similar problem in another domain (the target domain). In addition to regular transfer learning, one can use "transfer metric learning", in which a more robust and discriminative data representation is formed by exploiting similarity relationships between samples [1], [2]. However, both kinds of techniques only work insofar as the domain shift between the source and target domains is negligible. Domain shift is defined as the deviation between the distributions of the source domain and the target domain, and it can cause a DNN model to fail completely. Multi-domain learning (MDL) is the solution when the assumption that the source and target domains come from nearly the same distribution does not hold. There are two variants of MDL in the literature that can be confused, namely domain generalization and domain adaptation; in domain adaptation we have some access to the target-domain data, while in domain generalization we do not. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originating from model-agnostic meta-learning (MAML), episodic training has been widely leveraged for addressing domain generalization [3, 4, 5, 6, 7, 8, 9, 10, 11]. The MLDG method [4] closely follows MAML in back-propagating the gradients from an ordinary task loss on meta-test data, but it has a limitation: using the task objective alone might be sub-optimal since it relies only on class probabilities. Most works [3, 7] in the literature lack notable guidance from the semantics of the feature space, which contains crucial domain-independent "general knowledge" useful for domain generalization. The authors claim that their method is orthogonal to these previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
Model-agnostic meta-learning, a.k.a. "learning to learn", is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over a collection of similar tasks (meta-train and meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test set, the updated weights are calculated using the below algorithm.<br />
[[File:algo1.PNG|500px|center]]<br />
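The inner/outer loop of this algorithm can be illustrated on a toy 1-D problem (hypothetical quadratic tasks chosen so all gradients have closed forms; this is not the paper's setting):<br />

```python
def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05):
    """One episode of MAML. Each task a defines L_a(t) = (t - a)^2, so the
    inner and outer gradients can be written down exactly."""
    meta_grad = 0.0
    for a in tasks:
        grad = 2 * (theta - a)                 # inner gradient (meta-train)
        theta_i = theta - inner_lr * grad      # task-adapted parameters
        # outer gradient through the inner step:
        # d/dtheta L_a(theta_i) = 2 * (theta_i - a) * (1 - 2 * inner_lr)
        meta_grad += 2 * (theta_i - a) * (1 - 2 * inner_lr)
    return theta - outer_lr * meta_grad / len(tasks)

theta = 0.0
for _ in range(100):                            # episodes
    theta = maml_step(theta, tasks=[1.0, 3.0])  # two "similar tasks"
```

The initial weight converges toward 2.0, the point from which both tasks can be adapted to equally quickly; this is the "good initialization" that episodic training seeks.<br />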
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships that hold across domains. In total, MASF (model-agnostic learning of semantic features) is composed of a '''task loss''', a '''global class alignment''' term, and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
Let <math> F_{\psi}: X \rightarrow Z</math> be the feature extractor, where <math> Z </math> is a feature space, and let <math> T_{\theta}: Z \rightarrow \mathbf {R}^{C}</math> be the task network, where <math> C </math> is the number of classes in <math> Y </math>.<br />
Assume that <math>\hat{y}= \mathrm{softmax}(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized by minimizing the cross-entropy loss <math> \mathcal{L}_{task} </math>, formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y}) = - \sum_{c=1}^{C} \mathbf{1}[y=c] \log(\hat{y}_{c}) </math><br />
</div><br />
<br />
Although the task loss yields a decent predictor, nothing prevents the model from overfitting to the source domains and degrading on unseen test domains. The other loss terms are responsible for preventing this.<br />
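For concreteness, the task loss can be written out in a few lines (a generic cross-entropy sketch, independent of any framework):<br />

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def task_loss(y, logits):
    """Cross-entropy: -log of the predicted probability of the true class y."""
    return -math.log(softmax(logits)[y])

loss = task_loss(1, [0.0, 2.0, 0.0])  # confident, correct -> small positive loss
```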
<br />
=== Global class alignment ===<br />
In the semantic space, we assume that relationships exist between class concepts and that these relationships are invariant to changes in the observation domain. Capturing and preserving such class relationships can help models generalize well to unseen data. To achieve this, a global layout is imposed on the extracted features such that their relative locations reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard-label prediction, inter-class alignment across domains is disregarded. Hence, the symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes, is minimized:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D_{j}; \psi^{'}, \theta^{'}) = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{2}\left[ D_{KL}(s_{c}^{(i)} \| s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)} \| s_{c}^{(i)}) \right], </math><br />
</div><br />
The authors state that using a symmetric divergence such as Jensen–Shannon (JS) instead showed no significant difference compared with the symmetrized KL.<br />
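The alignment term can be sketched as follows (hypothetical soft-label distributions; assumes the compared probabilities are non-zero):<br />

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sym_kl(p, q):
    """Symmetrized KL divergence used for global class alignment."""
    return 0.5 * (kl(p, q) + kl(q, p))

def global_loss(soft_i, soft_j):
    """Average symmetric KL between the per-class soft distributions
    s_c^(i), s_c^(j) of two domains i and j."""
    return sum(sym_kl(p, q) for p, q in zip(soft_i, soft_j)) / len(soft_i)

# Two domains, two classes, one soft distribution per class (made-up numbers):
d_i = [[0.9, 0.1], [0.2, 0.8]]
d_j = [[0.8, 0.2], [0.3, 0.7]]
loss = global_loss(d_i, d_j)
```

The loss is zero only when the two domains induce identical soft class distributions, i.e. when the inter-class structure is perfectly aligned.<br />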
<br />
=== Local cluster sampling ===<br />
While <math> L_{global} </math> captures inter-class relationships, we also want semantic features to cluster locally. Explicit metric learning, i.e. contrastive or triplet losses, is used to ensure that the semantic features cluster locally according to class labels alone, regardless of the domain. The contrastive loss takes two samples as input, pulling samples of the same class closer while pushing samples of different classes apart.<br />
[[File: contrastive.png | 400px]]<br />
<br />
Conversely, the triplet loss takes three samples as input: an anchor, a positive, and a negative. It pushes the anchor closer to the positive than to the negative by at least a margin.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1}\! [m\!+\!\|x_{i}\!- \!x_{k}\|_2^2 \!-\! \|x_{i}\!-\!x_{\ell}\|_2^2 ]_+,<br />
</math><br />
</div><br />
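The triplet hinge above can be sketched directly (illustrative 2-D embeddings, not the paper's features):<br />

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on squared distances: the positive must be closer to the anchor
    than the negative is, by at least the margin."""
    d_ap = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_an = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, margin + d_ap - d_an)

# Anchor and positive share a class; the negative is from another class.
loss = triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 2.0])
```

When the negative is already far enough away, the hinge clamps the loss to zero, so only violating triplets produce gradients.<br />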
<br />
== Model agnostic learning of semantic features ==<br />
These losses are combined in an episodic training scheme, shown in the figure below:<br />
[[File:algo2.PNG|700px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method is demonstrated on two common benchmark datasets for domain generalization, VLCS and PACS, alongside a real-world MRI medical image segmentation task. In all experiments, AlexNet with ImageNet pre-trained weights is utilized. <br />
<br />
=== VLCS ===<br />
VLCS [12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16].<br />
Evaluation uses leave-one-domain-out validation, with each domain randomly divided into 70% training and 30% test data.<br />
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG [4] on this dataset (see the table below), indicating that exploiting semantic properties provides superior performance compared with a purely highly-abstracted task loss on meta-test data. "DeepAll" in the table denotes the baseline without domain generalization: it is trained on class labels alone, pooling samples regardless of the domain each one comes from. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark with a significant domain shift is the PACS dataset [17]. It contains art painting, cartoon, photo, sketch domains with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As the table below shows, MASF significantly outperforms the state-of-the-art methods JiGen [18], MLDG [4], and MetaReg [3]. In addition, the largest improvement (6.20%) is achieved when the unseen domain is "sketch", which differs from the other domains the most and therefore requires more general knowledge about semantic concepts.<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baselines, the authors performed additional experiments using deeper residual architectures, namely ResNet-18 and ResNet-50. The table below shows strong and consistent improvements of MASF over the DeepAll baseline on all PACS splits for both network architectures, suggesting that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of MASF is also demonstrated on a segmentation task of MRI images gathered from four different clinical centers, denoted Set-A, Set-B, Set-C, and Set-D. Domain shift in this case arises from differences in hardware, acquisition protocols, and many other factors, hindering the translation of learning-based methods into real clinical practice. The authors segment the brain images into four classes: background, grey matter, white matter, and cerebrospinal fluid. Tasks such as these have an enormous impact on clinical diagnosis and treatment planning; for example, a similar network segmenting healthy brain tissue from tumorous tissue could aid surgeons in brain tumour resection.<br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results show the effectiveness of MASF in comparison with not using domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique was presented that incorporates global and local constraints for learning semantic feature spaces, and it outperforms the state of the art. Its effectiveness was demonstrated on two domain generalization benchmarks and a real clinical dataset (MRI image segmentation). The code is freely available at [19]. As future work, it would be interesting to integrate the proposed loss functions with other methods, as they are orthogonal to each other, and to evaluate the benefit of doing so. Investigating the use of the current learning procedure in the context of generative models would also be an interesting research direction.<br />
<br />
== Critiques ==<br />
<br />
This paper guides learning in the semantic feature space by leveraging local similarity, which the authors argue carries essential domain-independent general knowledge for domain generalization; contrastive and triplet losses are adopted to encourage this local clustering. Robust, domain-independent semantic features can be learned by leveraging across-domain class-similarity information, which is important information during learning. Without domain-invariant separation and class-specific cohesion in the feature space, the learner would suffer from indistinct decision boundaries. A concern that may become apparent on large datasets is that these decision boundaries might still be sensitive to the unseen target domain.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features&diff=45481Model Agnostic Learning of Semantic Features2020-11-21T15:41:55Z<p>Cfmeaney: /* Multi-site Brain MRI image segmentation */</p>
<hr />
<div>== Presented by ==<br />
Milad Sikaroudi<br />
<br />
== Introduction ==<br />
Transfer learning is a line of research in machine learning which focuses on storing knowledge from one domain (source domain) to solve a similar problem in another domain (target domain). In addition to regular transfer learning, one can use "transfer metric learning" in which through utilizing a similarity relationship between samples [1], [2] a more robust and discriminative data representation is formed. However, both of these kinds of techniques work insofar as the domain shift, between source and target domains, is negligible. Domain shift is defined as the deviation in the distribution of the source domain and the target domain and it would cause the DNN model to completely fail. The multi-domain learning is the solution when the assumption of "source domain and target domain comes from an almost same distribution" may not hold. There are two variants of MDL in the literature that can be confused, i.e. domain generalization, and domain adaptation; however in domain adaptation, we have access to the target domain data somehow, while that is not the case in domain generalization. This paper introduces a technique for domain generalization based on two complementary losses that regularize the semantic structure of the feature space through an episodic training scheme originally inspired by the model-agnostic meta-learning.<br />
<br />
== Previous Work ==<br />
<br />
Originated from model-agnostic meta-learning (MAML), episodic training has been vastly leveraged for addressing domain generalization [3, 4, 5, 7, 8, 6, 9, 10, 11]. The method of MLDG [4] closely follows MAML in terms of back-propagating the gradients from an ordinary task loss on meta-test data, but it has its own limitation as the use of the task objective might be sub-optimal since it only uses class probabilities. Most of the works [3,7] in the literature lack notable guidance from the semantics of feature space, which contains crucial domain-independent ‘general knowledge’ that can be useful for domain generalization. The authors claim that their method is orthogonal to previous works.<br />
<br />
<br />
=== Model Agnostic Meta Learning ===<br />
a.k.a learning to learn is a learning paradigm in which optimal initial weights are found incrementally (episodic training) by minimizing a loss function over some similar tasks (meta-train, meta-test sets). Imagine a 4-shot 2-class image classification task as below:<br />
[[File:p5.png|800px|center]]<br />
Each of the training tasks provides an optimal initial weight for the next round of the training. By considering all of these sets of updates and meta-test set, the updated weights are calculated using the below algorithm.<br />
[[File:algo1.PNG|500px|center]]<br />
<br />
== Method ==<br />
In domain generalization, we assume that there are some domain-invariant patterns in the inputs (e.g. semantic features). These features can be extracted to learn a predictor that performs well across seen and unseen domains. This paper assumes that there are inter-class relationships across domains. In total, the MASF is composed of a '''task loss''', '''global class alignment''' term and a '''local sample clustering''' term.<br />
<br />
=== Task loss ===<br />
<math> F_{\psi}: X \rightarrow Z</math> where <math> Z </math> is a feature space<br />
<math> T_{\theta}: X \rightarrow \mathbf {R}^{C}</math> where <math> C </math> is the number of classes in <math> Y </math><br />
Assume that <math>\hat{y}= softmax(T_{\theta}(F_{\psi}(x))) </math>. The parameters <math> (\psi, \theta) </math> are optimized with minimizing a cross-entropy loss namely <math> \mathbf{L}_{task} </math> formulated as:<br />
<br />
<div style="text-align: center;"><br />
<math> l_{task}(y, \hat{y} = - \sum_{c}1[y=C]log(\hat{y}_{c})) </math><br />
</div><br />
<br />
Although the task loss is a decent predictor nothing prevents the model from overfitting to the source domains and suffering from degradation on unseen test domains. So the other loss terms are responsible for this aim.<br />
<br />
=== Global class alignment ===<br />
In semantic space, we assume there are relationships between class concepts. And those relationships are invariant to changes in observation domains. Capturing and preserving such class relationships can help models generalize well on unseen data. To achieve this, a global layout of extracted features are imposed such that the relative locations of extracted features reflect their semantic similarity. Since <math> L_{task} </math> focuses only on the dominant hard label prediction, the inter-class alignment across domains is disregarded. Hence, minimising symmetrized Kullback–Leibler (KL) divergence across domains, averaged over all <math> C </math> classes has been used:<br />
<div style="text-align: center;"> <br />
<math> l_{global}(D_{i}, D_{j}; \psi', \theta') = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{2}\left[D_{KL}(s_{c}^{(i)}||s_{c}^{(j)}) + D_{KL}(s_{c}^{(j)}||s_{c}^{(i)})\right], </math><br />
</div><br />
The authors stated that using other symmetric divergences, such as Jensen–Shannon (JS), showed no significant difference compared with the symmetrized KL divergence.<br />
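The global alignment term can be sketched as follows, assuming the per-class soft label distributions <math>s_{c}^{(i)}</math> are given as lists of probabilities (a simplified NumPy illustration, not the authors' code):<br />

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # D_KL(p || q) for discrete distributions (eps avoids log(0)).
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def global_alignment_loss(S_i, S_j):
    # S_i, S_j: per-class soft label distributions s_c from two domains;
    # the loss averages the symmetrized KL divergence over the C classes.
    C = len(S_i)
    return sum(0.5 * (kl(p, q) + kl(q, p)) for p, q in zip(S_i, S_j)) / C

S_i = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
S_j = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
print(global_alignment_loss(S_i, S_i))  # identical class layouts: ~0
print(global_alignment_loss(S_i, S_j))  # misaligned class layouts: positive
```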
<br />
=== Local sample clustering ===<br />
While <math> L_{global} </math> captures inter-class relationships, we also want semantic features to be close to each other locally. Explicit metric learning, i.e. contrastive or triplet losses, is used to ensure that the semantic features cluster locally according to class labels alone, regardless of domain. The contrastive loss takes two samples as input and pulls samples of the same class closer while pushing samples of different classes apart.<br />
[[File: contrastive.png | 400px]]<br />
<br />
In contrast, the triplet loss takes three samples as input: an anchor, a positive, and a negative. It encourages the anchor to be closer to the positive (same-class) sample than to the negative one.<br />
<div style="text-align: center;"><br />
<math><br />
l_{triplet}^{a,p,n} = \sum_{i=1}^{b} \sum_{k=1}^{c-1} \sum_{\ell=1}^{c-1} \left[ m + \|x_{i} - x_{k}\|_2^2 - \|x_{i} - x_{\ell}\|_2^2 \right]_+,
</math><br />
</div><br />
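A minimal sketch of the hinge-style triplet loss above for a single (anchor, positive, negative) triple, using squared Euclidean distances (illustrative only, not the paper's batched formulation):<br />

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge on the distance gap: [m + ||a - p||^2 - ||a - n||^2]_+
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, margin + d_pos - d_neg)

a = np.array([0.0, 0.0])          # anchor
p = np.array([0.1, 0.0])          # same-class sample, close by
n_far = np.array([5.0, 0.0])      # different class, already far: loss is 0
n_near = np.array([0.2, 0.0])     # different class, too close: loss > 0
print(triplet_loss(a, p, n_far), triplet_loss(a, p, n_near))
```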
<br />
== Model agnostic learning of semantic features ==<br />
These losses are combined in an episodic training scheme, shown in the figure below:<br />
[[File:algo2.PNG|700px|center]]<br />
<br />
== Experiments ==<br />
The usefulness of the proposed method is demonstrated on two common benchmark datasets for domain generalization, VLCS and PACS, alongside a real-world MRI medical imaging segmentation task. In all experiments, an AlexNet with ImageNet pre-trained weights is used. <br />
<br />
=== VLCS ===<br />
VLCS [12] is an aggregation of images from four other datasets: PASCAL VOC2007 (V) [13], LabelMe (L) [14], Caltech (C) [15], and SUN09 (S) [16].<br />
Evaluation follows leave-one-domain-out validation, with each domain randomly divided into 70% training and 30% test data.<br />
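The leave-one-domain-out protocol can be sketched as follows; the function below is an illustrative helper (not from the paper), with domains given as a name-to-samples mapping:<br />

```python
import random

def leave_one_domain_out(domains, test_domain, train_frac=0.7, seed=0):
    # domains: mapping from domain name to its list of samples.  The held-out
    # domain is used only for evaluation; every source domain is randomly
    # split into 70% training and 30% test data, as described above.
    rng = random.Random(seed)
    train, held_in_test = [], []
    for name, samples in domains.items():
        if name == test_domain:
            continue
        samples = list(samples)
        rng.shuffle(samples)
        cut = int(train_frac * len(samples))
        train.extend(samples[:cut])
        held_in_test.extend(samples[cut:])
    return train, held_in_test, list(domains[test_domain])

domains = {"V": list(range(10)), "L": list(range(10, 20)),
           "C": list(range(20, 30)), "S": list(range(30, 40))}
train, held_in, unseen = leave_one_domain_out(domains, "S")
print(len(train), len(held_in), len(unseen))  # 21 9 10
```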
<br />
<gallery><br />
File:p6.PNG|VLCS dataset<br />
</gallery><br />
<br />
Notably, MASF outperforms MLDG [4] on this dataset (see the table below), indicating that modeling semantic properties provides superior performance compared to relying solely on the highly abstracted task loss at meta-test time. "DeepAll" in the table is the baseline without domain generalization: only the class labels are used, regardless of the domain each sample lies in. <br />
<br />
[[File:table1_masf.PNG|600px|center]]<br />
<br />
=== PACS ===<br />
The more challenging domain generalization benchmark, with a significant domain shift, is the PACS dataset [17]. It contains the art painting, cartoon, photo, and sketch domains, with objects from seven classes: dog, elephant, giraffe, guitar, house, horse, and person.<br />
<gallery><br />
File:p7_masf.jpg|PACS dataset sample<br />
</gallery> <br />
<br />
As shown in the table below, MASF significantly outperforms the state-of-the-art methods JiGen [18], MLDG [4], and MetaReg [3]. In addition, the largest improvement (6.20%) is achieved when the unseen domain is "sketch", which requires more general knowledge about semantic concepts since it differs significantly from the other domains.<br />
<br />
[[File:table2_masf.PNG|600px|center]]<br />
<br />
=== Ablation study over PACS===<br />
The ablation study over the PACS dataset shows the effectiveness of each loss term. <br />
[[File:table3_masf.PNG|600px|center]]<br />
<br />
=== Deeper Architectures ===<br />
For stronger baseline results, the authors performed additional experiments using advanced deep residual architectures, namely ResNet-18 and ResNet-50. The table below shows strong and consistent improvements of MASF over the DeepAll baseline in all PACS splits for both network architectures. This suggests that the proposed algorithm is also beneficial for domain generalization with deeper feature extractors.<br />
[[File:Paper18_PacResults.PNG|600px|center]]<br />
<br />
=== Multi-site Brain MRI image segmentation === <br />
<br />
The effectiveness of MASF is also demonstrated on a segmentation task of MRI images gathered from four different clinical centers (Set-A, Set-B, Set-C, and Set-D). The domain shift in this case arises from differences in hardware, acquisition protocols, and many other factors, hindering the translation of learning-based methods to real clinical practice. The authors segment the brain images into four classes: background, grey matter, white matter, and cerebrospinal fluid. Such tasks have an enormous impact on clinical diagnosis and treatment planning. For example, a similar network that segments healthy brain tissue from tumorous tissue could aid surgeons in brain tumour resection.<br />
<br />
<gallery><br />
File:p8_masf.PNG|MRI dataset<br />
</gallery> <br />
<br />
<br />
The results show the effectiveness of MASF compared to a baseline without domain generalization.<br />
[[File:table5_masf.PNG|300px|center]]<br />
<br />
== Conclusion ==<br />
<br />
A new domain generalization technique that incorporates global and local constraints for learning semantic feature spaces was presented, and it outperforms the state of the art. Its effectiveness was demonstrated on two domain generalization benchmarks and a real clinical dataset (MRI image segmentation). The code is freely available at [19]. As future work, it would be interesting to integrate the proposed loss functions with other methods, as they are orthogonal to each other, and to evaluate the benefit of doing so. Investigating the use of this learning procedure in the context of generative models would also be an interesting research direction.<br />
<br />
== Critiques ==<br />
<br />
The purpose of this paper is to guide learning in a semantic feature space by leveraging local similarity, which the authors argue may contain essential domain-independent knowledge for domain generalization. Contrastive and triplet losses are adopted to encourage this local clustering. Robust semantic features that hold regardless of domain can be learned by leveraging across-domain class similarity information, which is valuable during learning. The learner would suffer from indistinct decision boundaries if it could not separate samples from different source domains on a domain-invariant feature space with class-specific cohesion. A major problem, which will become apparent with large datasets, is that these indistinct decision boundaries might still be sensitive to the unseen target domain.<br />
<br />
== References ==<br />
<br />
[1]: Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015.<br />
<br />
[2]: Hoffer, Elad, and Nir Ailon. "Deep metric learning using triplet network." International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015.<br />
<br />
[3]: Balaji, Yogesh, Swami Sankaranarayanan, and Rama Chellappa. "Metareg: Towards domain generalization using meta-regularization." Advances in Neural Information Processing Systems. 2018.<br />
<br />
[4]: Li, Da, et al. "Learning to generalize: Meta-learning for domain generalization." arXiv preprint arXiv:1710.03463 (2017).<br />
<br />
[5]: Li, Da, et al. "Episodic training for domain generalization." Proceedings of the IEEE International Conference on Computer Vision. 2019.<br />
<br />
[6]: Li, Haoliang, et al. "Domain generalization with adversarial feature learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.<br />
<br />
[7]: Li, Yiying, et al. "Feature-critic networks for heterogeneous domain generalization." arXiv preprint arXiv:1901.11448 (2019).<br />
<br />
[8]: Ghifary, Muhammad, et al. "Domain generalization for object recognition with multi-task autoencoders." Proceedings of the IEEE international conference on computer vision. 2015.<br />
<br />
[9]: Li, Ya, et al. "Deep domain generalization via conditional invariant adversarial networks." Proceedings of the European Conference on Computer Vision (ECCV). 2018<br />
<br />
[10]: Motiian, Saeid, et al. "Unified deep supervised domain adaptation and generalization." Proceedings of the IEEE International Conference on Computer Vision. 2017.<br />
<br />
[11]: Muandet, Krikamol, David Balduzzi, and Bernhard Schölkopf. "Domain generalization via invariant feature representation." International Conference on Machine Learning. 2013.<br />
<br />
[12]: Fang, Chen, Ye Xu, and Daniel N. Rockmore. "Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias." Proceedings of the IEEE International Conference on Computer Vision. 2013.<br />
<br />
[13]: Everingham, Mark, et al. "The pascal visual object classes (voc) challenge." International journal of computer vision 88.2 (2010): 303-338.<br />
<br />
[14]: Russell, Bryan C., et al. "LabelMe: a database and web-based tool for image annotation." International journal of computer vision 77.1-3 (2008): 157-173.<br />
<br />
[15]: Fei-Fei, Li. "Learning generative visual models from few training examples." Workshop on Generative-Model Based Vision, IEEE Proc. CVPR, 2004. 2004.<br />
<br />
[16]: Chopra, Sumit, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005.<br />
<br />
[17]: Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. "Deeper, broader and artier domain generalization". IEEE International Conference on Computer Vision (ICCV), 2017. <br />
<br />
[18]: Carlucci, Fabio M., et al. "Domain generalization by solving jigsaw puzzles." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.<br />
<br />
[19]: https://github.com/biomedia-mira/masf</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE&diff=45480SuperGLUE2020-11-21T15:21:51Z<p>Cfmeaney: /* Design Process */</p>
<hr />
<div><br />
== Presented by ==<br />
Shikhar Sakhuja<br />
<br />
== Introduction == <br />
Natural Language Processing (NLP) has seen immense improvements over the past two years. The improvements offered by RNN-based models such as ELMo [2], and Transformer [1] based models such as OpenAI GPT [3], BERT [4], etc., have revolutionized the field. These models render GLUE [5], the standard benchmark for NLP tasks, ineffective. The GLUE benchmark was released over a year ago and assessed NLP models using a single-number metric that summarized performance over a diverse set of tasks. However, transformer-based models now outperform non-expert humans on several tasks. With transformer-based models achieving near-perfect scores on almost all GLUE tasks and outperforming humans on some, there is a need for a new benchmark with harder and even more diverse language tasks. The authors release SuperGLUE as a new benchmark with a more rigorous set of language understanding tasks. <br />
<br />
<br />
== Related Work == <br />
There have been several benchmarks attempting to standardize the field of language understanding tasks. SentEval [6] evaluated fixed-size sentence embeddings for tasks. DecaNLP [7] converts tasks into a general question-answering format. GLUE offers a much more flexible and extensible benchmark since it imposes no restrictions on model architectures or parameter sharing. <br />
<br />
GLUE has been the gold standard for language understanding tests since its release. In fact, the benchmark has driven progress in language modeling, with transformer-based models all initially attempting to achieve high scores on GLUE. The original GPT and BERT models scored 72.8 and 80.2 on GLUE. The latest models, however, far outperform these numbers, creating the need for a more robust and difficult benchmark. <br />
<br />
<br />
== Motivation ==<br />
Transformer-based NLP models can be trained using transfer learning, which was previously seen mainly in computer vision tasks and was notoriously difficult for language because of the discrete nature of words. Transfer learning in NLP allows models to be trained over terabytes of language data in a self-supervised fashion. These models can then be fine-tuned for downstream tasks such as sentiment classification, fake news detection, etc. The fine-tuned models beat many human labellers who were not experts in the domain, creating a need for a newer, more robust benchmark that can stay relevant amid the rapid improvements in the field of NLP. <br />
<br />
[[File:loser glue.png]]<br />
<br />
Figure 1: Transformer-based models outperforming humans in GLUE tasks.<br />
<br />
== Design Process ==<br />
<br />
SuperGLUE is designed to be widely applicable to many different NLP tasks. In designing SuperGLUE, certain criteria needed to be established to determine whether an NLP task should be included. The authors specified six such requirements, listed below.<br />
<br />
#'''Task substance:''' Tasks should test a system's reasoning and understanding of English text.<br />
#'''Task difficulty:''' Tasks should be solvable by those who graduated from an English postsecondary institution.<br />
#'''Evaluability:''' Tasks are required to have an automated performance metric that aligns to human judgements of the output quality.<br />
#'''Public data:''' Tasks need to have existing public data for training with a preference for an additional private test set.<br />
#'''Task format:''' Preference for tasks with simpler input and output formats to steer users of the benchmark away from tasks specific architectures.<br />
#'''License:''' Task data must be under a license that allows the redistribution and use for research.<br />
<br />
To select the tasks included in the benchmark, the authors put out a public request for NLP tasks and received many submissions. They then filtered the tasks according to the criteria above, eliminating any that could not be used due to licensing issues or other problems.<br />
<br />
== SuperGLUE Tasks ==<br />
<br />
SuperGLUE has eight language understanding tasks that test a model's understanding of English text. The tasks are designed to be solvable by most college-educated English speakers, yet beyond the capabilities of most state-of-the-art systems today. <br />
<br />
'''BoolQ''' (Boolean Questions [9]): A QA task consisting of short passages and related questions about each passage, answered with either yes or no. <br />
<br />
'''CB''' (CommitmentBank [10]): A corpus of short texts in which at least one sentence contains an embedded clause; the task is to judge the writer's commitment to the truth of that clause. <br />
<br />
'''COPA''' (Choice of Plausible Alternatives [11]): A causal reasoning task in which, given a premise sentence, the system must choose its cause or effect from two candidate choices. <br />
<br />
'''MultiRC''' (Multi-Sentence Reading Comprehension [12]): A QA task in which, given a passage, a question, and candidate answers, the model labels each answer as true or false. <br />
<br />
'''ReCoRD''' (Reading Comprehension with Commonsense Reasoning Dataset [13]): A multiple-choice, question answering task, where given a passage with a masked entity, the model should be able to predict the masked out entity from the choices.<br />
<br />
'''RTE''' (Recognizing Textual Entailment [14]): Classifying whether a hypothesis can be plausibly inferred from a given passage. <br />
<br />
'''WiC''' (Word in Context [15]): Identifying whether a polysemous word used in multiple sentences is being used with the same sense across sentences or not. <br />
<br />
'''WSC''' (Winograd Schema Challenge [16]): A coreference resolution task where sentences include a pronoun and several noun phrases. The goal is to identify the noun phrase the pronoun refers to.<br />
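To make the task formats concrete, here are hypothetical instances in the spirit of two of the tasks above; the text is invented for illustration and is not drawn from the actual datasets:<br />

```python
# Hypothetical instances illustrating the BoolQ and COPA formats; the text
# below is made up for illustration, not taken from the real datasets.
boolq_example = {
    "passage": "The CN Tower is a 553 m tower in Toronto, completed in 1976.",
    "question": "was the cn tower completed in the 1970s",
    "label": True,   # yes/no answer
}

copa_example = {
    "premise": "The man turned on the faucet.",
    "question": "effect",                      # ask for the effect (vs. cause)
    "choice1": "Water flowed from the spout.",
    "choice2": "The toilet filled with water.",
    "label": 0,                                # index of the plausible choice
}
print(boolq_example["label"], copa_example["label"])
```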
<br />
== Model Analysis ==<br />
SuperGLUE includes two tasks for analyzing linguistic knowledge and gender bias in models. To analyze linguistic and world knowledge, submissions to SuperGLUE are required to include predictions of the sentence-pair relation (entailment, not_entailment) on the diagnostic set, formatted like the RTE task. As for gender bias, SuperGLUE includes the diagnostic dataset Winogender, which measures gender bias in coreference resolution systems. A poor bias score indicates gender bias; however, a good score does not necessarily mean a model is unbiased. This is one limitation of the dataset. <br />
<br />
<br />
== Results ==<br />
<br />
Table 1 summarizes the results from SuperGLUE across different models. CBOW baselines are generally close to chance performance. BERT, on the other hand, increased the SuperGLUE score by 25 points and had the largest improvement on most tasks, especially MultiRC, ReCoRD, and RTE. WSC is trickier for BERT, potentially owing to the small dataset size. <br />
<br />
BERT++ [8] increases BERT's performance even further. However, in line with the goal of the benchmark, the best model still lags behind human performance. Human results for WiC, MultiRC, RTE, and ReCoRD were already available in [15], [12], [17], and [13] respectively. For the remaining tasks, the authors employed crowdworkers to reannotate a sample of each test set according to the methods used in [17]. The large gaps should be difficult for models to close: the biggest margin is for WSC at 35 points, and CB, RTE, BoolQ, and WiC all have margins of about 10 points.<br />
<br />
<br />
[[File: 800px-SuperGLUE result.png]]<br />
<br />
Table 1: Baseline performance on SuperGLUE tasks.<br />
<br />
== Conclusion ==<br />
SuperGLUE fills the gap that GLUE has created owing to its inability to keep up with the SOTA in NLP. The new language tasks that the benchmark offers are built to be more robust and difficult to solve for NLP models. With the difference in model accuracy being around 10-35 points across all tasks, SuperGLUE is definitely going to be around for some time before the models catch up to it, as well. Overall, this is a significant contribution to improve general-purpose natural language understanding. <br />
<br />
== Critique == <br />
This is quite a fascinating read, in which the authors of the gold-standard benchmark have essentially conceded to progress in NLP. That Bowman's team resorted to creating a new benchmark altogether to keep up with the rapid pace of progress makes me wonder if these benchmarks are inherently flawed. Applying the idea of Wittgenstein's ruler: are we measuring the performance of models using the benchmark, or the quality of the benchmark using the models? <br />
<br />
I'm curious how long SuperGLUE will stay relevant given advances in NLP. GPT-3, released in June 2020, outperformed GPT-2 and BERT by a huge margin, given the roughly 100x increase in parameters (175B parameters trained on ~600GB of text for GPT-3, compared to 1.5B parameters on 40GB for GPT-2). In October 2020, a new deep learning technique (Pattern-Exploiting Training) trained a transformer NLP model with 223M parameters (roughly 0.1% of GPT-3's parameter count) that outperformed GPT-3 by 3 points on SuperGLUE. With the field improving so rapidly, I think SuperGLUE is nothing but a bandaid for benchmarking that will become obsolete in no time.<br />
<br />
== References ==<br />
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.<br />
<br />
[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1202. URL https://www.aclweb.org/anthology/N18-1202<br />
<br />
[3] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. Unpublished ms. available through a link at https://blog.openai.com/language-unsupervised/.<br />
<br />
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https: //arxiv.org/abs/1810.04805.<br />
<br />
[5] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=rJ4km2R5t7.<br />
<br />
[6] Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resource Association, 2018. URL https://www.aclweb.org/anthology/L18-1269.<br />
<br />
[7] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information processing Systems (NeurIPS). Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf.<br />
<br />
[8] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088, 2018. URL https://arxiv.org/abs/1811.01088.<br />
<br />
[9] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936,2019a.<br />
<br />
[10] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. 2019. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.<br />
<br />
[11] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.<br />
<br />
[12] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language technologies (NAACL-HLT). Association for Computational Linguistics, 2018. URL https://www.aclweb.org/anthology/papers/N/N18/N18-1023/.<br />
<br />
[13] Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885, 2018.<br />
<br />
[14] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. Springer, 2006. URL https://link.springer.com/chapter/10.1007/11736790_9.<br />
<br />
[15] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1808.09121.<br />
<br />
[16] Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2012. URL http://dl.acm.org/citation.cfm?id=3031843.3031909.<br />
<br />
[17] Nikita Nangia and Samuel R. Bowman. Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the Association of Computational Linguistics (ACL). Association for Computational Linguistics, 2019. URL https://woollysocks.github.io/assets/GLUE_Human_Baseline.pdf.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations&diff=44916ALBERT: A Lite BERT for Self-supervised Learning of Language Representations2020-11-16T16:26:22Z<p>Cfmeaney: /* Removing dropout */</p>
<hr />
<div>== Presented by == <br />
Maziar Dadbin<br />
<br />
==Introduction==<br />
In this paper, the authors have made some changes to the BERT model, and the result is ALBERT, a model that outperforms BERT on the GLUE, SQuAD, and RACE benchmarks. The important point is that ALBERT has fewer parameters than BERT-large but still produces better results. The changes are factorized embedding parameterization and cross-layer parameter sharing, two methods of parameter reduction. The authors also introduced a new loss function to replace one of the loss functions used in BERT (NSP). The last change is removing dropout from the model.<br />
<br />
== Motivation == <br />
In natural language representations, larger models often result in improved performance. However, at some point GPU/TPU memory and training time constraints limit our ability to increase the model size any further. There exist some attempts to reduce the memory consumption, but at the cost of speed. For example, Chen et al. (2016)[1] uses an extra forward pass, but reduces the memory requirements in a technique called gradient checkpointing. Moreover, Gomez et al. (2017)[2] leverages a method to reconstruct a layer's activations from its next layer, in order to eliminate the need to store these activations, freeing up the memory. In addition, Raffel et al. (2019)[3], leverage model parallelization while training a massive model. The authors of this paper claim that their parameter reduction techniques reduce memory consumption and increase training speed.<br />
<br />
==Model details==<br />
The fundamental structure of ALBERT is the same as BERT, i.e. it uses a transformer encoder with GELU nonlinearities. The authors set the feed-forward/filter size to 4H and the number of attention heads to H/64 (where H is the size of the hidden layer). Next, we explain the changes that have been applied to BERT.<br />
<br />
<br />
===Factorized embedding parameterization===<br />
In BERT (as well as subsequent models like XLNet and RoBERTa) we have <math display="inline">E=H</math>, i.e. the size of the vocabulary embedding (<math display="inline">E</math>) and the size of the hidden layer (<math display="inline">H</math>) are tied together. This is not an efficient choice, because we may need a large hidden layer but not a large vocabulary embedding. This is the case in many applications: the vocabulary embedding <math display="inline">E</math> is meant to learn context-independent representations, while the hidden-layer embedding <math display="inline">H</math> is meant to learn context-dependent representations, which is usually harder. However, if we increase <math display="inline">H</math> and <math display="inline">E</math> together, the number of parameters increases enormously, because the size of the vocabulary embedding matrix is <math display="inline">V \cdot E</math>, where <math display="inline">V</math> is the size of the vocabulary and is usually quite large. For example, <math display="inline">V</math> equals 30000 in both BERT and ALBERT. <br />
The authors proposed the following solution to the problem:<br />
Do not project one-hot vectors directly into the hidden space; instead, first project them into a lower-dimensional space of size <math display="inline">E</math>, and then project that into the hidden layer. This reduces the embedding parameters from <math display="inline">O(V \cdot H)</math> to <math display="inline">O(V \cdot E + E \cdot H)</math>, which is significant when <math display="inline">H</math> is much larger than <math display="inline">E</math>.<br />
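The parameter savings can be checked with a quick calculation; the sizes below (<math display="inline">H = 4096</math>, <math display="inline">E = 128</math>) are illustrative choices:<br />

```python
def embedding_params(V, H, E=None):
    # Direct V x H embedding vs. the factorized V x E + E x H parameterization.
    if E is None:
        return V * H              # O(V * H)
    return V * E + E * H          # O(V * E + E * H)

V, H, E = 30000, 4096, 128
direct = embedding_params(V, H)         # 122,880,000 parameters
factorized = embedding_params(V, H, E)  #   4,364,288 parameters
print(direct, factorized)
```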
<br />
===Cross-layer parameter sharing===<br />
Another method the authors used to reduce the number of parameters is to share parameters across layers. There are different strategies for parameter sharing: for example, one may share only the feed-forward network parameters or only the attention parameters. The default choice for ALBERT, however, is to share all parameters across layers.<br />
The following table shows the effect of different parameter-sharing strategies for two vocabulary embedding sizes. In both cases, sharing all parameters has a slight negative effect on accuracy, and most of this effect comes from sharing the FFN parameters rather than the attention parameters. Given this, the authors decided to share all parameters across layers, which results in a much smaller number of parameters; this in turn enables larger hidden layers, compensating for what is lost through parameter sharing. <br />
<br />
[[File:sharing.png | center |800px]]<br />
<br />
<br />
'''Why does cross-layer parameter sharing work?'''<br />
From the experiment results, we can see that cross-layer parameter sharing dramatically reduces the model size without hurting the accuracy too much. While it is obvious that sharing parameters can reduce the model size, it might be worth thinking about why parameters can be shared across BERT layers. Two of the authors briefly explained the reason in a blog. They noticed that the network often learned to perform similar operations at various layers (Soricut, Lan, 2019). Previous research also showed that attention heads in BERT behave similarly (Clark et al., 2019). These observations made it possible to use the same weights at different layers.<br />
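A toy sketch of the idea: the same weight matrix is reused at every layer, so the parameter count is independent of depth (a simplified stand-in for a transformer layer, not ALBERT itself):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
H, L = 8, 12

# One weight matrix reused by every layer (ALBERT's all-shared default),
# instead of L independent matrices as in BERT.
W_shared = rng.normal(size=(H, H)) / np.sqrt(H)

def forward(x, num_layers=L):
    for _ in range(num_layers):
        x = np.tanh(x @ W_shared)   # the same parameters at every layer
    return x

x = rng.normal(size=(H,))
out = forward(x)
print(W_shared.size, L * W_shared.size)  # 64 shared vs. 768 unshared
```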
<br />
===Inter-sentence coherence loss===<br />
<br />
BERT uses two loss functions, namely the masked language modelling (MLM) loss and the next-sentence prediction (NSP) loss. NSP is a binary classification loss where positive examples are two consecutive segments from the training corpus and negative examples pair segments from different documents. Negative and positive examples are sampled with equal probability. However, experiments show that NSP is not effective; it should also be pointed out that the NSP loss overlaps with the MLM loss in terms of topic prediction. In fact, the necessity of the NSP loss has been questioned in the literature (Lample and Conneau, 2019; Joshi et al., 2019). The authors explain the reason as follows:<br />
A negative example in NSP is misaligned from both topic and coherence perspective. However, topic prediction is easier to learn compared to coherence prediction. Hence, the model ends up learning just the easier topic-prediction signal. For example, the model can easily be trained to learn "I love cats" and "I had sushi for lunch" are not coherent as they are already very different topic-wise, but might not be able to tell that "I love cats" and "my mom owned a dog" are not next to each other.<br />
They address this problem by introducing a new loss, sentence order prediction (SOP), which is again a binary classification loss. Positive examples are the same as in NSP (two consecutive segments), but negative examples are the same two consecutive segments with their order swapped. SOP forces the model to learn the harder coherence-prediction task. The following table compares NSP with SOP. As we can see, NSP cannot solve the SOP task (it performs at chance, 52%), but SOP can solve the NSP task to an acceptable degree (78.9%). We also see that on average SOP improves results on downstream tasks by almost 1%. Therefore, the authors use MLM and SOP as the loss functions.<br />
<br />
<br />
<br />
[[File:SOPvsNSP.png | center |800px]]<br />
<br />
<br />
'''What does sentence order prediction (SOP) look like?'''<br />
<br />
'''Through a mathematical lens:'''<br />
<br />
First, we introduce some notation. <math display="inline">\vec{s_{j}}</math> is the <math display="inline">j^{th}</math> textual segment in a document <math display="inline">D</math>, where <math display="inline"> \vec{s_{j}} \in span \{ \vec{w^{j}_1}, \ldots , \vec{w^{j}_n} \} </math> and <math display="inline"> \vec{w^{j}_i} </math> is the <math display="inline">i^{th}</math> word in <math display="inline">\vec{s_{j}}</math>. The task of SOP is: given a pair of textual segments <math display="inline">(\vec{s_{k}}, \vec{s_{k+1}})</math>, where the subscripts <math display="inline">k</math> and <math display="inline">k+1</math> denote the original ordering, predict whether the two segments appear in their original order or have been swapped.<br />
<br />
<br />
'''Through a visual lens:'''<br />
<br />
[[File:SOP.PNG | center | 800px]]<br />
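The construction of SOP training pairs described above can also be expressed in code (a toy illustration with made-up sentences; this is not ALBERT's actual data pipeline):<br />

```python
import random

def make_sop_examples(segments, rng):
    """Build sentence-order-prediction examples from consecutive segments.

    Positive (label 1): the pair in its original order.
    Negative (label 0): the same two segments with their order swapped.
    """
    examples = []
    for a, b in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append(((a, b), 1))  # original order
        else:
            examples.append(((b, a), 0))  # swapped order
    return examples

rng = random.Random(42)
doc = ["I adopted a cat last spring.",
       "She sleeps on my desk while I work.",
       "Sometimes she knocks my pens to the floor."]
for pair, label in make_sop_examples(doc, rng):
    print(label, pair)
```

Note that, unlike NSP, both segments always come from the same document, so topic similarity carries no signal and the model must rely on coherence.<br />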
<br />
===Removing dropout===<br />
The last change the authors applied to BERT is removing dropout. Table 8 below shows the effect of removing dropout. They also observe that the model does not overfit the data even after 1M training steps. The authors point out that there is empirical [8] and theoretical [9] evidence suggesting that batch normalization in combination with dropout may have harmful effects, particularly in convolutional neural networks, and they speculate that dropout may have a similar effect here.<br />
[[File:dropout.png | center |800px]]<br />
<br />
===Effect of Network Depth and Width===<br />
<br />
Table 11 shows the effect of increasing the number of layers; in all of these settings the hidden-layer size is 1024. Increasing the depth of the model yields better and better results until the number of layers reaches 24; however, increasing the depth from 24 to 48 layers degrades the model's performance.<br />
<br />
[[File:ALBERT_table11.png | center |800px]]<br />
<br />
Table 12 shows the effect of the width of the model. The accuracy of the model improves until the width of the network reaches 4096; after that, any further increase in width appears to decrease the accuracy of the model.<br />
[[File:ALBERT_table12.png | center |800px]]<br />
<br />
Table 13 investigates whether we need a very deep model when the model is very wide. It seems that when H=4096, the difference in performance between models with 12 and 24 layers is negligible. <br />
[[File:ALBERT_table13.png | center |800px]]<br />
<br />
These three tables illustrate the logic behind the decisions the authors have made about the width and depth of the model.<br />
<br />
==Conclusion==<br />
By looking at the following table, we can see that ALBERT-xxlarge outperforms BERT-large on all the downstream tasks. Note that ALBERT-xxlarge uses a larger configuration (yet fewer parameters) than BERT-large, and as a result it is about 3 times slower.<br />
<br />
[[File:result.png | center |800px]]<br />
<br />
==Critiques==<br />
The authors mentioned that we usually get better results if we train our model for a longer time. Therefore, they present a comparison in which they trained both ALBERT-xxlarge and BERT-large for the same amount of time instead of the same number of steps. Here are the results:<br />
[[File:sameTime.png | center |800px]]<br />
<br />
However, in my opinion, it is not a fair comparison to let ALBERT-xxlarge train for 125K steps and note that BERT-large completes 400K steps in the same amount of time, because after a certain number of training steps, additional steps do not improve the result by much. It would be better to also look at the results when BERT-large is trained for 125K steps and ALBERT-xxlarge is trained for the same amount of time; I suspect that in that case the result would be in favour of BERT-large. It would also be nice to have a plot with time on the horizontal axis and accuracy on the vertical axis. We would then probably see that BERT-large is better at first, but that at some point ALBERT-xxlarge starts to give higher accuracy.<br />
<br />
==Reference==<br />
[1]: Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.<br />
<br />
[2]: Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems, pp. 2214–2224, 2017.<br />
<br />
[3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.<br />
<br />
[4]: Radu Soricut and Zhenzhong Lan. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. URL https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html<br />
<br />
[5]: Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. What Does BERT Look At? An Analysis of BERT's Attention. 2019. URL https://arxiv.org/abs/1906.04341<br />
<br />
[6]: Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. 2019. URL https://arxiv.org/abs/1907.10529<br />
<br />
[7]: Guillaume Lample and Alexis Conneau. Crosslingual language model pretraining. 2019. URL https://arxiv.org/abs/1901.07291<br />
<br />
[8]: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.<br />
<br />
[9]: Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690, 2019.</div>

Dense Passage Retrieval for Open-Domain Question Answering (2020-11-16)
<hr />
<div>= Presented by =<br />
Nicole Yan<br />
<br />
= Introduction =<br />
Open-domain question answering is the task of finding answers to questions from a large collection of documents. Modern open-domain QA systems usually use a two-stage framework: (1) a ''retriever'' that selects a subset of documents, and (2) a ''reader'' that fully reads the document subset and selects the answer spans. Stage one is usually done with bag-of-words models, which count overlapping words and their frequencies in documents; each document is represented by a high-dimensional, sparse vector. A common bag-of-words method that has been used for years is BM25, which ranks all documents based on the query terms appearing in each document. Stage one produces a small subset of documents where the answer might appear, and then in stage two a reader reads the subset and locates the answer spans. Stage two is usually done with neural models, like BERT. While stage two benefits greatly from the recent advances in neural language models, stage one still relies on traditional term-based models. This paper aims to improve stage one by using dense retrieval methods that generate dense, latent semantic document embeddings, and demonstrates that dense retrieval methods can not only outperform BM25 but also improve end-to-end QA accuracy. <br />
<br />
= Background =<br />
The following example shows clearly what problem open-domain QA systems tackle. Given the question "What is Uranus?", a system should find the answer spans in a large corpus, which may contain billions of documents. In stage one, a retriever selects a small set of potentially relevant documents, which are then fed to a neural reader in stage two for answer-span extraction. Only this filtered subset is processed by the neural reader, since neural reading comprehension is expensive; it is impractical to run a neural reader over billions of documents.<br />
<br />
= Dense Passage Retriever =<br />
This paper focuses on improving the retrieval component and proposed a framework called Dense Passage Retriever (DPR) which aims to efficiently retrieve the top K most relevant passages from a large passage collection. The key component of DPR is a dual-BERT model which encodes queries and passages in a vector space where relevant pairs of queries and passages are closer than irrelevant ones. <br />
<br />
== Model Architecture Overview ==<br />
DPR has two independent BERT encoders: a query encoder Eq and a passage encoder Ep. They map each input sentence to a d-dimensional real-valued vector, and the similarity between a query and a passage is defined as the dot product of their vectors. DPR uses the [CLS] token output as the embedding vector, so d = 768.<br />
<br />
== Training ==<br />
The training data can be viewed as m instances of (query, positive passage, negative passages) pairs. The loss function is defined as the negative log likelihood of the positive passages. [[File: dpr_loss_fn.png | 400px]]<br />
<br />
While positive passage selection is simple (the passage containing the answer is selected), negative passage selection is less obvious. The authors experimented with three types of negative passages: (1) random passages from the corpus; (2) false-positive passages returned by BM25; (3) gold positive passages from the training set, i.e., a positive passage for one query treated as a negative passage for another query. The authors obtained the best model by using gold positive passages from the same batch as negatives; this trick is called in-batch negatives. Assume there are B pairs of (query q_i, positive passage p_i) in a mini-batch; then the negative passages for query q_i are the passages p_j with j not equal to i. <br />
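The in-batch negative loss described above can be sketched with NumPy; random unit vectors stand in for the BERT [CLS] embeddings (a toy illustration, not the paper's implementation):<br />

```python
import numpy as np

def in_batch_negative_nll(q_vecs, p_vecs):
    """Mean negative log-likelihood of the positive passage per query.

    q_vecs, p_vecs: (B, d) arrays; p_vecs[i] is the gold passage for
    q_vecs[i], and every p_vecs[j] with j != i serves as a negative.
    """
    sims = q_vecs @ p_vecs.T                 # (B, B) dot-product scores
    sims -= sims.max(axis=1, keepdims=True)  # for numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_softmax)))  # gold pairs sit on the diagonal

rng = np.random.default_rng(0)
B, d = 4, 8
queries = rng.standard_normal((B, d))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)  # unit-norm toy embeddings

# Reuse the queries as perfectly matching gold passages for illustration.
loss = in_batch_negative_nll(queries, queries)
print(f"loss = {loss:.4f}")
```

The loss drops when each query scores its own gold passage above the other passages in the batch, which is exactly what the in-batch negative trick optimizes.<br />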
<br />
= Experimental Setup =<br />
The authors pre-processed Wikipedia documents and split each document into passages of length 100 words. These passages form a candidate pool. Five QA datasets are used: Natural Questions (NQ), TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1. To build the training data, the authors match each question in the five datasets with a passage that contains the correct answer. The dataset statistics are summarized below. <br />
<br />
[[File: DPR_datasets.png | 600px]]<br />
<br />
= Retrieval Performance Evaluation =<br />
The authors trained DPR on five datasets separately, and on the combined dataset. They compared DPR performance with the performance of the term-frequency based model BM25, and BM25+DPR. The DPR performance is evaluated in terms of the retrieval accuracy, ablation study, qualitative analysis, and run-time efficiency.<br />
<br />
== Main Results ==<br />
<br />
The table below compares the top-k (for k=20 or k=100) accuracies of different retrieval systems on various popular QA datasets. Top-k accuracy is the percentage of examples in the test set for which the correct outcome occurs among the k most likely outcomes predicted by the network. As can be seen, DPR consistently outperforms BM25 on all datasets except SQuAD. Additionally, DPR tends to perform particularly well when a smaller k value is chosen. The authors speculate that the lower performance of DPR on SQuAD has two causes inherent to the dataset itself. First, the SQuAD dataset has high lexical overlap between passages and questions. Second, the dataset is biased because it is collected from a small number of Wikipedia articles, a point which has been argued by other researchers as well.<br />
<br />
[[ File: retrieval_accuracy.png | 800px]]<br />
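The top-k accuracy metric used in this table can be computed with a short helper (a toy sketch with made-up passage rankings):<br />

```python
def top_k_accuracy(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold passage appears in the top-k results.

    ranked_ids: one ranked list of passage ids per query.
    gold_ids:   one set of answer-bearing passage ids per query.
    """
    hits = sum(1 for ranking, gold in zip(ranked_ids, gold_ids)
               if any(pid in gold for pid in ranking[:k]))
    return hits / len(ranked_ids)

# Three queries; the gold passage is ranked 1st, 3rd, and not retrieved.
rankings = [[7, 2, 9], [4, 1, 8], [5, 6, 3]]
gold = [{7}, {8}, {99}]
print(top_k_accuracy(rankings, gold, k=1))  # → 0.3333333333333333
print(top_k_accuracy(rankings, gold, k=3))  # → 0.6666666666666666
```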
<br />
== Ablation Study on Model Training ==<br />
The authors further analyzed how different training options influence model performance. The training options they studied are (1) Sample efficiency, (2) In-batch negative training, (3) Similarity and loss, and (4) Cross-dataset generalization. <br />
<br />
(1) '''Sample efficiency''' <br />
<br />
The authors examined how many training examples were needed to achieve good performance. The study showed that DPR trained with 1k examples already outperformed BM25. With more training data, DPR performs better.<br />
<br />
[[File: sample_efficiency.png | 500px ]]<br />
<br />
(2) '''In-batch negative training'''<br />
<br />
Three training schemes are evaluated on the development dataset. The first scheme, the standard 1-to-N training setting, pairs each query with one positive passage and n negative passages. As mentioned before, there are three ways to select negative passages: random, BM25, and gold. The results showed that in this setting the choice of negative passages does not have a strong impact on model performance; the top block of the table below shows the retrieval accuracy in the 1-to-N setting. The second scheme, called the in-batch negative setting, uses the positive passages of the other queries in the same batch as negatives. The middle block shows the in-batch negative training results: performance is significantly improved compared to the first setting. The last scheme augments the in-batch negatives with additional hard negative passages that are ranked highly by BM25 but do not contain the correct answer. The bottom block shows the results for this setting; the authors found that adding a single additional BM25 hard negative works best.<br />
<br />
[[File: training_scheme.png | 500px]]<br />
<br />
(3) '''Similarity and loss'''<br />
In this paper, the similarity between a query and a passage is measured by the dot product of their vectors, and the loss function is defined as the negative log-likelihood of the positive passages. The authors experimented with other similarity functions, such as the L2 norm and cosine distance, and other loss functions, such as the triplet loss. The results showed that these options did not improve model performance much.<br />
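The similarity options can be compared on a small example; note that cosine distance ignores vector magnitude, while the dot product and L2 distance do not (a toy sketch, not the paper's code):<br />

```python
import numpy as np

def dot_sim(q, p):
    return float(q @ p)

def cosine_sim(q, p):
    return float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p)))

def neg_l2_sim(q, p):
    return -float(np.linalg.norm(q - p))  # negated so larger means more similar

q = np.array([1.0, 2.0, 2.0])
p = np.array([2.0, 4.0, 4.0])  # same direction as q, twice the length

print(dot_sim(q, p))     # → 18.0
print(cosine_sim(q, p))  # → 1.0 (direction only)
print(neg_l2_sim(q, p))  # → -3.0 (penalizes the length difference)
```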
<br />
(4) '''Cross-dataset generalization'''<br />
Cross-dataset generalization studies if a model trained on one dataset can perform well on other unseen datasets. The authors trained DPR on Natural Questions dataset and tested it on WebQuestions and CuratedTREC. The result showed DPR generalized well, with only 3~5 points loss. <br />
<br />
<br />
== Qualitative Analysis ==<br />
Since BM25 is a bag-of-words method which ranks passages based on term-frequency, it's good at exact keywords and phrase matching, while DPR can capture lexical variants and semantic relationships. Generally DPR outperforms BM25 on the test sets. <br />
<br />
<br />
== Run-time Efficiency ==<br />
The time required for generating dense embeddings and indexing passages is long. It took the authors 8.8 hours to encode 21 million passages on 8 GPUs, and 8.5 hours to index the 21 million passages on a single server. Conversely, building an inverted index for 21 million passages takes only about 30 minutes. However, once the pre-processing is done, DPR can process 995 queries per second, while BM25 processes 23.7 queries per second.<br />
<br />
<br />
= Experiments: Question Answering =<br />
The authors evaluated DPR on end-to-end QA systems (i.e., retriever + neural reader). The results showed that higher retriever accuracy typically leads to better final QA results, and the passages retrieved by DPR are more likely to contain the correct answers. As shown in the table below, QA systems using DPR generally perform better, except for SQuAD.<br />
<br />
[[File: QA.png | 600px]]<br />
<br />
<br />
= Conclusion =<br />
In conclusion, this paper proposed a dense retrieval method that can generally outperform traditional bag-of-words methods in open-domain question answering tasks. Dense retrieval methods make use of the pre-trained language model BERT and achieved state-of-the-art performance on many QA datasets. Dense retrieval methods are good at capturing word variants and semantic relationships, but are relatively weak at capturing exact keyword matches. This paper also covers a few training techniques that are required to successfully train a dense retriever. Overall, learning dense representations for first-stage retrieval can potentially improve QA system performance, and has been receiving more attention. <br />
<br />
= References =<br />
[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=44787DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-16T01:08:01Z<p>Cfmeaney: /* Introduction */</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks which may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3,4]. One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement learning problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math> as feedback.<br />
<br />
The actions an agent decides to take are defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally, we define the functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value function of a given policy <math display="inline">\pi</math>, respectively.<br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
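The connection between <math display="inline">V_{\pi}</math>, <math display="inline">Q_{\pi}</math>, and the optimal policy can be made concrete with value iteration on a tiny known MDP (a toy two-state example for illustration; Dreamer itself does not use tabular value iteration):<br />

```python
import numpy as np

# Toy MDP: 2 states, 2 actions. P[a][s, s'] = transition probability,
# R[s, a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.1, 0.9], [0.7, 0.3]]])  # action 1
R = np.array([[1.0, 0.0],                 # rewards in state 0 for actions 0, 1
              [0.0, 2.0]])                # rewards in state 1 for actions 0, 1
gamma = 0.9

V = np.zeros(2)
for _ in range(500):  # value iteration: V(s) <- max_a Q(s, a)
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)  # pi*(s) = argmax_a Q(s, a)
print(policy)  # → [0 1]
```

Here the agent learns to take the locally rewarding action in each state, and the converged values satisfy the Bellman optimality equation.<br />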
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents interact with the environment in a sequential fashion, producing a sequence of states, actions, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. The full sequence of tuples <math display="inline">(S_t, A_t, R_t)</math> up to the terminal time is called one <b>episode</b>. This can be thought of as a feedback loop, or the sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the number of actions an agent is able to take is limited, which makes it difficult to interact with the environment enough to learn an accurate representation of the world. The method proposed in this paper aims to solve this problem by "imagining" the states and rewards that future actions would produce. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan ahead and perceive a representation of the environment without interacting with it. Once an action is actually taken, the agent updates its representation of the world with the real observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. It consists of:<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
There are three main components to the proposed algorithm:<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize the future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
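The imagination step in behavior learning (rolling forward with the learned transition, reward, and action models, with no environment interaction) can be sketched as follows; the small tanh-linear "models" here are random stand-ins for Dreamer's trained neural networks:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, horizon = 4, 2, 5

# Random stand-in parameters for the learned models (illustration only).
W_trans = 0.1 * rng.standard_normal((state_dim, state_dim + action_dim))
W_rew = 0.1 * rng.standard_normal(state_dim)
W_act = 0.1 * rng.standard_normal((action_dim, state_dim))

def transition(s, a):  # stands in for q_theta(s_t | s_{t-1}, a_{t-1})
    return np.tanh(W_trans @ np.concatenate([s, a]))

def reward(s):         # stands in for q_theta(r_t | s_t)
    return float(W_rew @ s)

def action(s):         # stands in for the policy q_phi(a_t | s_t)
    return np.tanh(W_act @ s)

def imagine(s0, horizon):
    """Roll out an imagined trajectory purely in latent space."""
    s, trajectory = s0, []
    for _ in range(horizon):
        a = action(s)
        s = transition(s, a)
        trajectory.append((a, s, reward(s)))
    return trajectory

trajectory = imagine(rng.standard_normal(state_dim), horizon)
print(len(trajectory))  # → 5
```

Because every step uses predicted states and rewards, the value and action networks can be trained by back-propagating through these imagined rollouts without any new environment samples.<br />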
<br />
The proposed algorithm is described below.<br />
<br />
[[File:dreamer.png|frameless|500px|Dreamer algorithm]]<br />
<br />
Notice that there are three neural networks that are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.<br />
<br />
== Results ==<br />
<br />
The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications, since many agents rely on prior experience that may be hard to obtain in the real world. As an extreme example, consider a reinforcement learning agent learning to perform rare surgeries: it may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26–38.<br />
<br />
[4] Nian, R., Liu, J., & Huang, B. (2020). A review On reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139, 106886.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION&diff=44786DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION2020-11-16T01:07:44Z<p>Cfmeaney: /* References */</p>
<hr />
<div>== Presented by == <br />
Bowen You<br />
<br />
== Introduction == <br />
<br />
Reinforcement learning refers to training a neural network to make a series of decisions dependent on a complex, evolving environment. Typically, this is accomplished by 'rewarding' or 'penalizing' the network based on its behaviors over time. Intelligent agents are able to accomplish tasks which may not have been seen in prior experiences. For recent reviews of reinforcement learning, see [3],[4] One way to achieve this is to represent the world based on past experiences. In this paper, the authors propose an agent that learns long-horizon behaviors purely by latent imagination and outperforms previous agents in terms of data efficiency, computation time, and final performance.<br />
<br />
=== Preliminaries ===<br />
<br />
This section aims to define a few key concepts in reinforcement learning. In the typical reinforcement problem, an <b>agent</b> interacts with the <b>environment</b>. The environment is typically defined by a <b>model</b> that may or may not be known. The environment may be characterized by its <b>state</b> <math display="inline"> s \in \mathcal{S}</math>. The agent may choose to take <b>actions</b> <math display="inline"> a \in \mathcal{A}</math> to interact with the environment. Once an action is taken, the environment returns a <b>reward</b> <math display="inline"> r \in \mathcal{R}</math>as feedback.<br />
<br />
The actions an agent decides to take is defined by a <b>policy</b> function <math display="inline"> \pi : \mathcal{S} \to \mathcal{A}</math>. <br />
Additionally we define functions <math display="inline"> V_{\pi} : \mathcal{S} \to \mathbb{R} \in \mathcal{S}</math> and <math display="inline"> Q_{\pi} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}</math> to represent the value function and action-value functions of a given policy <math display="inline">\pi</math> respectively.<br />
<br />
Thus the goal is to find an optimal policy <math display="inline">\pi_{*}</math> such that <br />
\[<br />
\pi_{*} = \arg\max_{\pi} V_{\pi}(s) = \arg\max_{\pi} Q_{\pi}(s, a)<br />
\]<br />
<br />
=== Feedback Loop ===<br />
<br />
Given this framework, agents are able to interact with the environment in a sequential fashion, namely a sequence of actions, states, and rewards. Let <math display="inline"> S_t, A_t, R_t</math> denote the state, action, and reward obtained at time <math display="inline"> t = 1, 2, \ldots, T</math>. We call the tuple <math display="inline">(S_t, A_t, R_t)</math> one <b>episode</b>. This can be thought of as a feedback loop or a sequence<br />
\[<br />
S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T<br />
\]<br />
<br />
== Motivation ==<br />
<br />
In many problems, the amount of actions an agent is able to take is limited. Then it is difficult to interact with the environment to learn an accurate representation of the world. The proposed method in this paper aims to solve this problem by "imagining" the state and reward that the action will provide. That is, given a state <math display="inline">S_t</math>, the proposed method generates <br />
\[<br />
\hat{A}_t, \hat{R}_t, \hat{S}_{t+1}, \ldots<br />
\]<br />
<br />
By doing this, an agent is able to plan-ahead and perceive a representation of the environment without interacting with it. Once an action is made, the agent is able to update their representation of the world by the actual observation. This is particularly useful in applications where experience is not easily obtained. <br />
<br />
== Dreamer == <br />
<br />
The authors of the paper call their method Dreamer. It consists of:<br />
* Representation <math display="inline">p_{\theta}(s_t | s_{t-1}, a_{t-1}, o_{t}) </math><br />
* Transition <math display="inline">q_{\theta}(s_t | s_{t-1}, a_{t-1}) </math><br />
* Reward <math display="inline"> q_{\theta}(r_t | s_t)</math><br />
* Action <math display="inline"> q_{\phi}(a_t | s_t)</math><br />
* Value <math display="inline"> v_{\psi}(s_t)</math><br />
<br />
where <math display="inline"> \theta, \phi, \psi</math> are learned neural network parameters.<br />
<br />
There are three main components to the proposed algorithm:<br />
* Dynamics Learning: Using past experience data, the agent learns to encode observations and actions into latent states and predicts environment rewards. One way to do this is via representation learning.<br />
* Behavior Learning: In the latent space, the agent predicts state values and actions that maximize the future rewards through back-propagation.<br />
* Environment Interaction: The agent encodes the episode to compute the current model state and predict the next action to interact with the environment.<br />
<br />
The proposed algorithm is described below.<br />
<br />
[[File:dreamer.png|frameless|500px|Dreamer algorithm]]<br />
<br />
Notice that there are three neural networks that are trained simultaneously. <br />
The neural networks with parameters <math display="inline"> \theta, \phi, \psi </math> correspond to models of the environment, action and values respectively.<br />
<br />
== Results ==<br />
<br />
The figure below summarizes the performance of Dreamer compared to other state-of-the-art reinforcement learning agents for continuous control tasks. Overall, it achieves the most consistent performance among them. Additionally, while other agents heavily rely on prior experience, Dreamer is able to learn behaviors with minimal interactions with the environment.<br />
<br />
[[File:scores.png|frameless|500px|Comparison of RL-agents against several continuous control tasks]]<br />
<br />
== Conclusion ==<br />
<br />
This paper presented a new algorithm for training reinforcement learning agents with minimal interactions with the environment. The algorithm outperforms many previous algorithms in terms of computation time and overall performance. This has many practical applications as many agents rely on prior experience which may be hard to obtain in the real-world. Although it may be an extreme example, consider a reinforcement learning agent that learns how to perform rare surgeries may not have enough data samples. This paper shows that it is possible to train agents without requiring many prior interactions with the environment.<br />
<br />
== References ==<br />
<br />
[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020.<br />
<br />
[2] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.<br />
<br />
[3] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.<br />
<br />
[4] R. Nian, J. Liu, and B. Huang. A review on reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139:106886, 2020.</div>Cfmeaney
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=44725Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-15T22:42:47Z<p>Cfmeaney: /* Approach */</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model into making a wrong prediction. This is done by adding a small, carefully chosen perturbation to the original image, producing a new image that looks unchanged to a human but that the model misclassifies. The following image describes an adversarial attack in which a model is deceived by adding a small amount of noise to an input image, causing the model's prediction to change.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars, where the model installed in a car is trying to read a STOP sign on the road. If the STOP sign is replaced by an adversarial version of the original sign, and that new image fools the model into deciding not to stop, the result can be an accident. Hence it is important to design classifiers that are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images: for any given image, multiple augmented versions are created and passed to the network so that the model learns from them as well. During the validation phase, after labeling an image, the defense checks whether an image of a different label exists within a ball of a certain radius around the input. If the classifier assigns every image within that ball the same class label, a certificate is issued. The certificate guarantees that no perturbation within the ball can change the label; defenses that produce such guarantees are called certified defenses. The image below shows a certified region (in red).<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
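To make the ball check concrete, here is a crude Monte Carlo illustration (this is not how real certified defenses work: methods such as randomized smoothing derive the radius analytically; the toy classifier and all names below are illustrative assumptions): sample points inside an <math>l_2</math> ball around the input and check that they all receive the same label.<br />

```python
import numpy as np

def toy_classifier(x):
    """Stand-in linear classifier: label 1 if the coordinates sum past a threshold."""
    return int(x.sum() > 1.0)

def appears_certified(x, radius, n_samples=1000, seed=0):
    """Monte Carlo check: do sampled points within `radius` of x all share x's label?

    This only *suggests* robustness; genuine certificates are proved analytically.
    """
    rng = np.random.default_rng(seed)
    label = toy_classifier(x)
    for _ in range(n_samples):
        d = rng.normal(size=x.shape)
        # rescale to a point uniformly distributed inside the l2 ball of `radius`
        d *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(d)
        if toy_classifier(x + d) != label:
            return False
    return True
```

A point far from the decision boundary passes the check at a given radius, while a point close to the boundary fails it, which is exactly the distinction a certificate is meant to capture.<br />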
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013): natural-looking but minutely altered images can manipulate these models into misclassifications. Over the last few years, several defenses have been built to protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but these defenses rely on heuristics and tricks that are often easily broken (Athalye et al., 2018). This has motivated many researchers to work on certifiably secure networks: classifiers that produce a label for an image and, at the same time, a guarantee that the input has not been adversarially manipulated. Most certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks with <math>p = 2</math> or <math>p = \infty</math>.<br />
<br />
In this paper, the authors demonstrate that a system relying on certificates as a measure of label security can be exploited. The central idea is that even though a system has a certified defense mechanism, this does not guarantee security against adversarial attacks. The authors show this by presenting a new class of adversarial examples that target not only the classifier's output label but also its certificate. The first step is to add adversarial perturbations that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the original image's certificate region), producing attack images that lie outside the original certificate boundary and that are surrounded by images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite the input being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility:''' the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification:''' the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified:''' the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used in this paper is the 'Shadow Attack', a generalization of the well-known PGD attack. The fundamental idea is the same as in PGD: adversarial images are created in order to fool the network into making a wrong prediction. The PGD attack solves the following optimization problem, where <math>L</math> is the classification loss and the constraint limits the change made to the input image. For a recent review of adversarial attacks and more information on PGD attacks, see [1].<br />
<br />
\begin{align}<br />
\max_{\delta} L\left( \theta, x + \delta \right) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t. } \left\| \delta \right\|_{p} \leq \epsilon<br />
\end{align}<br />
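To make equation \eqref{eq:op} concrete, here is a minimal numpy sketch of an <math>l_\infty</math> PGD loop: signed gradient-ascent steps on the loss, each followed by projection of <math>\delta</math> back into the <math>\epsilon</math>-ball. The gradient function and the toy loss are placeholders; a real attack would backpropagate through the network.<br />

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=0.1, alpha=0.02, steps=10):
    """l_inf PGD sketch for Eq. (1): repeatedly step in the sign of the
    loss gradient, then project delta back so that ||delta||_inf <= eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x + delta)                # dL/dx at the current iterate
        delta = delta + alpha * np.sign(g)    # gradient-ascent step
        delta = np.clip(delta, -eps, eps)     # projection onto the eps-ball
    return x + delta

# Toy loss L(x) = sum(x), whose gradient is 1 everywhere, so PGD pushes
# every coordinate to the +eps boundary of the ball.
x_adv = pgd_attack(np.zeros(4), grad_fn=np.ones_like, eps=0.1)
print(x_adv)  # [0.1 0.1 0.1 0.1]
```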
<br />
The Shadow Attack, on the other hand, targets the certificate of the defense by creating a new 'spoofed' certificate outside the certified region of the input image. It solves the following optimization problem, where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left( \theta, x+\delta \right) - \lambda_{c}C\left( \delta \right) - \lambda_{tv}TV\left( \delta \right) - \lambda_{s}Dissim\left( \delta \right) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> is the color regularizer, which ensures that minimal changes are made to the color of the input image. <math>TV</math> is the Total Variation (smoothness) term, which ensures that the newly created image remains smooth. <math>Dissim</math> is the similarity term, which encourages all the color channels (RGB) to be changed equally.<br />
<br />
The perturbations created in the original images are:<br />
<br />
'''1.''' small,<br />
<br />
'''2.''' smooth, and<br />
<br />
'''3.''' without dramatic color changes.<br />
<br />
There are two ways to ensure that this channel dissimilarity does not occur or remains very low, and the authors show that both methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \; \forall i</math>, i.e., for each pixel the perturbations of all channels are equal, so a single <math>\delta_{W \times H}</math> map serves as the perturbation for an image of size <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta) = 0</math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
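As an illustration, the three regularizers can be sketched as follows. The <math>Dissim</math> term matches the formula above; the exact forms of <math>TV</math> and <math>C</math> used here (anisotropic total variation and per-channel mean magnitude) are plausible assumptions for illustration, not necessarily the paper's definitions.<br />

```python
import numpy as np

def tv(delta):
    # Anisotropic total variation: sum of absolute differences between
    # neighbouring pixels; large values indicate a non-smooth perturbation.
    return np.abs(np.diff(delta, axis=-2)).sum() + np.abs(np.diff(delta, axis=-1)).sum()

def color_reg(delta):
    # C(delta) (assumed form): magnitude of the mean color shift per channel.
    return float(sum(abs(delta[c].mean()) for c in range(delta.shape[0])))

def dissim(delta, p=2):
    # Pairwise l_p distances between the R, G, B perturbation channels.
    r, g, b = delta
    lp = lambda a: np.linalg.norm(a.ravel(), ord=p)
    return lp(r - b) + lp(g - b) + lp(r - g)

# A 1-channel attack tiles one W x H map across all three channels,
# so Dissim(delta) is exactly zero.
one_channel = np.tile(np.random.default_rng(0).normal(size=(1, 8, 8)), (3, 1, 1))
print(dissim(one_channel))  # 0.0
```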
<br />
== Ablation Study of the Attack Parameters ==<br />
In order to determine the required number of SGD steps and the effect of <math>\lambda_{tv}</math> and <math>\lambda_s</math> on each loss term in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, the classification loss <math>L(\delta)</math>, the Total Variation loss <math>TV(\delta)</math>, and the color regularizer <math>C(\delta)</math> converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was identically zero. <br />
In figures 6 and 7, we can see the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors ran two experiments to show that their attack on certified models was actually able to break those defenses. The datasets used for both experiments were CIFAR-10 and ImageNet.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_p</math>-norm bounded attacks in which the deep neural network is trained on randomly noise-augmented batches of images. Perturbations are made to the original image so that they satisfy the previously defined conditions, and spoofed certificates are generated for an incorrect class by generating multiple adversarial images.<br />
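A Monte Carlo sketch of this defense, in the style of Cohen et al. (2019), is shown below: the smoothed classifier takes a majority vote over Gaussian-noised copies of the input and certifies a radius of <math>\sigma \Phi^{-1}(p_A)</math>. A faithful implementation would use a confidence lower bound on <math>p_A</math> rather than the raw vote fraction, so treat this as illustrative; the toy base classifier is a made-up stand-in.<br />

```python
import numpy as np
from math import erf, sqrt

def norm_ppf(p):
    # Inverse standard-normal CDF by bisection (avoids a scipy dependency).
    lo, hi = -8.0, 8.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * (1 + erf(mid / sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def smoothed_certify(base_classifier, x, sigma=0.25, n=1000, seed=0):
    """Randomized-smoothing sketch: classify noisy copies of x, take the
    majority label, and certify radius sigma * Phi^{-1}(p_A), where p_A
    is the top-class vote fraction (radius 0 when there is no majority)."""
    rng = np.random.default_rng(seed)
    labels = [base_classifier(x + rng.normal(0, sigma, x.shape)) for _ in range(n)]
    votes = np.bincount(labels)
    top = int(votes.argmax())
    p_a = votes[top] / n
    radius = sigma * norm_ppf(p_a) if p_a > 0.5 else 0.0
    return top, radius

# Toy base classifier on R^4: label 1 iff the coordinates sum to a positive value.
label, radius = smoothed_certify(lambda z: int(z.sum() > 0), np.full(4, 1.0))
print(label, radius > 0)  # 1 True
```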
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1 :''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and also natural images (larger radii mean a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns give the mean certified radius of the original images and the mean radius of the spoofed certificates of the perturbed images, respectively. The mean certified radius of the adversarial images was greater than that of the original images, which shows that the Shadow Attack succeeded in creating spoofed certificates with larger radii and the wrong label, and therefore in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
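CROWN-IBP combines interval bound propagation (IBP) with tighter linear (CROWN) relaxation bounds during training. The IBP half can be sketched in a few lines: interval bounds are pushed through an affine layer via the sign split of the weights, and through the monotone ReLU elementwise. This is a simplified illustration of IBP only, on a made-up two-neuron layer, not the full CROWN-IBP procedure.<br />

```python
import numpy as np

def ibp_affine(lo, hi, W, b):
    # Bounds for y = W x + b over the box [lo, hi]: the centre maps through W,
    # the radius maps through |W|.
    center, rad = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ rad
    return c - r, c + r

def ibp_relu(lo, hi):
    # ReLU is monotone, so interval bounds map through it elementwise.
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# Propagate an l_inf ball of radius eps around x through one affine + ReLU layer.
x, eps = np.array([1.0, -1.0]), 0.1
W, b = np.array([[1.0, 2.0], [-1.0, 0.5]]), np.zeros(2)
lo, hi = ibp_affine(x - eps, x + eps, W, b)
lo, hi = ibp_relu(lo, hi)
print(lo, hi)  # both neurons are provably inactive on this ball
```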
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2 :''' “Robust error” for natural images, and “attack error” for Shadow Attack images using the<br />
CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust error for the CROWN-IBP method and the attack error for the attack images. The attack errors were lower than the corresponding robust errors for CROWN-IBP, which suggests that the authors' Shadow Attack was also successful in breaking <math>l_\infty</math>-norm certified defenses.<br />
<br />
== Conclusion ==<br />
From the above approach used in a couple of experiments, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper would be that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques ==<br />
<br />
A noticeable weakness in this line of work is that the defenses and certificates are formulated purely in terms of an <math>l_p</math> constraint, as assumed in equation \eqref{eq:op}. Top models cannot achieve certificates beyond a disturbance of <math>\epsilon = 0.3</math> in the <math>l_2</math> norm, while disturbances of <math>\epsilon = 4</math> added to the target input are barely noticeable to human eyes, and images perturbed with <math>\epsilon = 100</math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of high-dimensional image space goes beyond what the <math>l_p</math> norm is capable of capturing. More comprehensive metrics and algorithms, capable of capturing the correlation between pixels of an image and of conveying to optimization algorithms how humans distinguish features of an input image, have yet to be proposed. Such a metric would give optimization algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.<br />
<br />
[1] Xu, H., Ma, Y., Liu, H. C., Deb, D., Liu, H., Tang, J. L., & Jain, A. K. (2020). Adversarial Attacks and Defenses in Images, Graphs and Text: A Review. International Journal of Automation and Computing, 17(2), 151–178.</div>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model or to cause the model to make a wrong prediction. This is done by adding a little noise to the original image or perturbing an original image and creating an image that is not identified by the network and therefore, the model misclassifies the new image. The following image describes an adversarial attack where a model is deceived by an attacker by adding a small noise to an input image and as a result, the prediction of the model changes.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars where the model installed in a car is trying to read a STOP sign on the road. However, if the STOP sign is replaced by an adversarial image of the original image, and if that new image is able to fool the model to not make a decision to stop, it can lead to an accident. Hence it becomes really important to design the classifiers such that these classifiers are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, there are multiple augmented images created and passed to the network to ensure that a model is able to learn from the augmented images as well. During the validation phase, after labeling an image, the defenses check whether there exists an image of a different label within a region of a certain unit radius of the input. If the classifier assigns all images within the specified region ball the same class label, then a certificate is issued. This certificate ensures that the model is protected from adversarial attacks and is called Certified Defense. The image below shows a certified region (in red)<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013) in a way that natural-looking but minutely augmented images have been able to manipulate those models by causing misclassifications. While in the last few years, several defenses have been built that protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), but the defenses are based on heuristics and tricks that are often easily breakable (Athalye et al. 2018). This has motivated a lot of researchers to work on certifiably secure networks — classifiers that produce a label for an image, and at the same time guarantee that the input is not adversarially manipulated. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p</math> = 2 or infinity.<br />
<br />
In this paper, the authors have demonstrated that a system that relies on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though the system has a certified defense mechanism, it does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier output label but also the certificate. The first step is to add adversarial perturbations to images that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the certificate region of the original image) and produce attack images that are outside the certificate boundary of the original image certificate and has images of the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility: the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification: the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified: the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is 'Shadow Attack', which is a generalization of the well known PGD attack. The fundamental idea of the PGD attack is the same where a bunch of adversarial images are created in order to fool the network to make a wrong prediction. PGD attack solves the following optimization problem where <math>L</math> is the classification loss and the constraint corresponds to the minimal change done to the input image. For more information of PGD attacks, see [1].<br />
<br />
\begin{align}<br />
max_{\delta }L\left ( \theta, x + \delta \right ) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
s.t. \left \|\delta \right \|_{p} \leq \epsilon <br />
\end{align}<br />
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
max_{\delta} L\left (\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> in the above equation corresponds to the color regularizer which makes sure that minimal changes are made to the color of the input image. <math>TV</math> corresponds to the Total Variation or smoothness parameter which makes sure that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity parameter which makes sure that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations created in the original images are - <br />
<br />
'''1. small<br />
<br />
'''2. smooth<br />
<br />
'''3. without dramatic color changes<br />
<br />
There are two ways to ensure that this dissimilarity will not happen or will be very low and the authors have shown that both of these methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \forall i </math> i.e. for each pixel, the perturbations of all channels are equal and there will be <math> \delta_{ W \times H} </math>, where the size of the image is <math>3 \times W \times H</math> as the preturbation. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal and it uses <math> \delta_{3 \times W \times H} </math> with the <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math> as the dissimilarity cost function.<br />
<br />
== Ablation Study of the Attack Parameters ==<br />
To determine the required number of SGD steps and the effect of the weights <math> \lambda_{tv}</math> and <math> \lambda_s</math> on each loss term in the objective, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, the classification loss <math>L(\delta)</math>, the Total Variation loss <math>TV(\delta)</math>, and the color regularizer <math>C(\delta)</math> converge to zero within about 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was identically zero. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
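The attack optimization itself is plain first-order ascent on the penalized objective of equation \eqref{eq:op1}. The toy sketch below illustrates the loop; the quadratic "classification loss" is a stand-in for the real network loss <math>L(\theta, x+\delta)</math>, and the step size and <math>\lambda_c</math> values are arbitrary illustrative choices, not the authors'.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(size=(3, 8, 8))        # direction of high toy loss

def toy_loss(delta):
    # stand-in for L(theta, x + delta); maximized when delta == target
    return -((delta - target) ** 2).sum()

lam_c, lr = 0.1, 0.05                      # illustrative values
delta = np.zeros_like(target)
start = toy_loss(delta)
for _ in range(10):                        # ~10 steps suffice in the paper's ablation
    grad_L = -2.0 * (delta - target)       # gradient of the toy loss
    # gradient of C(delta) = sum of squared channel means (assumed form)
    means = delta.mean(axis=(1, 2), keepdims=True)
    grad_C = 2.0 * means / delta[0].size
    delta += lr * (grad_L - lam_c * grad_C)  # ascend the penalized objective

print(start, toy_loss(delta))              # objective increases over the 10 steps
```

The real attack uses the network's gradient in place of `grad_L` and includes the <math>TV</math> and <math>Dissim</math> penalty gradients as well.<br />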
<br />
== Experiments ==<br />
The authors ran two experiments to show that their attack on certified models is actually able to break those defenses. Both experiments used the CIFAR-10 and ImageNet datasets.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized Smoothing is an adversarial defense against <math>l_2</math>-norm bounded attacks: the base network is trained on batches of images augmented with random noise, and the smoothed classifier's prediction margin yields a certified radius. Perturbations are made to the original image so that they satisfy the previously defined conditions, and spoofed certificates are generated for an incorrect class by generating multiple adversarial images.<br />
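For reference, the certified radius that Randomized Smoothing reports, and that the spoofed certificates inflate, follows the standard bound <math>R = \sigma \, \Phi^{-1}(p_A)</math> from Cohen et al. (2019), where <math>p_A</math> lower-bounds the smoothed classifier's top-class probability. A minimal sketch:<br />

```python
from statistics import NormalDist  # standard normal quantile, Python >= 3.8

def certified_radius(p_a: float, sigma: float) -> float:
    """l2 certified radius R = sigma * Phi^{-1}(p_A) (Cohen et al., 2019).

    p_a: lower bound on the smoothed classifier's top-class probability,
         in (0, 1); the certificate is vacuous for p_a <= 0.5.
    sigma: standard deviation of the Gaussian smoothing noise.
    """
    return sigma * NormalDist().inv_cdf(p_a)

# A more confident top class (larger p_A) yields a larger certified radius.
print(certified_radius(0.99, 0.25), certified_radius(0.9, 0.25))
```

Spoofing a certificate thus amounts to pushing the smoothed classifier's vote for the wrong class as close to 1 as possible, which is exactly what the larger mean radii in Table 1 reflect.<br />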
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1:''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and for natural images (larger radii mean a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns give the mean radius of the certified region of the natural images and the mean radius of the spoofed certificates of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the natural images, showing that the Shadow Attack successfully created spoofed certificates with larger radii for the wrong label, and hence that the approach succeeded in breaking the certified defense.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is a certified adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' “Robust error” for natural images, and “attack error” for Shadow Attack images, using the<br />
CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust error of the CROWN-IBP method on natural images alongside the attack error on Shadow Attack images. The attack errors are smaller than the corresponding robust errors, which suggests that the authors' Shadow Attack was also successful in breaking <math>l_\infty</math>-norm certified defenses.<br />
<br />
== Conclusion ==<br />
From the experiments above, we can conclude that it is possible to produce adversarial examples with ‘spoofed’ certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper is that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques ==<br />
<br />
A noticeable issue is that both the defenses and the attack rely on the mathematical formulation of <math> l_{p} </math>-norm constraints, as assumed in equation \eqref{eq:op}, which is a weak notion of perceptual similarity. The top models cannot achieve certifications beyond an <math> \epsilon = 0.3 </math> disturbance in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to human eyes, and images perturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of multi-dimensional space goes beyond what the <math> l_{p} </math> norm is capable of capturing. More comprehensive metrics and algorithms, capable of capturing the correlation between pixels of an image and of conveying to optimization algorithms how humans distinguish features of an input image, have yet to be proposed. Such a metric would give optimization algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates&diff=44712Breaking Certified Defenses: Semantic Adversarial Examples With Spoofed Robustness Certificates2020-11-15T22:31:16Z<p>Cfmeaney: /* Introduction */</p>
<hr />
<div><br />
== Presented By ==<br />
Gaurav Sikri<br />
<br />
== Background ==<br />
<br />
Adversarial examples are inputs to machine learning or deep neural network models that an attacker intentionally designs to deceive the model into making a wrong prediction. This is done by adding a small perturbation to the original image, creating a new image that the network misclassifies. The following image illustrates an adversarial attack, where a model is deceived by adding a small amount of noise to an input image, which changes the model's prediction.<br />
<br />
[[File:adversarial_example.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 1:''' Adversarial Example </div><br />
<br />
The impacts of adversarial attacks can be life-threatening in the real world. Consider the case of driverless cars, where the model installed in a car is trying to read a STOP sign on the road. If the STOP sign is replaced by an adversarial version of the original image, and the new image fools the model into not deciding to stop, it can lead to an accident. Hence it becomes critically important to design classifiers that are immune to such adversarial attacks.<br />
<br />
While training a deep network, the network is trained on a set of augmented images along with the original images. For any given image, multiple augmented images are created and passed to the network so that the model can learn from the augmented images as well. During the validation phase, after labeling an image, the defense checks whether an image of a different label exists within a ball of a certain radius around the input. If the classifier assigns every image within the specified ball the same class label, then a certificate is issued. This certificate guarantees that the model's prediction is robust within that region, and such a defense is called a certified defense. The image below shows a certified region (in red).<br />
<br />
[[File:certified_defense.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Figure 2:''' Certified Defense Illustration </div><br />
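The certificate check described above can be sketched in a few lines. This is a toy illustration (a hypothetical 1-D classifier checked by brute force on a grid), not the procedure used by any specific defense:<br />

```python
# Toy sketch of a robustness certificate: a 1-D "classifier" labels points by
# sign, and a certificate of radius r is issued for an input x only if every
# point within r of x receives the same label as x.

def classify(x):
    # hypothetical 1-D classifier: decision boundary at 0
    return 1 if x >= 0 else 0

def is_certified(x, r, steps=1000):
    """Check (by brute force on a grid) that all points within r of x share x's label."""
    label = classify(x)
    for i in range(steps + 1):
        probe = x - r + (2 * r) * i / steps
        if classify(probe) != label:
            return False
    return True

print(is_certified(2.0, 1.5))  # True: the ball [0.5, 3.5] stays on one side
print(is_certified(2.0, 2.5))  # False: the ball crosses the boundary at 0
```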
<br />
== Introduction ==<br />
Conventional deep learning models are generally highly sensitive to adversarial perturbations (Szegedy et al., 2013): natural-looking but minutely altered images can manipulate these models into misclassifications. While several defenses have been built in the last few years to protect neural networks against such attacks (Madry et al., 2017; Shafahi et al., 2019), defenses based on heuristics and tricks are often easily broken (Athalye et al., 2018). This has motivated many researchers to work on certifiably secure networks: classifiers that produce a label for an image and, at the same time, a guarantee that the input has not been adversarially manipulated. Most of the certified defenses created so far focus on deflecting <math>l_\text{p}</math>-bounded attacks where <math>p = 2</math> or <math>p = \infty</math>.<br />
<br />
In this paper, the authors demonstrate that a system relying on certificates as a measure of label security can be exploited. The whole idea of the paper is to show that even though a system has a certified defense mechanism, the certificate does not guarantee security against adversarial attacks. This is done by presenting a new class of adversarial examples that target not only the classifier's output label but also its certificate. The first step is to add adversarial perturbations that are large in the <math>l_\text{p}</math>-norm (larger than the radius of the original image's certified region), producing attack images that lie outside the boundary of the original image's certificate while receiving the same (wrong) label. The result is a 'spoofed' certificate with a seemingly strong security guarantee despite the input being adversarially manipulated.<br />
<br />
The following three conditions should be met while creating adversarial examples:<br />
<br />
'''1. Imperceptibility:''' the adversarial image looks like the original example.<br />
<br />
'''2. Misclassification:''' the certified classifier assigns an incorrect label to the adversarial example.<br />
<br />
'''3. Strongly certified:''' the certified classifier provides a strong radius certificate for the adversarial example.<br />
<br />
The main focus of the paper is to attack the certificate of the model. The authors argue that the model can be attacked, no matter how strong the certificate of the model is.<br />
<br />
== Approach ==<br />
The approach used by the authors in this paper is the 'Shadow Attack', a generalization of the well-known PGD attack. The fundamental idea is the same as in the PGD attack: adversarial images are created in order to fool the network into making a wrong prediction. The PGD attack solves the following optimization problem, where <math>L</math> is the classification loss and the constraint bounds the change made to the input image.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left( \theta, x + \delta \right) \tag{1} \label{eq:op}<br />
\end{align}<br />
<br />
\begin{align}<br />
\text{s.t.} \quad \left\| \delta \right\|_{p} \leq \epsilon <br />
\end{align}<br />
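The PGD loop for equation (1) can be sketched as follows. This is an illustrative toy version: a 1-D surrogate loss and a finite-difference gradient stand in for a network and backpropagation, and the projection step enforces the <math>l_\infty</math> constraint:<br />

```python
# Minimal PGD sketch: gradient ascent on the loss, followed by projection back
# onto the ball |delta| <= eps around the clean input.

def loss(z):
    # hypothetical surrogate loss: grows as z moves away from 5.0
    return (z - 5.0) ** 2

def pgd_attack(x, eps, step=0.05, iters=100, h=1e-5):
    delta = 0.0
    for _ in range(iters):
        g = (loss(x + delta + h) - loss(x + delta - h)) / (2 * h)  # numeric grad
        delta += step * g                       # ascent step: maximize the loss
        delta = max(-eps, min(eps, delta))      # project onto |delta| <= eps
    return delta

d = pgd_attack(x=4.0, eps=0.3)
print(round(d, 2))  # -0.3: the perturbation saturates the l_inf constraint
```

In a real attack the loss is the network's classification loss and the gradient comes from backpropagation, but the ascend-then-project structure is the same.<br />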
<br />
Shadow attack on the other hand targets the certificate of the defenses by creating a new 'spoofed' certificate outside the certificate region of the input image. Shadow attack solves the following optimization problem where <math>C</math>, <math>TV</math>, and <math>Dissim</math> are the regularizers.<br />
<br />
\begin{align}<br />
\max_{\delta} L\left(\theta ,x+\delta \right ) - \lambda_{c}C\left (\delta \right )-\lambda_{tv}TV\left ( \delta \right )-\lambda_{s}Dissim\left ( \delta \right ) \tag{2} \label{eq:op1}<br />
\end{align}<br />
<br />
<br />
In equation \eqref{eq:op1}, <math>C</math> corresponds to the color regularizer, which ensures that minimal changes are made to the colors of the input image. <math>TV</math> corresponds to the total variation, or smoothness, penalty, which ensures that the smoothness of the newly created image is maintained. <math>Dissim</math> corresponds to the similarity penalty, which ensures that all the color channels (RGB) are changed equally.<br />
<br />
The perturbations added to the original images are:<br />
<br />
'''1. small'''<br />
<br />
'''2. smooth'''<br />
<br />
'''3. without dramatic color changes'''<br />
<br />
There are two ways to ensure that this channel dissimilarity is zero or very low, and the authors show that both methods are effective. <br />
* 1-channel attack: This strictly enforces <math>\delta_{R,i} \approx \delta_{G,i} \approx \delta_{B,i} \; \forall i </math>, i.e. for each pixel the perturbations of all channels are equal, so a single perturbation <math> \delta_{W \times H} </math> suffices, where the size of the image is <math>3 \times W \times H</math>. In this case, <math>Dissim(\delta)=0 </math>. <br />
<br />
* 3-channel attack: In this kind of attack, the perturbations in different channels of a pixel are not equal; it uses <math> \delta_{3 \times W \times H} </math> with the dissimilarity cost function <math>Dissim(\delta) = || \delta_{R}- \delta_{B}||_p + || \delta_{G}- \delta_{B}||_p +|| \delta_{R}- \delta_{G}||_p </math>.<br />
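The penalties above can be sketched on a tiny perturbation. The 2x2 shape and the choice <math>p = 1</math> are illustrative assumptions, not the paper's exact settings:<br />

```python
# Sketch of the Shadow Attack penalties on a tiny 2x2, 3-channel perturbation.

def tv(channel):
    # anisotropic total variation: sum of absolute differences between neighbours
    h, w = len(channel), len(channel[0])
    t = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h: t += abs(channel[i+1][j] - channel[i][j])
            if j + 1 < w: t += abs(channel[i][j+1] - channel[i][j])
    return t

def dissim(delta):
    # || d_R - d_B ||_1 + || d_G - d_B ||_1 + || d_R - d_G ||_1
    r, g, b = delta
    flat = lambda a, c: sum(abs(a[i][j] - c[i][j])
                            for i in range(len(a)) for j in range(len(a[0])))
    return flat(r, b) + flat(g, b) + flat(r, g)

one_channel = [[0.1, 0.1], [0.2, 0.2]]
delta_1ch = [one_channel, one_channel, one_channel]  # 1-channel attack: R = G = B
print(dissim(delta_1ch))   # 0.0, as claimed for the 1-channel attack
print(tv(one_channel))     # 0.2: two vertical jumps of 0.1 each

delta_3ch = [one_channel, [[0.0, 0.0], [0.0, 0.0]], one_channel]
print(dissim(delta_3ch) > 0)  # True: unequal channels are penalized
```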
<br />
== Ablation Study of the Attack parameters==<br />
In order to determine the required number of SGD steps and the effect of the regularization weights on the individual losses in the cost function, the authors tried different values of these parameters using the first example from each class of the CIFAR-10 validation set. Based on figures 4, 5, and 6, the classification loss <math>L(\delta)</math>, the total variation loss <math>TV(\delta)</math>, and the color regularizer <math>C(\delta)</math> converge to zero within 10 SGD steps. Note that since only the 1-channel attack was used in this part of the experiment, <math>Dissim(\delta)</math> was identically zero. <br />
Figures 6 and 7 show the effect of <math>\lambda_s</math> on the dissimilarity loss and the effect of <math>\lambda_{tv}</math> on the total variation loss, respectively. <br />
<br />
[[File:Ablation.png|500px|center|Image: 500 pixels]]<br />
<br />
== Experiments ==<br />
The authors used two experiments to show that their approach actually breaks certified defenses. Both experiments used the CIFAR-10 and ImageNet datasets.<br />
<br />
=== Attack on Randomized Smoothing ===<br />
Randomized smoothing is an adversarial defense against <math>l_\text{p}</math>-norm bounded attacks, in which the deep neural network is trained on randomly augmented batches of images. Perturbations are made to the original image such that they satisfy the previously defined conditions, and spoofed certificates for an incorrect class are generated from multiple adversarial images.<br />
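Certification by randomized smoothing can be sketched as follows, in the spirit of the standard approach: sample Gaussian noise around the input, take a majority vote of the base classifier, and convert the vote probability into a certified radius. The 1-D base classifier and the absence of a confidence correction on the vote probability are simplifying assumptions:<br />

```python
import random
from statistics import NormalDist

def base_classify(x):
    # hypothetical base classifier: decision boundary at 0
    return 1 if x >= 0 else 0

def smoothed_certificate(x, sigma=0.5, n=20000, seed=0):
    rng = random.Random(seed)
    # majority vote of the base classifier under Gaussian noise
    votes = sum(base_classify(x + rng.gauss(0, sigma)) for _ in range(n))
    label = 1 if votes >= n - votes else 0
    p = min(max(votes, n - votes) / n, 1 - 1.0 / n)  # clamp away from 1.0
    radius = sigma * NormalDist().inv_cdf(p)         # certified radius
    return label, radius

label, radius = smoothed_certificate(2.0)
print(label)        # 1
print(radius > 0)   # True: x = 2 is far from the boundary, so p is near 1
```

A larger noise level sigma yields larger certifiable radii but a less accurate smoothed classifier, which is the trade-off the Shadow Attack exploits when it spoofs large-radius certificates.<br />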
<br />
The following table shows the results of applying the 'Shadow Attack' approach to Randomized Smoothing - <br />
<br />
[[File:ran_smoothing.png|600px|center|Image: 600 pixels]]<br />
<br />
<br />
<div align="center">'''Table 1:''' Certified radii produced by the Randomized Smoothing method for Shadow Attack images<br />
and natural images (larger radii mean a stronger/more confident certificate) </div><br />
<br />
The third and fifth columns correspond to the mean radius of the certified region of the original images and the mean radius of the spoofed certificates of the perturbed images, respectively. The mean certificate radius of the adversarial images was greater than that of the original images, which shows that the 'Shadow Attack' approach succeeded in creating spoofed certificates of greater radius with the wrong label, and hence in breaking the certified defenses.<br />
<br />
=== Attack on CROWN-IBP ===<br />
CROWN-IBP is an adversarial defense against <math>l_\infty</math>-norm bounded attacks. The same approach was applied to the CROWN-IBP defense, and the table below shows the results.<br />
<br />
[[File:crown_ibp.png|500px|center|Image: 500 pixels]]<br />
<div align="center">'''Table 2:''' “Robust error” for natural images, and “attack error” for Shadow Attack images, using the<br />
CIFAR-10 dataset and CROWN-IBP models. Smaller is better. </div><br />
<br />
<br />
The above table shows the robust errors for the CROWN-IBP method and the attack errors for the attack images. The errors in the case of the attack were smaller than the corresponding errors for CROWN-IBP, which suggests that the authors' 'Shadow Attack' approach was also successful in breaking <math>l_\infty</math>-norm certified defenses.<br />
<br />
== Conclusion ==<br />
From the approach used in these experiments, we can conclude that it is possible to produce adversarial examples with 'spoofed' certified robustness by using large-norm perturbations. The perturbations generated are smooth and natural-looking while being large enough in norm to escape the certification regions of state-of-the-art principled defenses. The major takeaway of the paper is that the certificates produced by certifiably robust classifiers are not always good indicators of robustness or accuracy.<br />
== Critiques==<br />
<br />
A noticeable weakness of the defenses and certifications analyzed in this paper is that they are formulated entirely in terms of <math> l_{p} </math> constraints, as in equation \eqref{eq:op}. The top models cannot achieve certifications beyond a disturbance of <math> \epsilon = 0.3 </math> in the <math> l_{2} </math> norm, while disturbances of <math> \epsilon = 4 </math> added to the target input are barely noticeable to the human eye, and images perturbed with <math> \epsilon = 100 </math> are still easily classified by humans as belonging to the same class. As discussed by many authors, human perception of high-dimensional image space goes beyond what the <math> l_{p} </math> norm is capable of capturing. More comprehensive metrics and algorithms, capable of capturing the correlations between pixels of an image and of better conveying to optimization algorithms how humans distinguish the features of an input image, have yet to be proposed. Such a metric would give optimization algorithms better intuition about the subtle variations introduced by adversaries in the input data.<br />
<br />
== References ==<br />
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.<br />
<br />
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.<br />
<br />
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.<br />
<br />
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=44705a fair comparison of graph neural networks for graph classification2020-11-15T22:18:45Z<p>Cfmeaney: /* Background */</p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
<br />
Experimental reproducibility in machine learning has been known to be an issue for some time. Researchers attempting to reproduce the results of older algorithms have come up short, raising concerns that this lack of reproducibility hurts the quality of the field. A lack of open-source AI code has only exacerbated this, leading some to go so far as to say that "AI faces a reproducibility crisis" [1]. It has been argued that the ability to reproduce existing AI code, and to make both existing and new code open source, is a key step in lowering the socio-economic barriers to entry into data science and computing. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigor and are hardly reproducible. The authors attempted to reproduce <br />
the results of such experiments to tackle the problem of ambiguous experimental procedures <br />
and irreproducible results. They also standardized the experimental environment <br />
so that the results can be reproduced when using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models which take graph-structured data as input and capture information about the input graph, such as the relations and interactions between nodes. In graph neural networks, nodes aggregate information from their neighbours. The key idea is to generate representations of nodes that depend on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge or node in the graph. One example of a real application where GNNs are used is social network prediction and recommendation, where the input data is naturally graph-structured.<br />
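The neighbour-aggregation idea can be sketched with scalar node features. This is a generic illustration of one message-passing step, not any specific model from the papers under study:<br />

```python
# Minimal sketch of one message-passing step: each node's new feature combines
# its own feature with the mean of its neighbours' features, which is the core
# GNN aggregation idea (here with fixed 0.5/0.5 weights instead of learned ones).

def message_passing_step(features, adjacency):
    new_features = []
    for i, own in enumerate(features):
        neigh = [features[j] for j in adjacency[i]]
        agg = sum(neigh) / len(neigh) if neigh else 0.0   # aggregate neighbours
        new_features.append(0.5 * own + 0.5 * agg)        # combine with self
    return new_features

# A path graph 0 - 1 - 2 with scalar node features
features = [1.0, 0.0, 1.0]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
print(message_passing_step(features, adjacency))  # [0.5, 0.5, 0.5]
```

Stacking several such steps lets information propagate over longer paths in the graph, which is what graph classification models rely on before pooling node representations into a graph representation.<br />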
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and, as previously mentioned, are comprised of two building blocks: vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can have a direction associated with them, giving a '''directed graph''', or the graph is an '''undirected graph''' if an edge is shared by two vertices with no sense of direction. Vertices and edges of a graph can also carry weights, or indeed any number of features. <br />
<br />
Going one level of abstraction higher, graphs can be categorized by structural patterns; we will refer to these as the types of graphs, and this is not an exhaustive list. A '''Bipartite graph''' is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> such that no two vertices <math> v_i,v_j \in V_k </math> (for <math>k=1,2</math>) share an edge, but there exist <math>v_i \in V_1, v_j \in V_2</math> such that <math>v_i</math> and <math>v_j </math> share an edge. A '''Path graph''' is a graph with <math>|V| \geq 2</math> whose vertices are connected sequentially, meaning each vertex except the first and last has exactly 2 edges: one from the previous vertex and one to the next. A '''Cycle graph''' is similar to a path graph except that every vertex has exactly 2 edges and the vertices are connected in a loop: starting at any vertex and following one edge per node in a single direction eventually leads back to the starting vertex. These are just three examples of graph types; in reality there are many more, and it can be beneficial to connect the structure of one's data to an appropriate graph type.<br />
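The path and cycle types above admit simple degree-based checks (assuming the graph is connected). The function names and edge-list representation are illustrative choices:<br />

```python
# Degree-based sketches of the path and cycle graph types, with undirected
# graphs represented as edge lists over vertices 0..n-1.

def degrees(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def looks_like_path(n, edges):
    # exactly two endpoints of degree 1, everything else degree 2
    return sorted(degrees(n, edges)) == [1, 1] + [2] * (n - 2)

def looks_like_cycle(n, edges):
    # every vertex has degree 2
    return all(d == 2 for d in degrees(n, edges))

path_edges = [(0, 1), (1, 2), (2, 3)]
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(looks_like_path(4, path_edges))    # True
print(looks_like_cycle(4, path_edges))   # False
print(looks_like_cycle(4, cycle_edges))  # True
```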
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyper-parameter<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners who need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, failing to keep these phases well separated can lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assessment'''<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross Validation.<br />
As model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection'''<br />
<br />
The goal of model selection, or hyper-parameter tuning, is to choose, among a set of candidate hyper-parameter<br />
configurations, the one that works best on a specific validation set. It is also important to acknowledge the selection bias in model selection: choosing a model from a pool of candidates makes the selected model's validation accuracy a biased estimate of its test accuracy.<br />
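The separation between model selection (inner split) and risk assessment (outer k-fold) can be sketched as follows. The `evaluate` function is a placeholder standing in for "train with this configuration and return a validation score":<br />

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices once and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def evaluate(config, train_idx, val_idx):
    # placeholder: stands in for "train with `config`, return validation score"
    return random.Random(config * 1000 + len(train_idx)).random()

def assess(n=100, k=10, configs=(1, 2, 3)):
    scores = []
    for fold in kfold_indices(n, k):
        fold_set = set(fold)
        train = [i for i in range(n) if i not in fold_set]
        cut = int(0.9 * len(train))                  # inner 90/10 holdout
        inner_train, inner_val = train[:cut], train[cut:]
        # model selection: pick the config that does best on the inner split
        best = max(configs, key=lambda c: evaluate(c, inner_train, inner_val))
        # model assessment: score the selected config on the held-out test fold
        scores.append(evaluate(best, train, fold))
    return sum(scores) / len(scores)                 # k-fold risk estimate

print(0.0 <= assess() <= 1.0)  # True
```

Note that the test fold never influences which configuration is chosen; mixing the two phases is exactly the source of the optimistic bias the paper warns about.<br />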
<br />
==Reproducibility Issues==<br />
===The GNNs were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Here (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criterion is met or not), and (-) indicates a lack of information (i.e. no details are provided about the criterion).<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In the GNN literature, it is common practice to augment node descriptors with structural<br />
features. Good experimental practice suggests that all models should be consistently compared on<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
for the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets except ENZYMES, they follow Ralaivola et al. (2005) and Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs.<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
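The patience-based early stopping rule in point 4 can be sketched as:<br />

```python
# Early stopping with patience n: stop when n epochs pass without improvement
# on the validation score (higher is better here).

def early_stopping_epoch(val_scores, patience):
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch          # stop: no improvement for `patience` epochs
    return len(val_scores) - 1    # trained to the end

scores = [0.60, 0.70, 0.72, 0.71, 0.71, 0.70, 0.69]
print(early_stopping_epoch(scores, patience=3))  # 5: last improvement at epoch 2
```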
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization Of the Evaluation Framework </div><br />
In order to better understand the Model Selection and Model Assessment sections in the above figure, one can also take a look at the pseudo-code below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
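A grid search simply enumerates the Cartesian product of the candidate values for each hyper-parameter. The parameter names and values below are hypothetical, not the grids actually used in the papers:<br />

```python
from itertools import product

# Expand a hyper-parameter grid into the list of candidate configurations
# that a grid search would train and evaluate one by one.

grid = {
    "learning_rate": [1e-2, 1e-3],
    "hidden_units": [32, 64, 128],
    "dropout": [0.0, 0.5],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 12 = 2 * 3 * 2 candidate configurations
```

This is how grids of 32 to 72 configurations, as mentioned below, arise from just a handful of tuned parameters.<br />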
<br />
===Computational Considerations===<br />
<br />
As their research involved a large number of training-testing cycles, they limited the computational cost as follows:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|900px|Image: 900 pixels]]<br />
<br />
<br />
In the above figure, we can see the comparison between the average test results obtained by the authors of the paper and those reported in the literature. The plots show that the test accuracies computed in this paper differ in most cases from those reported in the literature, and the gap between the two estimates is often substantial.<br />
<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of 5 important graph neural network models on 9 datasets. The reproducibility and replicability problems are very important topics for science in general, and even more so for fast-growing fields like machine learning. The authors proposed a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used for future graph classification papers so that comparisons between proposed methods become clearer. The results of the paper are interesting, as in some cases the baseline methods outperform the other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, it would have been good to discuss in more depth why the baseline models perform better, or why the results across different datasets are not consistent. Should we choose the best GNN based on the type of data? If so, what are the guidelines?<br />
<br />
As is well known in the GNN literature, these models are designed to solve non-Euclidean problems on graph-structured data. Such problems are hard to handle with general deep learning techniques, and there are different types of designed graphs that handle various mechanisms, e.g. heat diffusion mechanisms. In my opinion, a better way forward would be to categorize existing GNN models into spatial and spectral domains and reveal connections among the subcategories in each domain. With the increase in GNN models, further analysis must be done to establish a strong link between the spatial and spectral domains, making the models more interpretable and transparent to applications.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.<br />
<br />
[1] Hutson, M. (2018). Artificial intelligence faces reproducibility crisis. Science, 359(6377), 725–726.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=44703a fair comparison of graph neural networks for graph classification2020-11-15T22:06:05Z<p>Cfmeaney: /* References */</p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
Experimental reproducibility and replicability are critical topics in machine learning.<br />
Authors have often raised concerns about their absence in scientific publications<br />
in an effort to improve the quality of the field. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigor and are hardly reproducible. The authors attempted to reproduce <br />
the results of such experiments to tackle the problem of ambiguous experimental procedures <br />
and irreproducible results. They also standardized the experimental environment <br />
so that the results can be reproduced when using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models which take graph-structured data as input, and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbours. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural.<br />
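To make the aggregation idea concrete, here is a minimal sketch of one round of neighbourhood aggregation in NumPy. This is an illustration only, not any specific published model; the mean-aggregation rule, the toy graph, and the feature dimensions are assumptions made for the example.<br />

```python
import numpy as np

def aggregate_neighbours(adj, features):
    """One round of message passing: each node's new representation
    is the mean of its neighbours' feature vectors (plus its own)."""
    # Add self-loops so a node keeps its own information.
    adj_hat = adj + np.eye(adj.shape[0])
    # Row-normalize so each node averages over its neighbourhood.
    deg = adj_hat.sum(axis=1, keepdims=True)
    return (adj_hat / deg) @ features

# Toy graph: 3 nodes in a path 0-1-2, with 2-dimensional node features.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
feats = np.array([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
out = aggregate_neighbours(adj, feats)  # node 1 averages all three nodes
```

Stacking several such rounds, each followed by a learned linear map and nonlinearity, gives representations that depend on the graph structure, as described above.<br />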
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and as previously mentioned are comprised of two building blocks, vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can also have a direction associated with them lending the name '''directed graph''' or they can be an '''undirected graph''' if an edge is shared by two vertices and there is no sense of direction. Vertices and edges of a graph can also have weights to them or really any amount of features imaginable. <br />
<br />
Now, going one level of abstraction higher, graphs can be categorized by structural patterns; we will refer to these as the types of graphs, and this will not be an exhaustive list. A '''Bipartite graph''' is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> such that no two vertices <math>v_i, v_j \in V_k</math> (for <math>k=1,2</math>) share an edge, while there exist <math>v_i \in V_1, v_j \in V_2</math> such that <math>v_i</math> and <math>v_j</math> share an edge. A '''Path graph''' is a graph with <math>|V| \geq 2</math> whose vertices are connected sequentially, meaning every vertex except the first and last has 2 edges: one coming from the previous vertex and one going to the next. A '''Cycle graph''' is similar to a path graph except that every node has 2 edges and the nodes are connected in a loop: if you start at any vertex and follow one edge of each node in one direction, you will eventually return to the starting node. These are just three examples of graph types; in reality there are many more, and it can be beneficial to connect the structure of one's data to an appropriate graph type.<br />
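As an illustrative sketch (the helper names are ours, not from the paper), path and cycle graphs can be built as adjacency matrices and their defining degree properties checked directly:<br />

```python
import numpy as np

def path_graph(n):
    """Adjacency matrix of an undirected path on n >= 2 vertices."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return adj

def cycle_graph(n):
    """Adjacency matrix of an undirected cycle: a path whose ends are joined."""
    adj = path_graph(n)
    adj[0, n - 1] = adj[n - 1, 0] = 1.0
    return adj

p, c = path_graph(5), cycle_graph(5)
# In a path, only the two endpoints have degree 1; in a cycle, every node has degree 2.
path_degrees = p.sum(axis=0)
cycle_degrees = c.sum(axis=0)
```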
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, failing to keep these phases well separated can lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assessment'''<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross Validation.<br />
As model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection'''<br />
<br />
The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyper-parameter<br />
configurations the one that works best on a specific validation set. It is also important to acknowledge the selection bias that arises when selecting a model, as this makes the validation accuracy of a model selected from a pool of candidates a biased estimate of its test accuracy.<br />
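The interplay between the two phases can be sketched as a nested loop: k-fold cross validation on the outside for risk assessment, with an inner holdout split for model selection inside each fold. This is a minimal illustration, not the paper's code; `score` stands for any hypothetical routine that trains a model with a given configuration and returns its accuracy on a held-out set.<br />

```python
import random

def nested_cv(data, configs, score, k_outer=10, val_frac=0.1):
    """Risk assessment via k-fold CV with an inner holdout for model selection.

    score(config, train, test) trains with `config` on `train` and returns
    accuracy on `test` (a hypothetical stand-in for a real training routine).
    """
    random.shuffle(data)
    folds = [data[i::k_outer] for i in range(k_outer)]
    outer_scores = []
    for i in range(k_outer):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        # Model selection: inner 90%/10% training/validation holdout.
        n_val = max(1, int(len(train) * val_frac))
        tr, val = train[n_val:], train[:n_val]
        best = max(configs, key=lambda c: score(c, tr, val))
        # Model assessment: the selected config is scored on the outer test fold.
        outer_scores.append(score(best, train, test))
    # The mean estimates the performance of the *class* of models, since each
    # fold may select a different "best" configuration.
    return sum(outer_scores) / k_outer

# Toy demo: a score that simply prefers larger config values, so the
# largest config is always selected and the CV estimate equals it.
demo = nested_cv(list(range(100)), configs=[0.1, 0.5, 0.9],
                 score=lambda c, tr, te: c)
```

Note that the test fold never influences which configuration is chosen; mixing the two phases is exactly the source of the over-optimistic estimates discussed above.<br />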
<br />
==Reproducibility Issues==<br />
===The GNNs were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using these criteria, 4 different papers were selected; their assessments on the quality of evaluation and reproducibility are as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criterion is met), and (-) indicates a lack of information (i.e. no details are provided about the criterion).<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
as regards the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets except ENZYMES, they follow Ralaivola et al. (2005) and Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
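Patience-based early stopping as described in point 4 can be sketched as follows. This is a generic illustration, not the authors' implementation; the stream of validation accuracies is made up for the example.<br />

```python
class EarlyStopping:
    """Stop training once `patience` epochs pass with no validation improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score):
        """Report one epoch's validation score; return True when training should stop."""
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
history = [0.60, 0.65, 0.64, 0.66, 0.66, 0.65, 0.64]  # made-up validation accuracies
# Training stops at the first epoch where `patience` epochs have passed
# without improvement over the best score seen so far.
stopped_at = next(i for i, s in enumerate(history) if stopper.step(s))
```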
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization of the Evaluation Framework </div><br />
In order to better understand the Model Selection and Model Assessment sections in the above figure, one can also take a look at the pseudo-code below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
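A grid search simply enumerates the Cartesian product of the candidate value lists for each hyper-parameter. The sketch below is illustrative; the parameter names and grid values are invented, chosen only so that the grid size lands in the 32-72 range mentioned in the next subsection.<br />

```python
from itertools import product

def grid(**param_lists):
    """Yield every hyper-parameter configuration in the Cartesian product."""
    names = sorted(param_lists)
    for values in product(*(param_lists[n] for n in names)):
        yield dict(zip(names, values))

# A made-up grid: 2 * 3 * 2 * 3 = 36 configurations in total.
configs = list(grid(lr=[0.01, 0.001],
                    hidden=[32, 64, 128],
                    weight_decay=[0.0, 1e-4],
                    dropout=[0.0, 0.25, 0.5]))
n_configs = len(configs)
```

Each configuration dictionary would then be passed to the model-selection procedure described earlier.<br />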
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training/testing cycles, they limited the computational cost as follows:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|900px|Image: 900 pixels]]<br />
<br />
<br />
The above figure compares the average test results obtained by the authors of the paper with those reported in the literature. The plots show that the test accuracies calculated in this paper differ in most cases from those reported in the literature, and the gap between the two estimates is often substantial.<br />
<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of several important graph neural network models on 9 datasets. Reproducibility and replicability are very important topics for science in general, and even more so for fast-growing fields like machine learning. The authors proposed a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used in future graph classification papers so that comparisons between proposed methods become clearer. The results of the paper are interesting, as in some cases the baseline methods outperform the other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, it would have been valuable to discuss in more depth why the baseline models perform better, why the results across different datasets are not consistent, and whether we should choose the best GNN based on the type of data; if so, what are the guidelines?<br />
<br />
Also, as is well known in the GNN literature, these models are designed to solve non-Euclidean problems on graph-structured data. Such problems can hardly be handled by general deep learning techniques, and there are different graph designs that handle various mechanisms, e.g. heat-diffusion mechanisms. In my opinion, a better way would be to categorize existing GNN models into spatial and spectral domains and reveal the connections among subcategories in each domain. As the number of GNN models grows, further analysis is needed to establish a strong link between the spatial and spectral domains so that the models become more interpretable and transparent in application.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark<br />
data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.<br />
<br />
- Hutson, M. (2018). Artificial intelligence faces reproducibility crisis. Science, 359(6377), 725–726.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=44691orthogonal gradient descent for continual learning2020-11-15T21:49:45Z<p>Cfmeaney: /* Previous Work */</p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural Networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN on only one label of CIFAR10 at a time results in poor performance on the initially trained labels (catastrophic forgetting), but that same CNN will perform well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real-world applications of machine learning. For example, a medical imaging classifier might classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases, such as local, rare, or newly emerging ones (like COVID-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject which can help the reader get acquainted with it ([4], [9], [10] for example). These previous works on continual learning can be summarized into three broad categories. There are expansion-based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations; one downside of this approach is that the model grows with the number of tasks. There are also regularization-based methods, which constrain weight updates according to some importance measure for previous tasks. Finally, there are the rehearsal (replay) based methods, which artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning; this can be done using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight to OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto an orthogonal basis for the space of previously optimal gradients. Such an orthogonal basis will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, they have also done experiments where they only keep track of the gradient with respect to the ground truth logit (OGD-GTL) or with the logits averaged (OGD-AVE). OGD-ALL keeps track of gradients of dimension <math>N \times C</math>, where <math>N</math> is the size of the previous task and <math>C</math> is the number of classes. OGD-AVE and OGD-GTL only store gradients of dimension <math>N</math>, since the class logits are either averaged or ignored, respectively. To further manage memory, the authors sample from all the gradients of the old task and find that storing 200 gradients is sufficient, with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. <br />
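The basis construction and projection can be sketched in NumPy as follows. This is an illustration of the idea, not the authors' implementation; the gradient vectors are random stand-ins for per-logit gradients of a real network.<br />

```python
import numpy as np

def orthonormal_basis(grads, eps=1e-10):
    """Gram-Schmidt over stored task gradients -> orthonormal basis vectors."""
    basis = []
    for g in grads:
        v = g - sum((g @ b) * b for b in basis)
        norm = np.linalg.norm(v)
        if norm > eps:            # skip vectors already in the span
            basis.append(v / norm)
    return basis

def project_orthogonal(g, basis):
    """Remove from g its components along the stored basis, so a (small) step
    along the result minimally changes predictions on previous tasks."""
    return g - sum((g @ b) * b for b in basis)

rng = np.random.default_rng(0)
old_grads = [rng.standard_normal(8) for _ in range(3)]   # stand-in task-A gradients
basis = orthonormal_basis(old_grads)
g_new = rng.standard_normal(8)                           # stand-in task-B gradient
g_proj = project_orthogonal(g_new, basis)
# g_proj is orthogonal to every stored gradient direction.
```

In practice the parameter update would then be the usual SGD step taken along `g_proj` instead of the raw gradient.<br />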
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, OGD-GTL, and OGD-ALL are compared to SGD, EWC [2] (a regularization method using Fisher information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground-truth "cheat" model which has access to all data throughout training). Three task settings are compared: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task applies a fixed pixel permutation to every MNIST digit. The following figures show classification performance for each task after sequentially training on all the tasks. Thus, if catastrophic forgetting has been solved, the accuracies should be constant across tasks; if not, there should be a significant decrease from task 5 through to task 1.<br />
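Generating such a task sequence is straightforward. The sketch below is illustrative; the tiny `fake_mnist` array is a stand-in for flattened MNIST digits.<br />

```python
import numpy as np

def make_permuted_tasks(images, n_tasks, seed=0):
    """Each task applies one fixed pixel permutation to every image."""
    rng = np.random.default_rng(seed)
    n_pixels = images.shape[1]
    tasks = []
    for _ in range(n_tasks):
        perm = rng.permutation(n_pixels)   # fixed for the whole task
        tasks.append(images[:, perm])
    return tasks

fake_mnist = np.arange(12, dtype=float).reshape(3, 4)  # 3 "images", 4 "pixels"
tasks = make_permuted_tasks(fake_mnist, n_tasks=5)
# Every task contains the same pixel values per image, just reordered.
```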
<br />
[[File:PMNIST.PNG]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. <br />
<br />
[[File:RMNIST.PNG]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. <br />
<br />
[[File:SMNIST.PNG]]<br />
<br />
Overall, OGD performs better than EWC, A-GEM, and SGD. The primary metric to look for is decreasing performance on the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario, shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe as with naive SGD. OGD performs much better than EWC and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small in order to respect the assumption that orthogonal gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients that grows with the number of tasks. The authors mention several directions for future study based on this technique. First, finding a way to store more gradients, or to prioritize the important directions, could improve results. Secondly, all of the proposed methods, including this one, fail when the tasks are dissimilar; finding ways to maintain performance under task dissimilarity would be an interesting research direction. Thirdly, resolving the learning-rate sensitivity would make this method more appealing when large learning rates are desired. Finally, another interesting direction for future work is extending the current method to other optimizers such as Adam and Adagrad.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=44690orthogonal gradient descent for continual learning2020-11-15T21:49:30Z<p>Cfmeaney: /* References */</p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural Networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having capacity to learn them simultaneously. For example, training a CNN to look at only one label of CIFAR10 at a time results in poor performance for the initially trained labels (catastrophic forgetting). But that same CNN will perform really well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real world applications of machine learning. For example, a medical imaging classifier might be able to classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases - like local/rare/or new diseases (like Covid-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject which can help to get acquainted with the subject ([9], [10], [11] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations. One of the downsides of this method is the growing size of the model with increasing number of tasks. There are also regularization based methods, which constraints weight updates according to some importance measure for previous tasks. Finally, there are the repetition based methods. These models attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning. This can be done by using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight to OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto an orthogonal basis for the space of previously optimal gradients. Such an orthogonal basis will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. However, they have also done experiments where they only keep track of the gradient with respect to the ground truth logit (ODG-GTL) and with the logits averaged (OGD-AVE). OGD-ALL keeps track of gradients of dimension N*C where N is the size of the previous task and C is the number of classes. OGD-AVE and OGD-GTL only store gradients of dimension N since the class logits are either averaged or ignored respectively. To further manage memory, the authors sample from all the gradients of the old task, and they find that 200 is sufficient - with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. <br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in two layers and 10 logit outputs. The results of OGD-AVE, ODG-GTL, OGD-ALL are compared to SGD, ECW [2], (a regularization method using Fischer information for importance weights), A-GEM [3] (a state of the art replay technique), and MTL (a ground truth "cheat" model which has access to all data throughout training). Three tasks are compared: permuted MNIST, rotated MNIST, and split MNIT. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task is a fixed permutation that gets applied to each MNIST digit. The following figures show classification performance for each task after sequentially training on all the tasks. Thus, if solved catastrophic forgetting has been solved, the accuracies should be constant across tasks. If not, then there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG]]<br />
<br />
Rotated MNIST is similar except instead of fixed permutation there are fixed rotations. There are five sequential tasks, with MNIST images rotated at 0, 10, 20, 30, and 40 degrees in each task. <br />
<br />
[[File:RMNIST.PNG]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. <br />
<br />
[[File:SMNIST.PNG]]<br />
<br />
Overall OGD performs much better than ECW, A-GEM, and SGD. The primary metric to look for is decreasing performance in the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe a decrease as naively doing SGD. OGD performs much better than ECW and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small, in order to respect the assumption that orthogonal gradients do not effect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future study based on this technique. Finding a way to store more gradients or preauthorize the important directions can result in improved results. Secondly. all of the proposed methods including this method fail when the tasks are dissimilar. Finding ways to maintain performance under tasks dissimilarity can be an interesting research direction. Thirdly, solving for learning rate sensitivity will make this method more appealing when large learning rate are desired. Finally, another interesting future work is extending the current method to other types of optimizers such as Adam and Adagrad.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G. and Orban, G. (2017). Episodic memory for continual model learning. arXiv preprint.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. (2017). Variational continual learning. arXiv preprint.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=44688orthogonal gradient descent for continual learning2020-11-15T21:48:13Z<p>Cfmeaney: /* Previous Work */</p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained on new ones. Most neural networks cannot learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN on only one label of CIFAR-10 at a time results in poor performance on the labels trained first (catastrophic forgetting), yet the same CNN performs very well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real-world applications of machine learning. For example, a medical imaging classifier might classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases, such as local, rare, or newly emerging ones (e.g., COVID-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). Standard SGD takes no care to retain performance on previously learned tasks, which works well when all the data are presented at once and i.i.d. In a continual learning setting, however, where tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. When data are presented simultaneously, the goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and batches <math>(X_1,X_2,\ldots, X_m)</math> can be sampled i.i.d. from this distribution, which is assumed to be fixed during training. In continual learning, this distribution typically shifts over time, which is why SGD fails. OGD accounts for previously learned tasks by maintaining a space of previous gradients, so that incoming gradients can be projected onto the orthogonal complement of that space, minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles which can help the reader get acquainted with the subject ([9], [10], [11] for example). Previous work in continual learning can be grouped into three broad categories. Expansion-based techniques add neurons or modules to an existing model to accommodate incoming tasks while leveraging previously learned representations; a downside is that the model grows with the number of tasks. Regularization-based methods constrain weight updates according to some importance measure for previous tasks. Finally, replay-based (rehearsal) methods artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning; this can be done using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight of OGD is to leverage the overparameterization of neural networks, i.e., the fact that they have more parameters than data points. To learn new tasks without forgetting old ones, OGD projects newly computed gradients onto the orthogonal complement of the space spanned by gradients from previous tasks. Such orthogonal directions exist precisely because neural networks are typically overparameterized. Note that moving along the gradient direction produces the largest change in the loss, whereas moving orthogonal to the previous tasks' gradients produces the least, which effectively prevents the predictions on the previous tasks from changing too much. A <i>small</i> step orthogonal to the gradients of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. The authors also run experiments in which they keep only the gradient with respect to the ground-truth logit (OGD-GTL) or average over the logits (OGD-AVE). OGD-ALL stores N*C gradient vectors, where N is the size of the previous task and C is the number of classes; OGD-AVE and OGD-GTL store only N gradient vectors, since the class logits are averaged or ignored, respectively. To further manage memory, the authors subsample the gradients of the old task, and they find that storing 200 is sufficient, with diminishing returns when using more.<br />
<br />
The orthogonal basis for the span of previously stored gradients can be obtained with a simple Gram-Schmidt procedure (or a more numerically stable equivalent), applied iteratively as new gradients arrive. <br />
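As an illustration, the following NumPy sketch (our own hedged example, not the authors' implementation; the function names are invented here) builds an orthonormal basis with Gram-Schmidt and projects a new task's gradient onto the orthogonal complement of the stored span:<br />

```python
import numpy as np

def orthonormalize(gradients, eps=1e-10):
    # Gram-Schmidt: build an orthonormal basis for the span of stored task gradients.
    basis = []
    for g in gradients:
        v = np.array(g, dtype=float)
        for b in basis:
            v -= np.dot(v, b) * b   # remove components along existing basis vectors
        norm = np.linalg.norm(v)
        if norm > eps:              # drop (near-)linearly-dependent gradients
            basis.append(v / norm)
    return basis

def project_orthogonal(g, basis):
    # Project a new gradient onto the orthogonal complement of the stored span.
    v = np.array(g, dtype=float)
    for b in basis:
        v -= np.dot(v, b) * b
    return v
```

An OGD-style parameter update would then take a step along the projected gradient rather than the raw one, so that the step is (approximately) loss-preserving for the stored tasks.<br />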
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px]]<br />
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with two hidden layers of 100 units each and 10 logit outputs. The results of OGD-AVE, OGD-GTL, and OGD-ALL are compared to SGD, EWC [2] (a regularization method using Fisher information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground-truth "cheat" model which has access to all data throughout training). Three benchmarks are compared: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task applies a fixed pixel permutation to every MNIST digit. The following figures show classification performance on each task after sequentially training on all the tasks. Thus, if catastrophic forgetting has been solved, the accuracies should be constant across tasks; if not, there should be a significant decrease from task 5 back through task 1.<br />
<br />
[[File:PMNIST.PNG]]<br />
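For concreteness, the permuted-MNIST tasks can be generated as in the sketch below (our own illustration, not the paper's code; treating the first task as the identity permutation is an assumption):<br />

```python
import numpy as np

def make_permuted_tasks(images, num_tasks=5, seed=0):
    # Each task applies one fixed pixel permutation to every (flattened) image.
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)
    tasks = [flat]                                  # task 1: identity permutation
    for _ in range(num_tasks - 1):
        perm = rng.permutation(flat.shape[1])       # fixed for the whole task
        tasks.append(flat[:, perm])
    return tasks
```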
<br />
Rotated MNIST is similar, except that the fixed permutations are replaced by fixed rotations. There are five sequential tasks, with the MNIST images rotated by 0, 10, 20, 30, and 40 degrees, respectively. <br />
<br />
[[File:RMNIST.PNG]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. <br />
<br />
[[File:SMNIST.PNG]]<br />
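A minimal sketch of the split-MNIST construction (again our own illustration; the (0,1) through (8,9) pairing is the conventional choice and is assumed here):<br />

```python
import numpy as np

def make_split_tasks(images, labels, pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9))):
    # Each task keeps only the examples whose labels fall in one disjoint pair.
    tasks = []
    for pair in pairs:
        mask = np.isin(labels, pair)
        tasks.append((images[mask], labels[mask]))
    return tasks
```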
<br />
Overall, OGD performs much better than EWC, A-GEM, and SGD. The primary metric to examine is the drop in performance on the earlier tasks. As we can see, MTL, which represents the ideal simultaneous-learning scenario, shows no drop-off across tasks, since all the data from previous tasks are available when training on incoming tasks. OGD shows a decrease, but it is not nearly as severe as that of naive SGD. OGD also performs much better than EWC and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well founded and outperforms competing algorithms. One downside is that the learning rate must be kept very small in order to respect the assumption that steps orthogonal to the stored gradients do not affect the loss. Furthermore, the algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future study based on this technique. First, finding a way to store more gradients, or to prioritize the important directions, could improve results. Second, all of the proposed methods, including this one, fail when the tasks are dissimilar; finding ways to maintain performance under task dissimilarity is an interesting research direction. Third, resolving the learning-rate sensitivity would make this method more appealing when large learning rates are desired. Finally, another interesting line of future work is extending the method to other optimizers such as Adam and Adagrad.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=44679stat940F212020-11-15T21:12:09Z<p>Cfmeaney: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A FAIR COMPARISON OF GRAPH NEURAL NETWORKS FOR GRAPH CLASSIFICATION || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]] ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] ||<br />
|-<br />
</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=44675stat940F212020-11-15T21:06:49Z<p>Cfmeaney: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB Paper] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AUGMIX: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]] ||<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || ||<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || ||<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] ||<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] ||<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || ||<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] ||<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||Attention Is All You Need ||[https://arxiv.org/abs/1706.03762 Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT. ||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || ||<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||Parameter-free, Dynamic, and Strongly-Adaptive Online Learning|| [https://proceedings.icml.cc/static/paper_files/icml/2020/2820-Paper.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| A CLOSER LOOK AT FEW-SHOT CLASSIFICATION || [https://arxiv.org/pdf/1904.04232.pdf Paper] || ||<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || placeholder||<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44177Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T20:40:25Z<p>Cfmeaney: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques and methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data, which makes it necessary to include information from the PDE in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
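<br />
The assembly of this two-part loss can be sketched with a toy example. Here a closed-form trial function (an exact solution of the heat equation <math display="inline"> u_t - \nu u_{xx} = 0 </math>, i.e. <math display="inline"> N[u] = -\nu u_{xx} </math>) stands in for the neural network, with hand-coded derivatives in place of automatic differentiation; the diffusivity and sample points are arbitrary illustrative choices. Since the trial function solves the PDE exactly and the 'measurements' are noiseless, both loss terms vanish.<br />
<br />
```python
import math
import random

# Toy sketch (illustrative values): a closed-form trial function stands in
# for the network u(t, x). It solves the heat equation u_t - nu*u_xx = 0,
# so the PDE residual f should vanish along with the data misfit.
nu = 0.1  # hypothetical diffusion coefficient

def u(t, x):
    return math.exp(-nu * math.pi**2 * t) * math.sin(math.pi * x)

# Hand-coded derivatives; a real PINN obtains these by automatically
# differentiating the network with respect to its inputs.
def u_t(t, x):
    return -nu * math.pi**2 * u(t, x)

def u_xx(t, x):
    return -math.pi**2 * u(t, x)

def f(t, x):
    return u_t(t, x) - nu * u_xx(t, x)  # residual of u_t + N[u] = 0

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
data = [u(t, x) for (t, x) in pts]  # noiseless 'measurements' of the solution

mse_u = sum((u(t, x) - d)**2 for (t, x), d in zip(pts, data)) / len(pts)
mse_f = sum(f(t, x)**2 for (t, x) in pts) / len(pts)
loss = mse_u + mse_f  # the combined training objective
```
<br />
With a trained network the derivatives would instead come from automatic differentiation, and <math display="inline"> MSE_u </math> and <math display="inline"> MSE_f </math> would only be driven toward zero, not exactly zero.<br />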
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples, such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods, a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method; for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
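<br />
As a concrete instance of this general form, the sketch below implements the classical four-stage explicit RK4 tableau (an illustrative stand-in; the paper uses implicit schemes with many more stages) and applies it to the toy operator <math display="inline"> N[u] = \lambda u </math>, for which <math display="inline"> u_t + N[u] = 0 </math> has the exact solution <math display="inline"> u(t) = u(0)e^{-\lambda t} </math>:<br />
<br />
```python
import math

# Classical four-stage explicit Runge-Kutta (RK4) tableau, written in the
# general stage form above.
A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [1/6, 1/3, 1/3, 1/6]

def rk_step(u_n, dt, N):
    """Advance u_t + N[u] = 0 by one step of size dt."""
    stages = []  # the intermediate values u^{n+c_j}
    for i in range(4):
        stages.append(u_n - dt * sum(A[i][j] * N(stages[j]) for j in range(i)))
    return u_n - dt * sum(b[j] * N(stages[j]) for j in range(4))

# Toy operator N[u] = lam*u, so the exact solution is u(0)*exp(-lam*t).
lam, dt, u_num = 2.0, 0.01, 1.0
for _ in range(100):  # integrate to t = 1
    u_num = rk_step(u_num, dt, lambda v: lam * v)
```
<br />
After 100 steps the numerical value agrees with <math display="inline"> e^{-2} </math> to within the fourth-order truncation error.<br />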
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
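<br />
The inversion step can be made concrete with a small sketch. Rearranging the stage equations gives <math display="inline"> u^n = u^{n+c_i} + \Delta t \sum_j a_{ij} N[u^{n+c_j}] </math> for each stage <math display="inline"> i </math>, and similarly <math display="inline"> u^{n+1} = u^{n+c_i} + \Delta t \sum_j (a_{ij} - b_j) N[u^{n+c_j}] </math>, so every stage contributes its own prediction of both snapshots. The sketch below uses an explicit RK4 tableau and the toy operator <math display="inline"> N[u] = \lambda u </math> (both illustrative choices, not the large-<math display="inline">q</math> implicit schemes of the paper) and checks that consistent stage values make all <math display="inline"> q </math> predictions agree:<br />
<br />
```python
import math

# Explicit RK4 tableau and toy operator N[u] = lam*u (illustrative choices).
A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [1/6, 1/3, 1/3, 1/6]
q, dt, lam, u_n = 4, 0.1, 2.0, 1.0
N = lambda v: lam * v

# Forward pass: compute the stage values u^{n+c_j} (in the PINN, these
# are the q outputs of the network at a given x).
stages = []
for i in range(q):
    stages.append(u_n - dt * sum(A[i][j] * N(stages[j]) for j in range(i)))

# Inversion: each stage i yields its own prediction of both snapshots.
pred_n = [stages[i] + dt * sum(A[i][j] * N(stages[j]) for j in range(q))
          for i in range(q)]
pred_n1 = [stages[i] + dt * sum((A[i][j] - b[j]) * N(stages[j]) for j in range(q))
           for i in range(q)]

# With consistent stages all q predictions of u^n coincide with the data;
# SSE_n (and analogously SSE_{n+1}) compares each of them to the measurements.
sse_n = sum((p - u_n)**2 for p in pred_n)
spread_n1 = max(pred_n1) - min(pred_n1)
```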
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, which would rely on assuming the form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
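<br />
A minimal sketch of treating a PDE parameter as trainable: for the toy model <math display="inline"> u_t + \lambda u = 0 </math> with true value <math display="inline"> \lambda = 3 </math> (hypothetical numbers throughout), a closed-form surrogate plays the role of the trained network for <math display="inline"> u </math>, and gradient descent on the PDE-residual loss recovers <math display="inline"> \lambda </math>. In a real PINN this update is taken jointly with the network weights.<br />
<br />
```python
import math

# Toy inverse problem: data generated by u_t + lam*u = 0 with true lam = 3.
# A closed-form surrogate stands in for the trained network u(t); its time
# derivative stands in for automatic differentiation.
lam_true = 3.0
ts = [i / 49 for i in range(50)]
u_vals = [math.exp(-lam_true * t) for t in ts]
u_t_vals = [-lam_true * v for v in u_vals]

# Treat lam as a trainable parameter and descend on the residual loss
# MSE_f = mean((u_t + lam*u)^2).
lam, lr = 0.0, 1.0
for _ in range(500):
    residuals = [ut + lam * v for ut, v in zip(u_t_vals, u_vals)]
    grad = 2.0 * sum(r * v for r, v in zip(residuals, u_vals)) / len(ts)
    lam -= lr * grad
```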
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution for each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. The network, which has 9 layers with 20 neurons per hidden layer, is trained using the procedure outlined above with the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (u u_x + v u_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (u v_x + v v_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
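<br />
The streamfunction parameterization has a consequence worth making explicit: since <math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math>, any velocity field derived from <math display="inline"> \psi </math> automatically satisfies the continuity (mass conservation) equation. The sketch below verifies this with a hypothetical closed-form <math display="inline"> \psi </math> in place of the network, with hand-coded derivatives in place of automatic differentiation:<br />
<br />
```python
import math

# Hypothetical closed-form streamfunction standing in for the network psi.
def psi(x, y):
    return math.sin(x) * math.cos(y)

def u(x, y):                      # u = psi_y
    return -math.sin(x) * math.sin(y)

def v(x, y):                      # v = -psi_x
    return -math.cos(x) * math.cos(y)

def u_x(x, y):
    return -math.cos(x) * math.sin(y)

def v_y(x, y):
    return math.cos(x) * math.sin(y)

# Continuity u_x + v_y = psi_yx - psi_xy vanishes identically, so mass
# conservation holds by construction for any field derived from psi.
pts = [(0.3 * i, 0.2 * j) for i in range(5) for j in range(5)]
max_div = max(abs(u_x(x, y) + v_y(x, y)) for x, y in pts)
```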
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN can achieve in the predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs in a gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINNs. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is freely available on GitHub [4]. It is quite easy to go through and learn from - although unfortunately, it is written in TensorFlow v1.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44173Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T20:19:50Z<p>Cfmeaney: /* References */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from the PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
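The two-part loss is straightforward to compute once the data misfit and the PDE residual are available. Below is a minimal numpy sketch, using a toy heat equation and central finite differences in place of the automatic differentiation a real PINN would use; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def pinn_loss(u_pred, u_data, f_resid):
    # Two-part loss: data misfit (MSE_u) plus mean squared PDE residual (MSE_f).
    return np.mean((u_pred - u_data) ** 2) + np.mean(f_resid ** 2)

# Toy check with the heat equation u_t - u_xx = 0, whose exact solution
# u(t, x) = exp(-t) sin(x) should give a near-zero residual. Central finite
# differences stand in for the automatic differentiation a real PINN would use.
t, x = np.meshgrid(np.linspace(0.0, 1.0, 50), np.linspace(0.0, np.pi, 50),
                   indexing="ij")
u = np.exp(-t) * np.sin(x)
dt, dx = t[1, 0] - t[0, 0], x[0, 1] - x[0, 0]
u_t = np.gradient(u, dt, axis=0)
u_xx = np.gradient(np.gradient(u, dx, axis=1), dx, axis=1)
f = u_t - u_xx                           # the residual f should be ~0 here
loss = pinn_loss(u, u, f[1:-1, 1:-1])    # interior points only
print(loss)  # tiny: only finite-difference error remains
```

For the exact solution the data term vanishes and the residual term reflects only discretization error; a trained PINN drives both terms down simultaneously.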
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world settings, such as when dealing with discrete snapshots or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
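To make the stage equations concrete, the sketch below implements the general form above for an explicit tableau (the classical four-stage RK4 coefficients, chosen here purely for illustration) applied to the simple case <math display="inline"> N[u] = u </math>, i.e. <math display="inline"> u_t + u = 0 </math>:

```python
import numpy as np

def rk_step(u_n, dt, N, A, b):
    # One step of the q-stage scheme written exactly as above:
    #   stage_i = u^n - dt * sum_j a_ij N(stage_j)
    #   u^{n+1} = u^n - dt * sum_j b_j  N(stage_j)
    # For an explicit tableau (a_ij = 0 for j >= i) the stages solve in order.
    q = len(b)
    stages = [0.0] * q
    for i in range(q):
        stages[i] = u_n - dt * sum(A[i][j] * N(stages[j]) for j in range(i))
    return u_n - dt * sum(b[j] * N(stages[j]) for j in range(q))

# Classical four-stage RK4 coefficients (illustrative choice).
A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [1 / 6, 1 / 3, 1 / 3, 1 / 6]

# Check on u_t + u = 0 with u(0) = 1, whose exact solution is exp(-t).
u = 1.0
for _ in range(10):
    u = rk_step(u, 0.1, lambda v: v, A, b)
print(u, np.exp(-1.0))  # agree to better than 1e-6
```

An implicit tableau (nonzero <math display="inline"> a_{ij} </math> for <math display="inline"> j \geq i </math>) would require solving the stage equations simultaneously, which is exactly the role the neural network plays in the discrete-time PINN.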
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta equations will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
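The discrete-time loss is then a direct sum of squared errors over all stages and data points. A minimal numpy sketch, with illustrative array shapes (the sizes and data here are invented for the example):

```python
import numpy as np

def discrete_loss(u_n_pred, u_n_data, u_np1_pred, u_np1_data):
    # u_n_pred has shape (q, N_n): each Runge-Kutta stage's prediction of the
    # first snapshot; u_n_data has shape (N_n,). Likewise for the second snapshot.
    sse_n = np.sum((u_n_pred - u_n_data) ** 2)
    sse_np1 = np.sum((u_np1_pred - u_np1_data) ** 2)
    return sse_n + sse_np1

q, N_n, N_np1 = 4, 5, 6                 # illustrative sizes only
rng = np.random.default_rng(0)
data_n = rng.normal(size=N_n)
data_np1 = rng.normal(size=N_np1)
# If every stage reproduces the measurements exactly, the loss is zero.
loss = discrete_loss(np.tile(data_n, (q, 1)), data_n,
                     np.tile(data_np1, (q, 1)), data_np1)
print(loss)  # 0.0
```

Because every stage contributes its own snapshot prediction, minimizing this loss forces all <math display="inline"> q </math> stage outputs to be mutually consistent with the data.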
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, which would rely on assuming the form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
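A stripped-down version of this idea can be seen on a toy ODE: treat the unknown coefficient as a trainable parameter and descend on the squared PDE residual. The example below is an illustrative sketch only - finite differences stand in for automatic differentiation and there is no network - not the paper's implementation:

```python
import numpy as np

# Toy parameter discovery for u_t + lambda * u = 0: lambda is treated as a
# trainable parameter and updated by gradient descent on the squared PDE
# residual. In a real PINN, lambda would be updated jointly with the weights.
t = np.linspace(0.0, 1.0, 200)
u = np.exp(-2.0 * t)                 # synthetic data; the true lambda is 2
u_t = np.gradient(u, t)              # finite-difference stand-in for autodiff

lam, lr = 0.0, 0.5
for _ in range(200):
    resid = u_t + lam * u            # f = u_t + lambda * u should vanish
    grad = np.mean(2.0 * resid * u)  # d/d(lambda) of mean(resid ** 2)
    lam -= lr * grad
print(lam)  # recovers roughly 2.0
```

Because the residual is linear in <math display="inline"> \lambda </math>, the loss here is convex in the parameter and the descent converges quickly; with a network in the loop, the joint optimization is nonconvex but follows the same pattern.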
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known as a challenging problem to solve using conventional numerical methods because of the shock (discontinuity) that forms after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. The network is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
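The construction of such a training set can be sketched as follows. The `u_exact` values here are placeholder stand-ins - the paper samples the true Burgers' solution, obtained from an accurate numerical solver - but the scattering and the up-to-1% multiplicative perturbation match the setup described above:

```python
import numpy as np

# Sketch of the training-set construction: scatter N_u points over the domain
# and perturb each measurement multiplicatively by up to 1%.
rng = np.random.default_rng(42)
N_u = 2000
t = rng.uniform(0.0, 1.0, N_u)
x = rng.uniform(-1.0, 1.0, N_u)
u_exact = -np.sin(np.pi * x) * np.exp(-t)        # placeholder values only
noise = 1.0 + 0.01 * rng.uniform(-1.0, 1.0, N_u)
u_train = u_exact * noise                        # noisy training measurements
rel_err = np.abs(u_train - u_exact) / np.maximum(np.abs(u_exact), 1e-12)
print(rel_err.max())  # at most about 0.01
```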
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times, showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case, for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
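The claim about machine precision is easy to check: with snapshots at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math>, a single implicit step spans <math display="inline"> \Delta t = 0.8 </math>, and

```python
# The O(dt^(2q)) truncation error with q = 500 stages is vanishingly small
# compared to float64's ~1e-16 relative precision.
import math

dt, q = 0.8, 500
print(math.log10(dt) * 2 * q)  # about -96.9, i.e. error ~ 1e-97
print(dt ** (2 * q) < 1e-16)   # True
```

so a single 500-stage step across the whole interval incurs no meaningful time-discretization error.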
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (u u_x + v u_y) &= -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (u v_x + v v_y) &= -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
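The benefit of the stream-function formulation is that incompressibility (<math display="inline"> u_x + v_y = 0 </math>) is satisfied automatically, since <math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math> for any smooth <math display="inline"> \psi </math>. A quick finite-difference check with an arbitrary test field (an illustrative sketch, not the paper's code):

```python
import numpy as np

# For any smooth psi, u = psi_y and v = -psi_x give u_x + v_y = 0, so any
# velocity field derived from the psi-network is divergence-free by design.
x = np.linspace(0.0, 1.0, 80)
y = np.linspace(0.0, 1.0, 80)
X, Y = np.meshgrid(x, y, indexing="ij")
psi = np.sin(2 * np.pi * X) * np.cos(np.pi * Y)  # arbitrary smooth test field
dx, dy = x[1] - x[0], y[1] - y[0]
u = np.gradient(psi, dy, axis=1)     # u = psi_y
v = -np.gradient(psi, dx, axis=0)    # v = -psi_x
div = np.gradient(u, dx, axis=0) + np.gradient(v, dy, axis=1)
print(np.abs(div[1:-1, 1:-1]).max())  # ~0 up to floating-point roundoff
```

In the PINN, the same cancellation happens through automatic differentiation, so the learned velocity field satisfies mass conservation exactly rather than only approximately.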
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN achieves in its predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure appears in the PDEs only through its gradient, meaning that it is determinable only up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that leverages existing knowledge of physical systems in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINNs. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is freely available on GitHub [4]. It is quite easy to go through and learn from!<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en<br />
<br />
[4] https://github.com/maziarraissi/PINNs</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44172Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T20:19:31Z<p>Cfmeaney: /* Conclusion */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from the PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world settings, such as when dealing with discrete snapshots or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
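<br />
As a small worked instance of this general form, the sketch below implements the one-stage Gauss-Legendre (implicit midpoint) scheme - <math display="inline"> q=1 </math> with <math display="inline"> c_1 = 1/2, a_{11} = 1/2, b_1 = 1 </math> - for the toy problem <math display="inline"> u_t + \lambda u = 0 </math> (assumed values; the stage equation is linear here, so it can be solved directly rather than iteratively).<br />
<br />
```python
import numpy as np

# One-stage Gauss-Legendre (implicit midpoint), q = 1, in the general form
# above: c1 = 1/2, a11 = 1/2, b1 = 1. Toy problem: u_t + N[u] = 0 with
# N[u] = lam * u, so the exact solution is u(t) = exp(-lam * t).
lam, dt, steps = 2.0, 0.1, 10
u = 1.0
for _ in range(steps):
    # Stage equation: u_mid = u - dt * a11 * N[u_mid]. It is linear in
    # u_mid, so it can be rearranged and solved in closed form.
    u_mid = u / (1.0 + lam * dt * 0.5)
    # Update: u_{n+1} = u_n - dt * b1 * N[u_mid]
    u = u - dt * 1.0 * (lam * u_mid)

exact = np.exp(-lam * dt * steps)
# The implicit midpoint rule is second-order accurate, so u tracks exact closely.
```
<br />
For a nonlinear operator <math display="inline"> N </math>, the stage equations would instead be solved iteratively - or, in the PINN setting, the stages are simply produced as outputs of the network.<br />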
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify agreement with the data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme will need to be inverted and solved for the initial and final values as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t^n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t^{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
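<br />
Inverting the Runge-Kutta relations above gives each stage its own snapshot prediction, <math display="inline"> u^n_j(x) = u^{n+c_j}(x) + \Delta t \sum_k a_{jk} N[u^{n+c_k}](x) </math>. Below is a toy sketch of assembling these predictions and <math display="inline"> SSE_n </math> from hypothetical stage outputs (random placeholder arrays, not real network output):<br />
<br />
```python
import numpy as np

# Toy illustration of the stage-wise snapshot predictions and the SSE_n term.
# u_stage[j, i] plays the role of u^{n+c_j}(x_i) (the network's j-th output
# at the i-th spatial point) and N_stage[j, i] the role of N[u^{n+c_j}](x_i).
q, n_pts, dt = 3, 5, 0.1
rng = np.random.default_rng(1)
A = rng.uniform(size=(q, q))              # hypothetical Butcher coefficients a_jk
u_stage = rng.standard_normal((q, n_pts))
N_stage = rng.standard_normal((q, n_pts))
u_data = rng.standard_normal(n_pts)       # measurements at the initial snapshot

# Inverted Runge-Kutta relation: each stage predicts the initial snapshot.
u_n_pred = u_stage + dt * A @ N_stage     # shape (q, n_pts)

# SSE_n sums the squared mismatch over all stages j and all data points i.
sse_n = np.sum((u_n_pred - u_data[None, :]) ** 2)
```
<br />
In the actual method, <math display="inline"> SSE_{n+1} </math> is built analogously from the <math display="inline"> b_j </math> coefficients and the data at the final snapshot.<br />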
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, which would rely on assuming the form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with the fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged, other than that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
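<br />
A minimal sketch of this idea, far simpler than the paper's setup: for the toy model <math display="inline"> u_t + \lambda u = 0 </math>, treat <math display="inline"> \lambda </math> as a trainable parameter and minimize the mean squared PDE residual on noisy samples by gradient descent (finite differences stand in for automatic differentiation):<br />
<br />
```python
import numpy as np

# Toy parameter discovery: true model u_t + lam * u = 0 with lam = 1.5.
# We observe noisy samples of the solution and recover lam by treating it
# as a trainable parameter of the residual loss.
true_lam = 1.5
t = np.linspace(0.0, 2.0, 201)
rng = np.random.default_rng(2)
u = np.exp(-true_lam * t) * (1 + 0.001 * rng.standard_normal(t.size))

u_t = np.gradient(u, t)  # numerical stand-in for differentiating the network
lam = 0.0                # trainable PDE parameter, initialized arbitrarily
lr = 0.5
for _ in range(500):
    resid = u_t + lam * u            # f = u_t + lam * u, ~0 at the optimum
    grad = 2 * np.mean(resid * u)    # d/d(lam) of mean(resid**2)
    lam -= lr * grad
```
<br />
In a full PINN, this gradient step on <math display="inline"> \lambda </math> happens jointly with the updates to the network weights, so the solution and the parameters are discovered together.<br />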
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known to be challenging to solve using conventional numerical methods because of the shock (discontinuity) that forms after sufficiently long time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the form of the equation:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints scattered across the entire spatio-temporal domain (a mere 2% of the available solution data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. The network is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
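<br />
A sketch of how such a training set might be assembled (illustrative only - the placeholder function below is not the true Burgers' solution, which would come from a reference solver):<br />
<br />
```python
import numpy as np

# Assemble 2000 scattered, noisy training points on [0,1] x [-1,1].
rng = np.random.default_rng(0)
n_train = 2000
t_train = rng.uniform(0.0, 1.0, n_train)   # t in [0, 1]
x_train = rng.uniform(-1.0, 1.0, n_train)  # x in [-1, 1]

# Placeholder for the exact solution at the sampled points; a real setup
# would evaluate a high-accuracy numerical solution of Burgers' equation.
u_exact = -np.sin(np.pi * x_train) * np.exp(-t_train)

# Perturb each measurement by up to 1% of its value to emulate the noise.
noise = 0.01 * u_exact * rng.uniform(-1.0, 1.0, n_train)
u_train = u_exact + noise
```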
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution is shown with the datapoints selected for training marked. In the middle panel, a comparison of the exact and predicted solutions is shown for three different times, demonstrating the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both the noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
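<br />
A quick back-of-the-envelope check of this claim (here <math display="inline"> \Delta t = 0.8 </math> is the gap between the two snapshots):<br />
<br />
```python
import math

# Rough illustration of why 500 stages pushes the theoretical truncation
# error far below machine precision: the local error scales like dt**(2q).
dt = 0.9 - 0.1   # the single time step spanning the two snapshots
q = 500
log10_error_scale = 2 * q * math.log10(dt)  # log10 of dt**(2q), about -97
machine_eps = 2.22e-16                      # double-precision machine epsilon
```
<br />
So the theoretical error scale is on the order of <math display="inline"> 10^{-97} </math>, roughly eighty orders of magnitude below double-precision round-off.<br />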
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (u u_x + v u_y) &= -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (u v_x + v v_y) &= -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and using these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
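<br />
The streamfunction construction guarantees that the predicted velocity field satisfies the continuity equation <math display="inline"> u_x + v_y = 0 </math> automatically, since <math display="inline"> \psi_{yx} - \psi_{xy} = 0 </math>. A quick numerical check with an arbitrary smooth <math display="inline"> \psi </math> (finite differences, not part of the paper's code):<br />
<br />
```python
import numpy as np

# If u = psi_y and v = -psi_x, then u_x + v_y = psi_yx - psi_xy = 0, so any
# velocity field derived from a streamfunction is divergence-free. Check it
# numerically with an arbitrary smooth test streamfunction.
x = np.linspace(0.0, 1.0, 101)
y = np.linspace(0.0, 1.0, 101)
X, Y = np.meshgrid(x, y, indexing="ij")
psi = np.sin(2 * np.pi * X) * np.cos(np.pi * Y)  # arbitrary test function

u = np.gradient(psi, y, axis=1)    # u = psi_y
v = -np.gradient(psi, x, axis=0)   # v = -psi_x
div = np.gradient(u, x, axis=0) + np.gradient(v, y, axis=1)
max_div = np.abs(div).max()        # ~0 up to floating-point round-off
```
<br />
This is precisely why the PINN is built on <math display="inline"> \psi </math> rather than on <math display="inline"> u </math> and <math display="inline"> v </math> directly: conservation of mass is baked into the architecture rather than enforced through the loss.<br />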
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN achieves in its predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure only appears in the PDEs through its gradient, meaning that it is only determinable up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks (PINNs), a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The approach allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. This procedure can be carried out for different types of data - most notably for continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work on PINNs. In fact, they have recently patented their method in the United States [3].<br />
<br />
The code used to implement PINNs and generate the figures is all freely available on GitHub [4]. It is quite easy to go through and learn from!<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en</div>Cfmeaney
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44170
Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations
2020-11-14T19:58:30Z<p>Cfmeaney: /* Problem Setup */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques, or methods which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describes the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
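The automatic differentiation [2] used to build <math display="inline"> f(t,x) </math> from the network for <math display="inline"> u(t,x) </math> can be illustrated with a minimal forward-mode sketch using dual numbers (real deep learning frameworks use reverse mode; this toy one-neuron "network" is purely illustrative):<br />
<br />
```python
import math

# Minimal forward-mode automatic differentiation with dual numbers: each
# Dual carries a value and a derivative, and arithmetic propagates both
# exactly (no finite-difference truncation error).
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def tanh(z):
    # Chain rule: d/dx tanh(z) = (1 - tanh(z)^2) * dz/dx
    t = math.tanh(z.val)
    return Dual(t, (1 - t * t) * z.dot)

# A tiny one-neuron "network": n(x) = tanh(w * x + b), with toy weights.
w, b = 0.7, -0.2
def n(x):
    return tanh(w * x + b)

x = Dual(1.5, 1.0)  # seed dx/dx = 1 to obtain dn/dx
out = n(x)
# out.val holds n(1.5); out.dot holds the exact derivative n'(1.5).
```
<br />
A PINN applies the same mechanism (in reverse mode, over the full network) to obtain <math display="inline"> u_t </math>, <math display="inline"> u_x </math>, and higher derivatives exactly, which are then assembled into the residual <math display="inline"> f(t,x) </math>.<br />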
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change is procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution for each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated below.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (u u_x + v u_y) &= -p_x + \lambda_2 (u_{xx} + u_{yy}), \\<br />
v_t + \lambda_1 (u v_x + v v_y) &= -p_y + \lambda_2 (v_{xx} + v_{yy}),<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Because the flow is incompressible (divergence-free), there exists a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above while additionally approximating the pressure field with a second neural network. With each training batch, the weights of both networks are updated. We compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and use these values as input to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
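The point of introducing <math display="inline"> \psi </math> is that any velocity field derived from a stream function automatically satisfies the incompressibility constraint <math display="inline"> u_x + v_y = 0 </math>, so the network's predicted velocities are divergence-free by construction. A quick numerical sketch, where the particular <math display="inline"> \psi </math> below is an arbitrary illustrative choice and finite differences stand in for automatic differentiation:<br />

```python
import math

def psi(x, y):                     # arbitrary smooth stream function (illustrative choice)
    return math.sin(x) * math.cos(y)

h = 1e-4
def u(x, y):                       # u = psi_y
    return (psi(x, y + h) - psi(x, y - h)) / (2 * h)

def v(x, y):                       # v = -psi_x
    return -(psi(x + h, y) - psi(x - h, y)) / (2 * h)

def div(x, y):                     # u_x + v_y -- vanishes for any stream function
    return (u(x + h, y) - u(x - h, y)) / (2 * h) + \
           (v(x, y + h) - v(x, y - h)) / (2 * h)

assert abs(div(0.3, 0.7)) < 1e-5   # divergence-free up to discretization error
```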
<br />
We allow ourselves 1% of the total data and optimize the network as before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN achieves in its predictions of the full solution, parameter values, and pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise: the pressure appears in the PDEs only through its gradient, so it is determinable only up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
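That the pressure is recoverable only up to a constant follows directly from the fact that only <math display="inline"> p_x </math> and <math display="inline"> p_y </math> enter the equations: shifting <math display="inline"> p </math> by any constant leaves the loss unchanged. A minimal check, where the pressure field below is a hypothetical example:<br />

```python
import math

def p(x, y):                        # some hypothetical pressure field
    return x * y + math.sin(x)

c = 3.7                             # arbitrary additive constant
h = 1e-4

def grad(f, x, y):                  # central-difference gradient (f_x, f_y)
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

g1 = grad(p, 0.2, 0.5)
g2 = grad(lambda x, y: p(x, y) + c, 0.2, 0.5)

# identical gradients: p and p + c are indistinguishable to the PDE loss
assert all(abs(a - b) < 1e-9 for a, b in zip(g1, g2))
```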
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks (PINNs), a novel type of neural network function approximator that leverages existing knowledge about a physical system in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The method allows for prediction of the full solution, accommodates noisy measurements, estimates model parameters appearing in the PDE, and approximates auxiliary functions appearing in the PDE. The procedure can be carried out for different types of data - most notably continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by the same group have received many citations for their work on PINNs. In fact, the authors have recently patented the method in the United States [3].<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).<br />
<br />
[3] https://patents.google.com/patent/US20200293594A1/en</div>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> consists of all of the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from the PDE) is incorporated into the neural network. This effectively regularizes the optimization, allowing the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below, including figures 1 and 2.<br />
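To make the two-part loss concrete, here is a minimal, self-contained sketch in pure Python. It uses a tiny forward-mode automatic-differentiation class (dual numbers) in place of a deep-learning framework, and a hypothetical closed-form `u_hat` standing in for the trained network; the toy PDE is u_t + u = 0, so the residual is f = u_t + u.

```python
import math

class Dual:
    """Dual number a + b*eps (eps**2 = 0); 'dot' carries a directional derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def dexp(d):
    return Dual(math.exp(d.val), math.exp(d.val) * d.dot)

def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def u_hat(t, x):
    # hypothetical stand-in for the trained network: u(t,x) = exp(-t) sin(x),
    # which happens to solve the toy PDE u_t + u = 0 exactly
    return dexp(-1 * t) * dsin(x)

def f_hat(t, x):
    # residual f = u_t + u, with u_t obtained by seeding the t-direction
    d = u_hat(Dual(t, 1.0), Dual(x, 0.0))
    return d.dot + d.val

def pinn_loss(data):
    # MSE_u (data misfit) + MSE_f (PDE residual), both over the known points
    n = len(data)
    mse_u = sum((u_hat(Dual(t), Dual(x)).val - u) ** 2 for t, x, u in data) / n
    mse_f = sum(f_hat(t, x) ** 2 for t, x, _ in data) / n
    return mse_u + mse_f

data = [(t, x, math.exp(-t) * math.sin(x)) for t, x in [(0.1, 0.3), (0.5, -0.2)]]
loss = pinn_loss(data)  # vanishes: u_hat fits the data and satisfies the PDE
```

Here `Dual`, `u_hat`, and `f_hat` are illustrative stand-ins; in the paper the derivatives are taken through a deep network using a framework's automatic differentiation, but the structure of the loss is the same.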
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world settings, such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods, a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method; for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
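This family includes the familiar classical fourth-order Runge-Kutta method as a four-stage explicit member. As a sanity check on the notation above (a sketch only, not the paper's high-order implicit scheme), the following pure-Python snippet advances the toy operator N[u] = 2u, so that u_t = -2u, and reproduces the exponential decay:

```python
import math

# Butcher tableau of classical RK4, in the u_t = -N[u] notation used above
A = [[0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [1/6, 1/3, 1/3, 1/6]
q = len(b)

def rk_step(u_n, dt, N):
    """One step u^{n+1} = u^n - dt * sum_j b_j N[u^{n+c_j}] (explicit case)."""
    stages = []
    for i in range(q):
        # u^{n+c_i} = u^n - dt * sum_j a_ij N[u^{n+c_j}]; explicit, so j < i
        stages.append(u_n - dt * sum(A[i][j] * N(stages[j]) for j in range(i)))
    return u_n - dt * sum(b[j] * N(stages[j]) for j in range(q))

N = lambda u: 2.0 * u     # toy operator, so the ODE is u_t = -2u
u, dt = 1.0, 0.01
for _ in range(100):      # integrate from t = 0 to t = 1
    u = rk_step(u, dt, N)
# u is now very close to exp(-2)
```

The discrete-time PINN uses this same general stage structure, but with an implicit tableau and a very large number of stages.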
<br />
In the continuous-time case, we approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>; our neural network approximation for <math display="inline"> u(t,x) </math> therefore had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which takes only <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts: one quantifying agreement with the data at the initial data snapshot and one quantifying agreement with the data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme must be inverted and solved for the initial and final values as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t^n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t^{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
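The inversion mentioned above can be written out explicitly: rearranging each stage equation gives, for stage j, the predictions u^n_j = u^{n+c_j} + Δt Σ_i a_{ji} N[u^{n+c_i}] and u^{n+1}_j = u^{n+c_j} + Δt Σ_i (a_{ji} - b_i) N[u^{n+c_i}]. A small pure-Python sketch of this bookkeeping follows, where `u_stage` is a hypothetical stand-in for the network's q outputs at one spatial point:

```python
def snapshot_predictions(u_stage, N, dt, A, b):
    """Invert the Runge-Kutta stage equations.

    u_stage: the q stage values u^{n+c_j} at one spatial point (the network's
             outputs); A, b: the Butcher tableau. Returns q predictions each
             for u^n and u^{n+1}, one per stage.
    """
    q = len(u_stage)
    Nu = [N(u_stage[i]) for i in range(q)]
    u_n = [u_stage[j] + dt * sum(A[j][i] * Nu[i] for i in range(q))
           for j in range(q)]
    u_np1 = [u_stage[j] + dt * sum((A[j][i] - b[i]) * Nu[i] for i in range(q))
             for j in range(q)]
    return u_n, u_np1

def sse(preds, targets):
    """SSE over all stages and data points: preds[point][stage] vs targets[point]."""
    return sum((p - t) ** 2 for ps, t in zip(preds, targets) for p in ps)

# toy check with forward Euler (q = 1, A = [[0]], b = [1]) and N[u] = u:
u_n, u_np1 = snapshot_predictions([1.0], lambda u: u, 0.1, [[0.0]], [1.0])
# the single stage sits at c_1 = 0, so u_n recovers the stage value itself,
# and u_np1 is the Euler update u - dt * N[u]
```

The discrete-time loss SSE_n + SSE_{n+1} then compares all q per-stage predictions against the measured snapshots, exactly as in the sums above.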
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, relying on an assumed form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with the fitted parameters. With PINNs, however, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows the fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged except that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
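The idea of treating a PDE parameter as trainable can be seen in miniature on the toy model u_t + λu = 0. In this sketch (an illustration only, with exact derivatives standing in for the network's automatic differentiation), data generated with λ = 2 are fitted by gradient descent on the mean squared residual, with λ as the sole trainable parameter:

```python
import math

# synthetic data from u(t) = exp(-2t); the true parameter is lambda = 2
ts = [0.05 * i for i in range(40)]
us = [math.exp(-2.0 * t) for t in ts]
uts = [-2.0 * math.exp(-2.0 * t) for t in ts]   # u_t, known exactly here

# treat lambda as a trainable scalar in the residual f = u_t + lambda * u
lam, lr = 0.0, 0.5
for _ in range(200):
    # gradient of mean(f**2) with respect to lambda is 2 * mean(f * u)
    grad = 2.0 * sum((ut + lam * u) * u for u, ut in zip(us, uts)) / len(us)
    lam -= lr * grad
# lam has converged to the true value 2.0
```

In the actual method, u and u_t come from the network and its automatic differentiation, and λ is updated by the same optimizer that updates the network weights.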
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is a challenging problem to solve using conventional numerical methods because of the shock (discontinuity) that forms after sufficiently large time. Using PINNs, however, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation; we only know the form of the equation:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints scattered across the entire spatio-temporal domain (a mere 2.0% of the available solution data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value, making the dataset noisy. The network is trained using the procedure outlined above, with 9 layers, 20 neurons per hidden layer, and the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution is shown with the datapoints selected for training marked. In the middle panel, the exact and predicted solutions are compared at three different times, showing the accuracy of the PINN prediction. In the bottom panel, the exact and predicted parameter values are compared; the parameter predictions for the noiseless data case are also included for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both the noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>) is far below machine precision.<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math>. In the middle panel, the exact and predicted solutions are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network predicts the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Because the flow is incompressible (so <math display="inline"> u_x + v_y = 0 </math>), we can assume that there is a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above, with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and use these values as input to our loss function. The full loss function is defined as in the continuous case, but note that the term quantifying satisfaction of the PDEs will depend on the pressure network.<br />
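The stream-function trick can be checked numerically: for any smooth ψ, the velocities u = ψ_y and v = -ψ_x automatically satisfy the incompressibility constraint u_x + v_y = 0. A quick sketch with a hypothetical ψ(x, y) = sin(x)cos(y) and central finite differences:

```python
import math

psi = lambda x, y: math.sin(x) * math.cos(y)   # hypothetical stream function
h = 1e-4                                        # finite-difference step

def velocity(x, y):
    """u = psi_y and v = -psi_x via central differences."""
    u = (psi(x, y + h) - psi(x, y - h)) / (2 * h)
    v = -(psi(x + h, y) - psi(x - h, y)) / (2 * h)
    return u, v

def divergence(x, y):
    """u_x + v_y, which vanishes for any stream-function velocity field."""
    ux = (velocity(x + h, y)[0] - velocity(x - h, y)[0]) / (2 * h)
    vy = (velocity(x, y + h)[1] - velocity(x, y - h)[1]) / (2 * h)
    return ux + vy

div = divergence(0.3, 0.7)   # zero up to floating-point noise
```

In the paper this constraint holds by construction: the network outputs ψ and the velocities are obtained by automatic differentiation, so continuity never needs to appear in the loss.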
<br />
We allow ourselves 1% of the total data and optimize the network as before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN achieves in predicting the full solution, the parameter values, and the pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise, as the pressure appears in the PDEs only through its gradient, meaning that it is determinable only up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of a physical system in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The approach allows for prediction of the full solution, accommodation of noise in the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. The procedure can be carried out for different types of data, most notably continuous-time and discrete-time data, both of which are common in real-world applications.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations for their work with PINN. In fact, they have recently patented their method in the United States [3].<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent of the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivates of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change is procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
Assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution for each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but now only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at time <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>) is far below machine precision.<br />
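The machine-precision claim can be checked directly: with <math display="inline"> \Delta t = 0.9 - 0.1 = 0.8 </math> and <math display="inline"> q = 500 </math>, the truncation-error order <math display="inline"> \Delta t^{2q} </math> is roughly <math display="inline"> 10^{-97} </math>, vastly smaller than double-precision machine epsilon:<br />

```python
import sys

dt = 0.9 - 0.1          # gap between the two data snapshots
q = 500                 # number of Runge-Kutta stages
trunc = dt ** (2 * q)   # order of the truncation error, O(dt^(2q))
eps = sys.float_info.epsilon  # double-precision machine epsilon, ~2.22e-16

print(trunc < eps)  # True: ~1e-97, far below machine precision
```
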
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math> marked. In the middle panel, the exact and predicted solutions are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network predicts the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) &= -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) &= -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>; writing the velocity in this form automatically enforces the incompressibility condition <math display="inline"> u_x + v_y = 0 </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain. We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above, while also approximating the pressure field with a second neural network. With each training batch, the weights of both networks are updated. We compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math> and use these values as inputs to our loss function. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs now depends on the pressure network.<br />
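The stream-function form is useful because it builds mass conservation into the model: any velocity field obtained as <math display="inline"> u = \psi_y </math>, <math display="inline"> v = -\psi_x </math> is automatically divergence-free. The sketch below illustrates this numerically for an arbitrary smooth <math display="inline"> \psi </math> (chosen purely for illustration), with central differences standing in for the automatic differentiation used in the actual method:<br />

```python
import math

# An arbitrary smooth stream function, chosen purely for illustration.
def psi(x, y):
    return math.sin(x) * math.cos(y) + 0.5 * x * y

h = 1e-4  # central-difference step (the real method uses autodiff)

def u(x, y):  # u = psi_y
    return (psi(x, y + h) - psi(x, y - h)) / (2 * h)

def v(x, y):  # v = -psi_x
    return -(psi(x + h, y) - psi(x - h, y)) / (2 * h)

def divergence(x, y):  # u_x + v_y, which should vanish identically
    u_x = (u(x + h, y) - u(x - h, y)) / (2 * h)
    v_y = (v(x, y + h) - v(x, y - h)) / (2 * h)
    return u_x + v_y

# Divergence is (numerically) zero at any point, for any choice of psi:
print(abs(divergence(0.3, -1.2)) < 1e-6)  # True
```
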
<br />
We allow ourselves 1% of the total data and optimize the networks as we did before. The network has 9 layers with 20 neurons per hidden layer. The results of this optimization can be seen in figure 4. Notice again the remarkable accuracy that the PINN achieves in predicting the full solution, the parameter values, and the pressure field. Interestingly, the predicted pressure field is off by an additive constant. This is not a surprise: the pressure appears in the PDEs only through its gradient, so it is determinable only up to an additive constant. Nonetheless, the PINN is able to predict its gradient with high accuracy.<br />
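The gradient argument is easy to verify: two pressure fields differing by a constant produce identical <math display="inline"> p_x </math> and <math display="inline"> p_y </math>, so the momentum equations cannot distinguish between them. A small sketch with a hypothetical pressure field:<br />

```python
import math

# A hypothetical pressure field and the same field shifted by a constant.
def p(x, y):
    return x * y + 0.3 * x

def p_shift(x, y):
    return p(x, y) + 7.0

h = 1e-6
def grad(f, x, y):
    # central-difference gradient (p_x, p_y)
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

gx1, gy1 = grad(p, 0.4, 1.1)
gx2, gy2 = grad(p_shift, 0.4, 1.1)

# The gradients agree, so the PDEs cannot pin down the additive constant.
print(math.isclose(gx1, gx2, abs_tol=1e-8) and math.isclose(gy1, gy2, abs_tol=1e-8))
```
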
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks (PINNs), a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The method allows for prediction of the full solution, incorporation of noise in the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.</div>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent of the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivates of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from the PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing the network to learn from fewer datapoints than would otherwise be necessary. This method is demonstrated in the continuous-time example below, including figures 1 and 2.<br />
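As an illustration of how such a two-part loss is assembled, the following sketch uses a minimal forward-mode automatic-differentiation class to differentiate a stand-in "network" with respect to its inputs and evaluate <math display="inline"> MSE_u + MSE_f </math>. Everything here is hypothetical: the one-neuron tanh "network", its fixed weights, the sample data points, and the choice of PDE (linear advection, <math display="inline"> N[u] = c\,u_x </math>) are assumptions made for the sketch, not the authors' setup.<br />

```python
import math

class Dual:
    """Minimal forward-mode dual number: a value plus one directional derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = self._lift(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__
    def tanh(self):
        v = math.tanh(self.val)
        return Dual(v, (1.0 - v * v) * self.dot)

# Hypothetical one-neuron stand-in for the deep network u(t, x).
W_T, W_X, B = 0.7, -1.3, 0.2
def u_net(t, x):                 # t, x are Dual numbers
    return (W_T * t + W_X * x + B).tanh()

def u_val(t, x): return u_net(Dual(t), Dual(x)).val
def u_t(t, x):   return u_net(Dual(t, 1.0), Dual(x)).dot   # seed d/dt
def u_x(t, x):   return u_net(Dual(t), Dual(x, 1.0)).dot   # seed d/dx

c = 1.0   # assumed PDE: u_t + c*u_x = 0, so the residual is f = u_t + c*u_x
data = [(0.1, 0.3, 0.05), (0.4, -0.2, 0.41), (0.8, 0.5, -0.12)]  # hypothetical (t, x, u) measurements

mse_u = sum((u_val(t, x) - u) ** 2 for t, x, u in data) / len(data)   # data misfit
mse_f = sum((u_t(t, x) + c * u_x(t, x)) ** 2 for t, x, _ in data) / len(data)  # PDE residual
loss = mse_u + mse_f             # the combined training loss
```

In a real PINN the network weights and the loss would live in a framework with reverse-mode autodiff; the dual-number class only shows how <math display="inline"> f </math> is obtained by differentiating <math display="inline"> u </math> with respect to its inputs rather than its weights.<br />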
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather present only at two particular times. This is known as the discrete-time case and occurs frequently in real-world settings, such as when dealing with discrete pictures or medical images with no data between them. This case can be handled in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we leverage Runge-Kutta methods - a family of techniques for numerically solving differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
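To make the tableau notation concrete, the sketch below implements one explicit step of the scheme above in this summary's sign convention (<math display="inline"> u_t + N[u] = 0 </math>, so <math display="inline"> u' = -N[u] </math>), using the classic four-stage RK4 coefficients as the <math display="inline"> a_{ij} </math>, <math display="inline"> b_j </math>, <math display="inline"> c_j </math>. The coefficient choice and the test operator <math display="inline"> N[u] = u </math> are illustrative assumptions; the implicit schemes with hundreds of stages used later in the paper would require solving the stage equations rather than evaluating them in order.<br />

```python
import math

# Classic four-stage (RK4) coefficients written in the a_ij / b_j / c_j form above.
RK_A = [[0.0, 0.0, 0.0, 0.0],
        [0.5, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]]
RK_B = [1.0 / 6.0, 1.0 / 3.0, 1.0 / 3.0, 1.0 / 6.0]
RK_C = [0.0, 0.5, 0.5, 1.0]   # c_i = sum_j a_ij; needed only if N depends on t explicitly

def rk_step(N, u_n, dt):
    """One explicit step of u_t + N[u] = 0 (i.e. u' = -N(u)) using the tableau above."""
    stages = []
    for i in range(4):
        # u^{n+c_i} = u^n - dt * sum_j a_ij * N[u^{n+c_j}]   (explicit scheme: j < i)
        stages.append(u_n - dt * sum(RK_A[i][j] * N(stages[j]) for j in range(i)))
    # u^{n+1} = u^n - dt * sum_j b_j * N[u^{n+c_j}]
    return u_n - dt * sum(RK_B[j] * N(stages[j]) for j in range(4))

# Illustrative test problem: N[u] = u, so u_t + u = 0 with u(0) = 1 has solution e^{-t}.
u, dt = 1.0, 0.01
for _ in range(100):           # integrate from t = 0 to t = 1
    u = rk_step(lambda v: v, u, dt)
```

After the loop, `u` matches <math display="inline"> e^{-1} </math> to roughly the scheme's <math display="inline"> O(\Delta t^4) </math> global accuracy.<br />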
<br />
In the continuous-time case, we approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>; our neural network approximation for <math display="inline"> u(t,x) </math> therefore had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which takes only <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. The PINN that we create here therefore has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one quantifying agreement with the data at the initial snapshot and one quantifying agreement with the data at the final snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme will need to be inverted and solved for the initial and final values as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
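The snapshot losses above are plain sums of squares over all stages and all measurement points; a minimal sketch with hypothetical toy numbers (two stages, two measurement points per snapshot):<br />

```python
def sse_snapshot(stage_preds, data):
    """Sum of squared errors of every stage's prediction against the measured snapshot."""
    return sum((p - d) ** 2 for preds in stage_preds for p, d in zip(preds, data))

# Hypothetical toy values: q = 2 stage predictions u^n_j(x^{n,i}) at N_n = 2 points.
preds_n  = [[1.0, 2.0], [1.5, 2.5]]
data_n   = [1.0, 2.0]                  # measured u^{n,i}
preds_n1 = [[0.9, 1.1], [1.0, 1.0]]
data_n1  = [1.0, 1.0]                  # measured u^{n+1,i}

# SSE = SSE_n + SSE_{n+1}: here 0.5 + 0.02 = 0.52
sse = sse_snapshot(preds_n, data_n) + sse_snapshot(preds_n1, data_n1)
```

Note that every stage contributes its own error term, so a stage that disagrees with the data is penalized even if the other stages fit well.<br />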
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, relying on an assumed form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with the fitted parameters. With PINNs, however, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure allows our parameter fitting not simply to identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
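As a minimal illustration of treating a PDE parameter as trainable, the sketch below drops the neural network entirely and fits the single parameter <math display="inline"> \lambda </math> of the hypothetical model <math display="inline"> u_t + \lambda u = 0 </math> by gradient descent on the residual loss <math display="inline"> MSE_f(\lambda) </math>. The synthetic data, the true value <math display="inline"> \lambda = 2 </math>, the learning rate, and the finite-difference estimate of <math display="inline"> u_t </math> (standing in for automatic differentiation of a trained network) are all assumptions made for the sketch.<br />

```python
import math

# Synthetic data from u_t + lam*u = 0 with true lam = 2.0, so u(t) = exp(-2 t).
ts = [0.02 * k for k in range(1, 100)]
u  = [math.exp(-2.0 * t) for t in ts]
h  = 1e-3   # step for the central-difference estimate of u_t
du = [(math.exp(-2.0 * (t + h)) - math.exp(-2.0 * (t - h))) / (2.0 * h) for t in ts]

lam, lr = 0.0, 2.0             # initial guess and (assumed) learning rate
for _ in range(200):
    # d(MSE_f)/d(lam) = (2/N) * sum (u_t + lam*u) * u
    grad = 2.0 * sum((d + lam * ui) * ui for d, ui in zip(du, u)) / len(u)
    lam -= lr * grad
# lam now approximates the true parameter value 2.0
```

In the full method the same gradient step that updates the network weights also updates <math display="inline"> \vec{\lambda} </math>; this closed-form toy only isolates the parameter-update part of that procedure.<br />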
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is challenging to solve with conventional numerical methods because of the shock (discontinuity) that forms after sufficiently large time. With PINNs, however, this shock is handled easily.<br />
<br />
Assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (a mere 2.0% of the available solution data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value, making the dataset noisy. The network is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution is shown with the datapoints selected for training marked. In the middle panel, the exact and predicted solutions are compared at three different times, showing the accuracy of the PINN prediction. In the bottom panel, the exact and predicted parameter values are compared. Also included in this bottom panel are the parameter predictions for the noiseless data case, for comparison. Notice the remarkable accuracy with which the PINN is able to predict the full solution and the correct parameter values in both the noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown. <br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but allow ourselves data at only two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (u u_x + v u_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (u v_x + v v_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
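A quick sanity check of the stream-function construction: because <math display="inline"> u = \psi_y </math> and <math display="inline"> v = -\psi_x </math>, the continuity equation <math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math> holds automatically, so mass conservation is built into the network for <math display="inline"> \psi </math> by construction. The sketch below verifies this numerically for an arbitrary (hypothetical) smooth <math display="inline"> \psi </math> using central differences in place of network autodiff.<br />

```python
import math

# Hypothetical smooth stream function; any differentiable psi would do.
def psi(x, y):
    return math.sin(x) * math.sin(2.0 * y)

h = 1e-4   # central-difference step

def u(x, y):   # u = psi_y
    return (psi(x, y + h) - psi(x, y - h)) / (2.0 * h)

def v(x, y):   # v = -psi_x
    return -(psi(x + h, y) - psi(x - h, y)) / (2.0 * h)

def divergence(x, y):
    # u_x + v_y = psi_yx - psi_xy, which vanishes identically for smooth psi
    return (u(x + h, y) - u(x - h, y)) / (2.0 * h) + (v(x, y + h) - v(x, y - h)) / (2.0 * h)
```

With symmetric central differences the two mixed derivatives cancel term by term, so the computed divergence is zero up to rounding; a PINN's automatic differentiation achieves the same cancellation exactly, which is the point of parameterizing <math display="inline"> \psi </math> rather than <math display="inline"> (u, v) </math> directly.<br />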
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results can be seen in figure 4.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of a physical system in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. This allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have been widely cited, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivates of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a technique for numerical solutions of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method.. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principle difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would need to be first applied to the dataset which would rely on assuming the form of the PDE. Conventional parameter fitted techniques are often sensitive to noisy data, leading to errors in results generated with these fitted parameters. However, with PINNs, this parameter fitting can be done simultaneously with the training of the neural network. This change is procedure allows our parameter fitting to not simply identify the parameters that best fit the data given the PDE, but rather to find the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of the continuous-time method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve using conventional methods because of the shock (discontinuity) formation after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information form the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution for each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>) is far below machine precision.<br />
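The claim about the truncation error can be checked with a quick back-of-the-envelope computation; this is an asymptotic order estimate only, with the step size and stage count taken from this example.<br />

```python
import numpy as np

# Order-of-magnitude check that a 500-stage Runge-Kutta step spanning
# the gap between the two data snapshots has a truncation error far
# below machine precision. The bound is the asymptotic order
# O(dt**(2q)), not a rigorous error estimate.
dt = 0.9 - 0.1   # single time step between the two snapshots
q = 500          # number of Runge-Kutta stages
order_estimate = dt ** (2 * q)
print(order_estimate)       # roughly 1e-97
print(np.finfo(float).eps)  # machine epsilon, about 2.2e-16
```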
<br />
The results of this example can be seen in figure 3. The top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math>. The middle panel compares the exact and predicted solutions at the two time snapshots. The bottom panel reports the predicted parameter values for noisy and noiseless data. Notice the accuracy with which the network predicts the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Since the flow is incompressible, we can assume that there exists a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>, so that the velocity field is divergence-free by construction. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
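A useful property of this parameterization is that any smooth <math display="inline"> \psi </math> yields a divergence-free velocity field, since <math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math>, so conservation of mass holds automatically. The following Python snippet sanity-checks this numerically; the test function and evaluation point are arbitrary illustrative choices, with finite differences standing in for the network's automatic differentiation.<br />

```python
import numpy as np

# Sanity check: for any smooth psi with u = psi_y and v = -psi_x,
# the divergence u_x + v_y = psi_yx - psi_xy vanishes by equality of
# mixed partial derivatives. psi below is an arbitrary test function.

def psi(x, y):
    return np.sin(2 * x) * np.cos(3 * y)

h = 1e-5
x, y = 0.37, -0.81  # arbitrary test point

# Velocity components from the stream function (central differences).
u = (psi(x, y + h) - psi(x, y - h)) / (2 * h)    # u = psi_y
v = -(psi(x + h, y) - psi(x - h, y)) / (2 * h)   # v = -psi_x

# Divergence u_x + v_y, again by central differences.
u_x = ((psi(x + h, y + h) - psi(x + h, y - h))
       - (psi(x - h, y + h) - psi(x - h, y - h))) / (4 * h * h)
v_y = -((psi(x + h, y + h) - psi(x - h, y + h))
        - (psi(x + h, y - h) - psi(x - h, y - h))) / (4 * h * h)
print(u_x + v_y)  # ~0, up to finite-difference rounding
```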
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as in the continuous case above, except that we also approximate the pressure field with a second neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying satisfaction of the PDEs now depends on the pressure network as well.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as before. The network has 9 layers with 20 neurons per hidden layer. The results can be seen in figure 4.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel class of function-approximating neural networks that utilize existing knowledge of a physical system in order to train on a small amount of data. This is done by incorporating information from a governing PDE model into the loss function. The method allows for prediction of the full solution, accommodation of noise in the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by the same group have received many citations, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.</div>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent of the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivates of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> collects all of the PDE terms on one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. (In the original paper, this residual term is evaluated at a separate set of <math display="inline"> N_f </math> collocation points rather than at the data points.) The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (the PDE) is incorporated into the neural network. This effectively regularizes the optimization, allowing the network to learn from fewer datapoints than would otherwise be necessary. This method is demonstrated in the continuous-time example below, including figures 1 and 2.<br />
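To make the structure of this combined loss concrete, here is a minimal numerical sketch (an illustration only, not the authors' implementation). It uses the simple advection equation <math display="inline"> u_t + c u_x = 0 </math> in place of a general operator, finite differences in place of automatic differentiation of a trained network, and the exact solution <math display="inline"> u = \sin(x - ct) </math> as the stand-in approximation; the grid, noise level, and sample count are arbitrary choices.<br />

```python
import numpy as np

# Toy PINN-style loss for the advection equation u_t + c*u_x = 0 (c = 1).
# The exact solution u(t, x) = sin(x - c*t) should make the residual f small.
c = 1.0
t = np.linspace(0.0, 1.0, 201)
x = np.linspace(0.0, 2.0 * np.pi, 201)
T, X = np.meshgrid(t, x, indexing="ij")

u = np.sin(X - c * T)                # stand-in for the network's approximation
u_t = np.gradient(u, t, axis=0)      # finite-difference substitutes for the
u_x = np.gradient(u, x, axis=1)      # automatic derivatives of the network
f = u_t + c * u_x                    # PDE residual

# Pretend we observed N_u = 100 noisy samples of the solution.
rng = np.random.default_rng(0)
idx = rng.integers(0, u.size, size=100)
data = u.flat[idx] + 0.01 * rng.standard_normal(100)

mse_u = np.mean((u.flat[idx] - data) ** 2)   # data-fit term
mse_f = np.mean(f ** 2)                      # physics (residual) term
mse = mse_u + mse_f                          # combined loss to be minimized
```

In an actual PINN, <math display="inline"> u </math> is a neural network, the derivatives come from automatic differentiation, and the combined loss is minimized over the network weights.<br />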
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather is present only at two particular times. This is known as the discrete-time case, and it occurs frequently in real-world examples, such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we leverage Runge-Kutta methods - a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method - for example, a method in which four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
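As a concrete instance of this general form (an illustrative sketch, not code from the paper), the classical explicit four-stage method can be written directly from its <math display="inline"> a_{ij} </math> and <math display="inline"> b_j </math> coefficients. Here it is applied to the toy problem <math display="inline"> u_t + N[u] = 0 </math> with <math display="inline"> N[u] = u </math>, whose exact solution is <math display="inline"> u(t) = e^{-t} </math>.<br />

```python
import numpy as np

# Classical four-stage (RK4) Butcher coefficients a_ij and b_j.
A = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
b = np.array([1.0, 2.0, 2.0, 1.0]) / 6.0

def N(u):
    return u  # the operator N[u] for the toy problem u_t + u = 0

def rk_step(u_n, dt):
    # Stage values: u^{n+c_i} = u^n - dt * sum_j a_ij * N[u^{n+c_j}]
    stages = np.zeros(4)
    for i in range(4):
        stages[i] = u_n - dt * sum(A[i, j] * N(stages[j]) for j in range(i))
    # Update: u^{n+1} = u^n - dt * sum_j b_j * N[u^{n+c_j}]
    return u_n - dt * sum(b[j] * N(stages[j]) for j in range(4))

u, dt = 1.0, 0.01
for _ in range(100):   # integrate from t = 0 to t = 1
    u = rk_step(u, dt)
# u is now very close to exp(-1)
```

For implicit schemes, where <math display="inline"> a_{ij} </math> is not strictly lower triangular, the stage values must be solved for rather than computed sequentially; the PINN approach sidesteps this by having the network output the stages directly.<br />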
<br />
In the continuous-time case, we approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the initial data snapshot and one to quantify agreement with the data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme is inverted and solved for the initial and final values as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function must incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t^n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t^{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
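The double sum in these loss terms is straightforward to assemble once the per-stage snapshot predictions are available. The sketch below uses random placeholder arrays in place of actual network outputs (a pure illustration of the shapes involved), with hypothetical values <math display="inline"> q = 4 </math>, <math display="inline"> N_n = 199 </math>, and <math display="inline"> N_{n+1} = 201 </math>.<br />

```python
import numpy as np

# Placeholder data: q per-stage predictions of each snapshot (rows) versus
# the observed snapshot values. In a real PINN these predictions come from
# inverting the Runge-Kutta relations applied to the network's stage outputs.
q, N_n, N_n1 = 4, 199, 201
rng = np.random.default_rng(1)
u_n_data = rng.standard_normal(N_n)        # observed u^{n,i}
u_n1_data = rng.standard_normal(N_n1)      # observed u^{n+1,i}
u_n_pred = u_n_data + 0.01 * rng.standard_normal((q, N_n))     # u^n_j(x^{n,i})
u_n1_pred = u_n1_data + 0.01 * rng.standard_normal((q, N_n1))  # u^{n+1}_j(x^{n+1,i})

sse_n = np.sum((u_n_pred - u_n_data) ** 2)    # sum over stages j and points i
sse_n1 = np.sum((u_n1_pred - u_n1_data) ** 2)
sse = sse_n + sse_n1                          # full discrete-time loss
```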
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The principal difference now is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. In conventional modelling, a parameter estimation technique would first need to be applied to the dataset, relying on an assumed form of the PDE. Conventional parameter-fitting techniques are often sensitive to noisy data, leading to errors in results generated with the fitted parameters. With PINNs, however, this parameter fitting can be done simultaneously with the training of the neural network. This change in procedure means that the parameter fitting does not simply identify the parameters that best fit the data given the PDE, but rather finds the parameters which best describe the data while using the PDE as a regularizer. The neural network training procedure is, in essence, unchanged, except that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
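As a minimal sketch of this idea (a hypothetical toy problem, far simpler than the examples in the paper), consider discovering the parameter <math display="inline"> \lambda </math> in <math display="inline"> u_t + \lambda u = 0 </math> from noisy samples of the solution <math display="inline"> u(t) = e^{-2t} </math>, so the true value is <math display="inline"> \lambda = 2 </math>. Because the residual is linear in <math display="inline"> \lambda </math>, minimizing the physics loss over <math display="inline"> \lambda </math> alone has a closed form.<br />

```python
import numpy as np

# Noisy samples of the exact solution u(t) = exp(-2t).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 401)
u = np.exp(-2.0 * t) + 1e-4 * rng.standard_normal(t.size)

# Finite-difference derivative stands in for the network's automatic derivative.
u_t = np.gradient(u, t)

# Minimize MSE_f(lambda) = mean((u_t + lambda*u)^2): setting the derivative
# with respect to lambda to zero gives lambda* = -sum(u_t * u) / sum(u^2).
lam = -np.sum(u_t * u) / np.sum(u ** 2)
# lam recovers a value close to the true parameter, 2
```

In the PINN setting, the same effect is achieved by exposing <math display="inline"> \vec{\lambda} </math> to the optimizer alongside the network weights, rather than solving for it in closed form.<br />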
<br />
== Examples ==<br />
<br />
While many examples are given in the paper, three particular ones are detailed here to demonstrate the simplicity and utility of the PINN method.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known to be a challenging problem to solve because of the shock (discontinuity) that forms after sufficiently large time. Using PINNs, however, this shockwave is handled easily.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints scattered across the entire spatio-temporal domain (a mere 2.0% of the available solution data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. The network is trained using the procedure outlined above with a deep neural network of 9 layers, 20 neurons per hidden layer, and the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times, showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN predicts the correct parameter values in both the noisy and noiseless cases. Figure 2 shows a comparison of the errors in the predicted parameter values for different amounts of known data and noise.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem (the flow is divergence-free), we can assume that there is a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
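The value of this construction is easy to verify: for any smooth <math display="inline"> \psi </math>, the velocity field <math display="inline"> u = \psi_y </math>, <math display="inline"> v = -\psi_x </math> is automatically divergence-free, since <math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math> by the equality of mixed partials. A quick numerical check with an arbitrary, purely illustrative choice of <math display="inline"> \psi </math>:<br />

```python
import numpy as np

# Arbitrary smooth stream function psi(x, y) = sin(x) * cos(y) (illustrative).
x = np.linspace(0.0, 2.0 * np.pi, 201)
y = np.linspace(0.0, 2.0 * np.pi, 201)
X, Y = np.meshgrid(x, y, indexing="ij")
psi = np.sin(X) * np.cos(Y)

u = np.gradient(psi, y, axis=1)      # u = psi_y
v = -np.gradient(psi, x, axis=0)     # v = -psi_x

# Divergence u_x + v_y vanishes because the discrete difference operators
# along different axes commute, mirroring the equality of mixed partials.
div = np.gradient(u, x, axis=0) + np.gradient(v, y, axis=1)
# max |div| is zero up to floating-point round-off
```

Consequently, any network for <math display="inline"> \psi </math> yields velocity components that conserve mass by construction.<br />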
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as in the continuous case above, with the addition that the pressure field is also approximated by a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. See the results in figure 4.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The approach allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>
<hr />
<div>
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_xx + u_yy) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_xx + v_yy)<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. See the results of this in figure ?.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximator neural network that utilize existing information of physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. It allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations and they are actually patenting this technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Au- tomatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44089Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T16:10:31Z<p>Cfmeaney: /* Continuous-Time Models */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in the example below including figures 1 and 2.<br />
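To make the two-part loss concrete, here is a minimal Python sketch of the combined objective <math display="inline"> MSE = MSE_u + MSE_f </math>. It is illustrative only: the heat equation <math display="inline"> u_t - u_{xx} = 0 </math> stands in for the generic operator <math display="inline"> N </math>, and central finite differences stand in for automatic differentiation.<br />

```python
import math

def combined_loss(u_hat, pde_residual, data_pts, data_vals):
    """MSE = MSE_u + MSE_f for a candidate solution u_hat(t, x)."""
    # MSE_u: misfit against the known (possibly noisy) measurements
    mse_u = sum((u_hat(t, x) - v) ** 2
                for (t, x), v in zip(data_pts, data_vals)) / len(data_pts)
    # MSE_f: how far u_hat is from satisfying the PDE at the same points
    mse_f = sum(pde_residual(u_hat, t, x) ** 2
                for (t, x) in data_pts) / len(data_pts)
    return mse_u + mse_f

def heat_residual(u, t, x, h=1e-3):
    # f = u_t - u_xx, via central differences (a stand-in for autodiff)
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h ** 2
    return u_t - u_xx

# u(t,x) = exp(-t) sin(x) solves the heat equation exactly, so both loss
# terms are (numerically) zero; any other candidate scores worse.
exact = lambda t, x: math.exp(-t) * math.sin(x)
pts = [(0.1, 0.5), (0.3, 1.2), (0.7, 2.0)]
vals = [exact(t, x) for t, x in pts]
loss = combined_loss(exact, heat_residual, pts, vals)
```

An exact solution drives both terms to (numerical) zero, while any other candidate is penalized by the data misfit, the PDE residual, or both.<br />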
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. This case can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods - a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used to predict the end solution is called the number of stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
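As a sanity check on the stage equations, the sketch below (not from the paper) instantiates them with <math display="inline"> q=1 </math> and <math display="inline"> a_{11} = b_1 = c_1 = 1 </math>, which recovers the implicit backward-Euler scheme, and applies it to <math display="inline"> N[u] = \lambda u </math>, whose exact solution is <math display="inline"> e^{-\lambda t} </math>.<br />

```python
import math

def rk_step(u_n, dt, lam):
    # Stage equation: s = u_n - dt * a11 * N[s]; with N[u] = lam * u this
    # can be solved in closed form: s = u_n / (1 + lam * dt).
    s = u_n / (1 + lam * dt)
    # Update rule: u_{n+1} = u_n - dt * b1 * N[s]
    return u_n - dt * lam * s

lam, dt, u = 1.0, 1e-3, 1.0
for _ in range(1000):  # integrate u_t + lam*u = 0 from t=0 to t=1
    u = rk_step(u, dt, lam)
# u is now close to exp(-1), up to the O(dt) error of a one-stage scheme
```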
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta scheme will need to be inverted and solved for the initial and final cases as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
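In code, each snapshot loss is a double sum over the Runge-Kutta stages and the data points at that snapshot; a hypothetical sketch (array names are illustrative):<br />

```python
def snapshot_sse(stage_preds, data):
    # stage_preds[j][i] = prediction u^n_j(x^{n,i}) from RK stage j;
    # data[i] = the measured value u^{n,i} at that point
    return sum((p - d) ** 2
               for row in stage_preds for p, d in zip(row, data))

# Total discrete-time loss: SSE = SSE_n + SSE_{n+1}, e.g.
# loss = snapshot_sse(preds_n, data_n) + snapshot_sse(preds_np1, data_np1)
```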
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged, except that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and should therefore cover the full procedure.<br />
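The following toy sketch (not the authors' implementation) illustrates the discovery idea on <math display="inline"> u_t + \lambda u = 0 </math> with true <math display="inline"> \lambda = 0.7 </math>: the unknown parameter is recovered by minimizing the PDE-residual loss over the observed data, with a simple grid search standing in for gradient-based joint training of the network weights and <math display="inline"> \vec{\lambda} </math>.<br />

```python
import math

TRUE_LAM = 0.7
ts = [0.1 * i for i in range(11)]           # sample times
us = [math.exp(-TRUE_LAM * t) for t in ts]  # 'measured' solution values

def residual_loss(lam, h=1e-5):
    # mean of f^2 with f = u_t + lam*u; u_t is estimated from the
    # underlying solution curve by a central finite difference
    total = 0.0
    for t, u in zip(ts, us):
        u_t = (math.exp(-TRUE_LAM * (t + h))
               - math.exp(-TRUE_LAM * (t - h))) / (2 * h)
        total += (u_t + lam * u) ** 2
    return total / len(ts)

# Grid search over candidate lambdas (a stand-in for joint training)
best_lam = min((0.01 * k for k in range(1, 201)), key=residual_loss)
# best_lam recovers TRUE_LAM = 0.7
```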
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is a challenging problem to solve because of the shock (discontinuity) that forms after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
We also assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (a mere 2% of the full solution dataset). The correct values of the parameters which are used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. The network is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer, optimized using L-BFGS.<br />
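A noisy dataset of this kind can be generated, for instance, by perturbing each exact value by up to 1% of its magnitude; the uniform noise model below is a hypothetical choice, as the paper does not prescribe one at this level of detail.<br />

```python
import random

random.seed(0)  # reproducible illustration

def perturb(value, pct=0.01):
    # add uniform multiplicative noise of up to +/- pct * |value|
    return value * (1.0 + random.uniform(-pct, pct))

noisy = [perturb(v) for v in [0.5, -0.25, 1.0]]
# each noisy value differs from the exact one by at most 1%
```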
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the correct parameter values in both noisy and noiseless cases. In figure 2, the errors in the predicted parameter values are compared for different amounts of known data and noise.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
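This claim is easy to verify numerically: the gap between the two snapshots is <math display="inline"> \Delta t = 0.9 - 0.1 = 0.8 </math>, so the truncation-error scale <math display="inline"> O(\Delta t^{2q}) </math> with <math display="inline"> q=500 </math> is astronomically small.<br />

```python
dt = 0.9 - 0.1        # time gap between the two data snapshots
q = 500               # number of Runge-Kutta stages
trunc_scale = dt ** (2 * q)
# trunc_scale is roughly 1e-97, far below double-precision
# machine epsilon (~2.2e-16), consistent with the estimate above
```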
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
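This stream-function construction automatically enforces conservation of mass: for any smooth <math display="inline"> \psi </math>, the velocity field <math display="inline"> u = \psi_y </math>, <math display="inline"> v = -\psi_x </math> satisfies <math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math>. A quick finite-difference check with an arbitrary, purely illustrative <math display="inline"> \psi </math>:<br />

```python
import math

psi = lambda x, y: math.sin(x) * math.cos(2.0 * y)  # arbitrary smooth psi
h = 1e-3  # finite-difference step

def u(x, y):  # u = psi_y
    return (psi(x, y + h) - psi(x, y - h)) / (2 * h)

def v(x, y):  # v = -psi_x
    return -(psi(x + h, y) - psi(x - h, y)) / (2 * h)

x0, y0 = 0.3, 0.7
div = ((u(x0 + h, y0) - u(x0 - h, y0)) / (2 * h)
       + (v(x0, y0 + h) - v(x0, y0 - h)) / (2 * h))
# div = u_x + v_y vanishes (up to rounding) for ANY choice of psi
```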
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. See the results of this in figure 4.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel class of function-approximating neural networks that utilize existing knowledge of physical systems in order to train using a small amount of data. They do this by incorporating information from a governing PDE model into the loss function. The approach allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by the same group have been widely cited, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44088Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T16:10:07Z<p>Cfmeaney: /* Discrete-Time Models */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is simply all of the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in figure 1.<br />
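As a toy illustration of this two-part loss (a sketch only, not from the paper: it uses the heat equation <math display="inline"> u_t - u_{xx} = 0 </math> as a stand-in PDE, with <math display="inline"> N[u] = -u_{xx} </math>, and finite differences in place of the automatic differentiation a real PINN would use), note that the residual term <math display="inline"> MSE_f </math> vanishes when evaluated on an exact solution:

```python
import numpy as np

# Toy check (not the paper's autograd setup): for the heat equation
# u_t - u_xx = 0, the exact solution u(t,x) = exp(-t) sin(x) drives the
# PDE residual f = u_t - u_xx to (numerically) zero. Finite differences
# stand in for the automatic differentiation a real PINN would use.
def u(t, x):
    return np.exp(-t) * np.sin(x)

def pde_residual(t, x, h=1e-4):
    # central differences approximating u_t and u_xx
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h**2
    return u_t - u_xx

# scattered sample points across the spatio-temporal domain
rng = np.random.default_rng(0)
ts, xs = rng.uniform(0, 1, 50), rng.uniform(-np.pi, np.pi, 50)

mse_f = np.mean(pde_residual(ts, xs) ** 2)  # ~0 for the exact solution
```

In an actual PINN, <math display="inline"> u </math> would be the network's output and the derivatives would be obtained by differentiating the network itself, but the structure of the residual term in the loss is the same.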
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world settings, such as when dealing with discrete pictures or medical images with no data between them. This case can be handled in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we leverage Runge-Kutta methods - a family of techniques for the numerical solution of differential equations. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the value of the function at the full time step. The number of intermediate points used is called the number of stages of the Runge-Kutta method - for example, a method where four intermediate values are approximated is called a four-stage method. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and <math display="inline"> u^{n+1} = u(t^{n+1}, x) </math> (note that <math display="inline"> c_j<1 ~ \forall ~ j=1,...,q </math>). This general form includes both explicit and implicit time-stepping schemes.<br />
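As a hedged sanity check of this general form (an illustration, not part of the paper), the classic explicit four-stage Runge-Kutta tableau can be plugged into the stage equations above for the simple choice <math display="inline"> N[u] = u </math>, whose exact solution is <math display="inline"> e^{-t} </math>:

```python
import numpy as np

# One step of the stage form above, using the classic explicit RK4
# Butcher tableau. For u_t + N[u] = 0 with N[u] = u, the exact solution
# is exp(-t), so a step of size dt from u^n = 1 should reproduce
# exp(-dt) up to the O(dt^5) local error.
N = lambda u: u                     # the (here linear) operator N[u]
A = np.array([[0, 0, 0, 0],         # stage coefficients a_ij for RK4
              [0.5, 0, 0, 0],
              [0, 0.5, 0, 0],
              [0, 0, 1, 0]])
b = np.array([1/6, 1/3, 1/3, 1/6])  # weights b_j
dt, u_n = 0.1, 1.0

# stage values u^{n+c_i} = u^n - dt * sum_j a_ij N[u^{n+c_j}]
stages = np.zeros(4)
for i in range(4):
    stages[i] = u_n - dt * sum(A[i, j] * N(stages[j]) for j in range(i))

# full step u^{n+1} = u^n - dt * sum_j b_j N[u^{n+c_j}]
u_next = u_n - dt * np.dot(b, N(stages))
# u_next agrees with exp(-0.1) to roughly 1e-7
```

An implicit scheme would instead require solving the stage equations simultaneously, which is exactly the structure the PINN exploits by predicting all <math display="inline"> q </math> stages at once.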
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network and trained a shared set of weights belonging to <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> as input and outputs all of the intermediate stages of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. To find the predictions at the two snapshots, the Runge-Kutta equations will need to be inverted and solved for the initial and final values as functions of the stages, which is easily done. However, notice that each Runge-Kutta stage produces its own prediction for the snapshots, so our loss function will need to incorporate all of these predictions. Accordingly, our new loss function becomes:<br />
<br />
\begin{align*}<br />
SSE = SSE_n + SSE_{n+1} <br />
\end{align*}<br />
<br />
where<br />
<br />
\begin{align*}<br />
SSE_n = \sum^q_{j=1} \sum^{N_n}_{i=1} (u^n_j(x^{n,i}) - u^{n,i})^2,<br />
\end{align*}<br />
<br />
\begin{align*}<br />
SSE_{n+1} = \sum^q_{j=1} \sum^{N_{n+1}}_{i=1} (u^{n+1}_j(x^{n+1,i}) - u^{n+1,i})^2,<br />
\end{align*}<br />
<br />
<math display="inline"> N_n </math> is the number of datapoints at <math display="inline"> t_n </math>, and <math display="inline"> N_{n+1} </math> is the number of datapoints at <math display="inline"> t_{n+1} </math>. For an example of the discrete-time data case, see the example below including figure 3.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and should therefore cover the full procedure.<br />
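A deliberately simplified illustration of this idea (a toy ODE rather than the paper's full PINN): treating the unknown parameter as a trainable quantity and minimizing the mean squared PDE residual recovers it from data. For <math display="inline"> u_t + \lambda u = 0 </math> with data generated from <math display="inline"> u(t) = e^{-2t} </math>, the residual loss is quadratic in <math display="inline"> \lambda </math>, so its minimizer can even be written in closed form:

```python
import numpy as np

# Toy "discovery" problem (a simplification of the paper's setting):
# recover lambda in u_t + lambda*u = 0 from noisy samples of
# u(t) = exp(-2t), so the true value is lambda = 2. Minimizing the
# mean squared residual (u_t + lambda*u)^2 over the samples has the
# closed-form solution below; a PINN would instead find lambda by
# gradient-based optimization alongside the network weights.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)
u = np.exp(-2 * t) * (1 + 0.001 * rng.uniform(-1, 1, t.size))  # noisy data

u_t = np.gradient(u, t)                 # finite-difference stand-in for autograd
lam = -np.sum(u_t * u) / np.sum(u * u)  # argmin of mean (u_t + lam*u)^2
# lam comes out close to the true value of 2
```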
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known to be a challenging problem to solve because of the shock (discontinuity) that forms in the solution after sufficiently large time. Using PINNs, however, this shock is handled easily.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation's form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
We further assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. The network is trained using the procedure outlined above, with a deep neural network of 9 layers, 20 neurons per hidden layer, and the L-BFGS optimizer.<br />
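The noise model described above can be sketched as follows (a hedged reading of "perturbed by up to 1% of its value"; the paper's exact perturbation scheme may differ in detail):

```python
import numpy as np

# Sketch of the noise model: each known solution value is multiplied by
# a random factor in [0.99, 1.01], i.e. perturbed by up to 1% of its
# magnitude. The values below are placeholders, not the paper's dataset.
rng = np.random.default_rng(0)
u_exact = np.sin(np.pi * np.linspace(-1, 1, 2000))        # placeholder values
u_noisy = u_exact * (1 + 0.01 * rng.uniform(-1, 1, u_exact.size))
# every sample stays within 1% of its exact value
```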
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times, showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise is shown.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
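A quick computation confirms the claim about machine precision: even with a time step as large as the full gap between the two snapshots, the formal truncation error <math display="inline"> O(\Delta t^{2q}) </math> at <math display="inline"> q = 500 </math> is vanishingly small:

```python
# Order-of-magnitude check of the truncation-error claim. With q = 500
# stages and a step spanning the full gap between the snapshots
# (dt = 0.9 - 0.1 = 0.8), the formal error dt^(2q) is around 1e-97,
# some 80 orders of magnitude below double-precision machine epsilon.
q, dt = 500, 0.8
truncation = dt ** (2 * q)      # 0.8^1000, roughly 1e-97
eps = 2.0 ** -52                # double-precision machine epsilon, ~2.2e-16
```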
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (u u_x + v u_y) &= -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (u v_x + v v_y) &= -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
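The reason for introducing <math display="inline"> \psi </math> is that the definitions <math display="inline"> u = \psi_y </math> and <math display="inline"> v = -\psi_x </math> make the velocity field automatically divergence-free, since <math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math> by equality of mixed partials, so incompressibility holds by construction. A small numerical check of this (with an arbitrary illustrative <math display="inline"> \psi </math>, and finite differences standing in for autograd):

```python
import numpy as np

# Check that u = psi_y, v = -psi_x gives a divergence-free velocity
# field for any smooth psi. The particular psi below is arbitrary,
# chosen only for illustration.
psi = lambda x, y: np.sin(x) * np.cos(2 * y) + x * y**2

h = 1e-3
x, y = 0.3, -0.7
u = (psi(x, y + h) - psi(x, y - h)) / (2 * h)         # u = psi_y
v = -(psi(x + h, y) - psi(x - h, y)) / (2 * h)        # v = -psi_x

# divergence u_x + v_y via second-order mixed-partial stencils
u_x = (psi(x + h, y + h) - psi(x + h, y - h)
       - psi(x - h, y + h) + psi(x - h, y - h)) / (4 * h**2)
v_y = -(psi(x + h, y + h) - psi(x - h, y + h)
        - psi(x + h, y - h) + psi(x - h, y - h)) / (4 * h**2)
div = u_x + v_y   # ~0, since psi_yx = psi_xy
```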
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results are shown in figure 4.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. This allows for prediction of the full solution, accommodation of noise in the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by the same group have received many citations, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Then the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math>, derivatives of the network for <math display="inline"> u(t,x) </math> need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks are shared, since <math display="inline"> f(t,x) </math> is determined by <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
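To make the idea of automatic differentiation concrete, here is a toy forward-mode implementation using dual numbers. This is an illustrative sketch only: real PINN implementations rely on the reverse-mode automatic differentiation built into frameworks such as TensorFlow or PyTorch, and the `Dual` class and `derivative` helper below are invented for this example.<br />

```python
# Toy forward-mode automatic differentiation with dual numbers.
# Unlike finite differences, the derivative propagated here is exact
# (to machine precision), which is what "network differentiation"
# relies on in practice.

class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative w.r.t. the chosen input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    """Exact df/dx at x, computed by seeding the dual part with 1."""
    return f(Dual(x, 1.0)).dot

# d/dx (x^2 + 3x) = 2x + 3, so at x = 2.0 this gives 7.0
print(derivative(lambda x: x * x + 3 * x, 2.0))
```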
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in figure 1.<br />
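The two-part loss above can be sketched in a few lines. The array names (`u_pred`, `u_data`, `f_resid`) are illustrative placeholders: in a real PINN, `u_pred` comes from the network and `f_resid` is computed from it by automatic differentiation.<br />

```python
import numpy as np

def pinn_loss(u_pred, u_data, f_resid):
    """MSE = MSE_u + MSE_f, as defined above."""
    mse_u = np.mean((u_pred - u_data) ** 2)  # data-fit term
    mse_f = np.mean(f_resid ** 2)            # PDE-residual term
    return mse_u + mse_f

# A network that fits the data exactly AND satisfies the PDE exactly
# (zero residual) drives the loss to zero:
u = np.array([0.0, 0.5, 1.0])
print(pinn_loss(u, u, np.zeros(3)))
```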
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. These cases can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete time models, we must leverage Runge-Kutta methods for numerical solutions of differential equations [?]. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the full time step. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and the general form includes both explicit and implicit time-stepping schemes. For more information on Runge-Kutta methods, see [?].<br />
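As a concrete sketch of the general form above, the following implements one explicit Runge-Kutta step for an autonomous ODE <math display="inline"> u' = g(u) </math> from a Butcher tableau <math display="inline"> (a_{ij}, b_j, c_i) </math>, checked with the classical 4-stage RK4 tableau; this mirrors the scheme above with <math display="inline"> g = -N[u] </math>. The function name and tableau layout are illustrative, and an implicit tableau would instead require solving for the stage values.<br />

```python
import numpy as np

def rk_step(g, u, dt, a, b, c):
    """One explicit Runge-Kutta step for u' = g(u).

    a, b, c form the Butcher tableau; c enters only for
    non-autonomous g(t, u) and is unused in this autonomous sketch.
    """
    q = len(b)
    k = np.zeros(q)
    for i in range(q):  # stage slopes (explicit: a is lower-triangular)
        k[i] = g(u + dt * sum(a[i][j] * k[j] for j in range(i)))
    return u + dt * sum(b[j] * k[j] for j in range(q))  # u^{n+1}

# Classical 4-stage RK4 tableau
a = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]]
b = [1 / 6, 1 / 3, 1 / 3, 1 / 6]
c = [0, 0.5, 0.5, 1]

# Integrate u' = -u from u(0) = 1 to t = 1; exact answer is e^{-1}.
u, dt = 1.0, 0.01
for _ in range(100):
    u = rk_step(lambda x: -x, u, dt, a, b, c)
print(abs(u - np.exp(-1)))  # O(dt^4) global error, well below 1e-8
```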
<br />
In the continuous-time case, we approximated the function <math display="inline"> u(t,x) </math> by a neural network with two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which takes only <math display="inline"> x </math> and outputs the values of the solution at the intermediate points of the Runge-Kutta time-stepping scheme, <math display="inline"> u^{n+c_j} </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one quantifying agreement with the data at the initial data snapshot and one quantifying agreement with the data at the final data snapshot. For an example of this, see figure ?.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the one above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged, except that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
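A stripped-down version of this discovery problem shows the idea: recover an unknown decay rate <math display="inline"> \lambda </math> in <math display="inline"> u_t + \lambda u = 0 </math> by gradient descent on the squared PDE residual. Everything here is illustrative, and unlike a real PINN the solution is supplied analytically so that only the parameter is trained.<br />

```python
import numpy as np

# Toy "discovery" problem: recover lambda in u_t + lambda*u = 0 from
# samples of its solution u(t) = exp(-lambda_true * t). In a real PINN,
# u and u_t would come from the network and automatic differentiation,
# and the network weights and lambda would be trained jointly.

lam_true = 1.5
t = np.linspace(0.0, 2.0, 50)
u = np.exp(-lam_true * t)
u_t = -lam_true * u               # time derivative (via AD in a PINN)

lam = 0.0                         # initial guess for trainable parameter
lr = 0.1
for _ in range(2000):
    resid = u_t + lam * u                # residual f = u_t + lambda*u
    grad = 2.0 * np.mean(resid * u)      # d/d(lambda) of mean(resid^2)
    lam -= lr * grad

print(round(lam, 6))  # recovers lambda_true = 1.5
```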
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known to be challenging to solve because of the shock (discontinuity) that forms in the solution after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
We also assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints scattered across the entire spatio-temporal domain (a mere 2.0% of the available solution data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value, making the dataset noisy. The network is trained using the procedure outlined above, with a deep neural network of 9 layers and 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
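One plausible way to generate the "up to 1% of its value" perturbation described above (the exact noise model is not specified here, so this is an assumption) is a uniform multiplicative perturbation:<br />

```python
import numpy as np

# Perturb each measurement by a uniform random fraction (at most 1%)
# of its own magnitude. The clean data here is an illustrative
# stand-in, not the actual Burgers' solution.
rng = np.random.default_rng(42)
u_clean = np.sin(np.pi * np.linspace(-1, 1, 2000))
noise = 0.01 * rng.uniform(-1, 1, u_clean.shape)
u_noisy = u_clean * (1 + noise)

# Maximum relative perturbation: at most 0.01 by construction.
print(np.max(np.abs(u_noisy - u_clean) / np.maximum(np.abs(u_clean), 1e-12)))
```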
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times, showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can be seen, along with the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the correct parameter values in both the noisy and noiseless cases. Figure 2 compares the error in the predicted parameter values for different amounts of known data and noise.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>) is far below machine precision.<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
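A useful consequence of the streamfunction ansatz is that the resulting velocity field is automatically divergence-free (<math display="inline"> u_x + v_y = \psi_{yx} - \psi_{xy} = 0 </math>), so incompressibility never needs to be enforced in the loss. A quick finite-difference sanity check on an arbitrary, illustrative <math display="inline"> \psi </math>:<br />

```python
import numpy as np

# The ansatz u = psi_y, v = -psi_x satisfies u_x + v_y = 0 identically
# because mixed partial derivatives commute. Checking numerically with
# an arbitrary smooth psi (the choice below is illustrative):
x = np.linspace(0, 1, 201)
y = np.linspace(0, 1, 201)
X, Y = np.meshgrid(x, y, indexing="ij")
psi = np.sin(2 * np.pi * X) * np.cos(3 * np.pi * Y)

dx, dy = x[1] - x[0], y[1] - y[0]
u = np.gradient(psi, dy, axis=1)    # u = psi_y
v = -np.gradient(psi, dx, axis=0)   # v = -psi_x
div = np.gradient(u, dx, axis=0) + np.gradient(v, dy, axis=1)
print(np.max(np.abs(div)))  # numerically zero: discrete mixed
                            # derivatives along different axes commute
```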
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above, with the addition of a second neural network approximating the pressure field. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. See the results of this in figure ?.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The method allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by the group have been widely cited, and the group is patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Au- tomatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44085Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T15:29:54Z<p>Cfmeaney: /* Navier-Stokes with Pressure */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent of the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivates of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in figure 1.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. These cases and can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete time models, we must leverage Runge-Kutta methods for numerical solutions of differential equations [?]. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the full time step. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and the general form includes both explicit and implicit time-stepping schemes. For more information of Runge-Kutta methods, see [?].<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> and outputs the values of the solution at the intermediate points of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. For an example of this, see figure ?.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and should therefore cover the full procedure.<br />
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known to be challenging to solve because of the shock (a steep, near-discontinuous front) that forms after sufficiently large time. However, PINNs handle this shockwave easily.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the form of the equation:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints scattered across the entire spatio-temporal domain (a mere 2.0% of the available data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. The network is trained using the procedure outlined above, with 9 layers of 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
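The data setup and combined loss described above can be sketched as follows (a scaffold only: the sampling domain, the placeholder "exact" values, and the stand-in predictions are my own illustrations, since the real <math display="inline"> u </math> and <math display="inline"> f </math> predictions come from the network and automatic differentiation):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample N_u = 2000 random points in t in [0,1], x in [-1,1] and build a
# noisy dataset with up to 1% multiplicative perturbation, as in the text.
N_u = 2000
t = rng.uniform(0.0, 1.0, N_u)
x = rng.uniform(-1.0, 1.0, N_u)
u_exact = -np.sin(np.pi * x) * np.exp(-t)   # placeholder "solution" values
u_data = u_exact * (1.0 + 0.01 * rng.uniform(-1.0, 1.0, N_u))

# Combined loss: data-fit term MSE_u plus PDE-residual term MSE_f.
u_pred = u_exact                 # stand-in for the network output u(t,x)
f_pred = np.zeros(N_u)           # stand-in for the PDE residual f(t,x)
mse = np.mean((u_pred - u_data) ** 2) + np.mean(f_pred ** 2)
print(mse < 1e-3)                # True: the noise is at most 1% of |u|
```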
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution is shown with the datapoints selected for training marked. In the middle panel, the exact and predicted solutions are compared at three different times, showing the accuracy of the PINN prediction. In the bottom panel, the exact and predicted parameter values are compared; the parameter predictions for the noiseless-data case are also included for comparison. Notice the remarkable accuracy with which the PINN predicts the correct parameter values in both the noisy and noiseless cases. Figure 2 shows the error in the predicted parameter values for different amounts of known data and different noise levels.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
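A quick back-of-the-envelope check of that truncation-error claim (assuming the single step spans the gap between the two snapshots, <math display="inline"> \Delta t = 0.9 - 0.1 = 0.8 </math>):<br />

```python
import math

# Order-of-magnitude check of the O(dt**(2q)) claim: with dt = 0.8 and
# q = 500 stages, the theoretical error scale is astronomically smaller
# than double-precision round-off (~1e-16).
dt, q = 0.8, 500
log10_error = 2 * q * math.log10(dt)
print(log10_error)    # about -96.9, i.e. an error scale near 1e-97
```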
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
==== Figure 3 ====<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem, we can assume that there is a scalar function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
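The point of introducing <math display="inline"> \psi </math> is that setting <math display="inline"> u = \psi_y </math> and <math display="inline"> v = -\psi_x </math> makes the flow automatically divergence-free, so the continuity equation <math display="inline"> u_x + v_y = 0 </math> never needs to be enforced in the loss. A numerical check of this property (with an arbitrary smooth stream function of my own choosing, not one from the paper):<br />

```python
import numpy as np

# For any smooth psi, u = psi_y and v = -psi_x give
# u_x + v_y = psi_yx - psi_xy = 0 (mixed partials commute).
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.linspace(0.0, 2.0 * np.pi, 200)
X, Y = np.meshgrid(x, y, indexing="ij")
psi = np.sin(X) * np.cos(Y)            # an arbitrary smooth stream function

u = np.gradient(psi, y, axis=1)        # u = psi_y
v = -np.gradient(psi, x, axis=0)       # v = -psi_x
div = np.gradient(u, x, axis=0) + np.gradient(v, y, axis=1)
print(np.max(np.abs(div)) < 1e-10)     # True: the discrete mixed derivatives commute
```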
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. The results can be seen in figure 4.<br />
<br />
==== Figure 4 ====<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks (PINNs), a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The method allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by the same group have been widely cited, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44084Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T15:29:40Z<p>Cfmeaney: /* Discrete-Time Example */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left-hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivatives of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks are shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network fits the known data points and is given by:<br />
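<br />
To make the automatic-differentiation step concrete, the sketch below implements minimal forward-mode automatic differentiation with dual numbers in plain Python. This is for illustration only; it is not the authors' implementation, and practical PINNs use the reverse-mode frameworks surveyed in [2].<br />
<br />
```python
# Minimal forward-mode automatic differentiation with dual numbers.
# Illustrative sketch only; real PINN implementations rely on
# reverse-mode AD frameworks such as those surveyed in reference [2].
class Dual:
    """Number a + b*eps with eps**2 = 0; the 'der' slot carries the derivative."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def d_dx(f, x):
    """Derivative of f at x, exact to machine precision (no finite differences)."""
    return f(Dual(x, 1.0)).der

# Example: d/dx [x*x + 3x] = 2x + 3, so the derivative at x = 2.0 is 7.0.
print(d_dx(lambda x: x * x + 3 * x, 2.0))
```
<br />
Unlike a finite-difference approximation, the derivative propagated this way carries no truncation error, which is why automatic differentiation is the standard choice for computing the PDE residual.<br />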
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer <math display="inline"> f </math> is to zero, the better the neural network satisfies the PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from the PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing the network to learn from fewer datapoints than would otherwise be necessary. An example of this method can be seen in figure 1.<br />
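<br />
As a toy illustration of the two-part loss (not the authors' code), the sketch below evaluates <math display="inline"> MSE_u </math> and <math display="inline"> MSE_f </math> for the advection equation <math display="inline"> u_t + u_x = 0 </math>, with a closed-form trial function standing in for the neural network and central finite differences standing in for automatic differentiation:<br />
<br />
```python
# Sketch of the composite PINN loss for u_t + N[u] = 0 with N[u] = u_x.
# A closed-form trial function stands in for the trained network, and
# central finite differences stand in for automatic differentiation;
# both substitutions are for illustration only.
import math, random

def u_hat(t, x):                      # stand-in for the network u(t, x)
    return math.sin(x - t)            # happens to solve u_t + u_x = 0 exactly

def residual(t, x, h=1e-5):           # f = u_t + u_x via central differences
    u_t = (u_hat(t + h, x) - u_hat(t - h, x)) / (2 * h)
    u_x = (u_hat(t, x + h) - u_hat(t, x - h)) / (2 * h)
    return u_t + u_x

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(50)]
data = [(t, x, math.sin(x - t)) for t, x in pts]   # noiseless 'measurements'

mse_u = sum((u_hat(t, x) - u) ** 2 for t, x, u in data) / len(data)
mse_f = sum(residual(t, x) ** 2 for t, x in pts) / len(pts)
loss = mse_u + mse_f                  # the combined objective MSE = MSE_u + MSE_f
print(loss)                           # near zero: u_hat satisfies both parts
```
<br />
In an actual PINN, this combined loss is what the optimizer minimizes over the shared network weights; here the trial function already satisfies both terms, so the loss is effectively zero.<br />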
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather is only present at two particular times. This is known as the discrete-time case, and it occurs frequently in real-world settings, such as when dealing with a pair of snapshots or medical images with no data in between. These cases can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we leverage Runge-Kutta methods for the numerical solution of differential equations [?]. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to advance the full time step. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and the general form includes both explicit and implicit time-stepping schemes. For more information on Runge-Kutta methods, see [?].<br />
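<br />
As an illustration (not from the paper), the sketch below specializes the general <math display="inline"> q </math>-stage form above to the classical explicit four-stage RK4 tableau, for the scalar test problem <math display="inline"> u_t + N[u] = 0 </math> with <math display="inline"> N[u] = u </math>, whose exact solution is <math display="inline"> u(t) = u(0)e^{-t} </math>:<br />
<br />
```python
# The general q-stage Runge-Kutta form above, with the classical RK4
# Butcher tableau, applied to u_t + N[u] = 0 for N[u] = u.
# Note the minus signs: the PDE form gives u_t = -N[u].
import math

A = [[0.0, 0.0, 0.0, 0.0],            # Butcher matrix a_ij (explicit scheme:
     [0.5, 0.0, 0.0, 0.0],            # only lower-triangular entries used)
     [0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [1/6, 1/3, 1/3, 1/6]              # weights b_j

def N(u):
    return u                          # N[u] = u, so u_t = -u

def rk_step(u, dt):
    q = len(b)
    stages = [0.0] * q                # the intermediate values u^{n+c_i}
    for i in range(q):
        stages[i] = u - dt * sum(A[i][j] * N(stages[j]) for j in range(i))
    return u - dt * sum(b[j] * N(stages[j]) for j in range(q))

u, dt = 1.0, 0.01
for _ in range(100):                  # integrate from t = 0 to t = 1
    u = rk_step(u, dt)
print(u, math.exp(-1))                # RK4 is 4th order: the values agree closely
```
<br />
The discrete-time PINN uses exactly this stage structure, except that the stage values <math display="inline"> u^{n+c_j} </math> are produced by the neural network rather than computed recursively.<br />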
<br />
In the continuous-time case, we approximated the function <math display="inline"> u(t,x) </math> by a neural network with two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which takes only <math display="inline"> x </math> and outputs the values of the solution at the intermediate points of the Runge-Kutta time-stepping scheme, <math display="inline"> u^{n+c_j} </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one quantifying agreement with the data at the initial data snapshot and one quantifying agreement with the data at the final data snapshot. For an example of this, see figure ?.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and should therefore cover the full procedure.<br />
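<br />
To illustrate the idea of treating <math display="inline"> \vec{\lambda} </math> as learnable (a simplified stand-in for the paper's procedure, not the authors' code), the sketch below recovers the parameter of <math display="inline"> u_t + \lambda u_x = 0 </math> by minimizing <math display="inline"> MSE_f </math> over <math display="inline"> \lambda </math>. For this linear-in-<math display="inline"> \lambda </math> residual the minimizer has a closed form; in an actual PINN, <math display="inline"> \lambda </math> is simply updated by the same optimizer as the network weights.<br />
<br />
```python
# Parameter discovery sketch for u_t + lambda * u_x = 0.
# The exact solution stands in for a trained network, and central
# finite differences stand in for automatic differentiation.
import math, random

TRUE_LAMBDA = 2.0                     # value used to generate the 'data'

def u(t, x):                          # stand-in for the trained network
    return math.sin(x - TRUE_LAMBDA * t)

def grads(t, x, h=1e-5):              # u_t and u_x by central differences
    u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_x = (u(t, x + h) - u(t, x - h)) / (2 * h)
    return u_t, u_x

random.seed(1)
pts = [grads(random.random(), random.random()) for _ in range(200)]

# argmin over lambda of sum_i (u_t^i + lambda * u_x^i)^2
# has the least-squares solution:
lam = -sum(ut * ux for ut, ux in pts) / sum(ux * ux for _, ux in pts)
print(lam)                            # recovers a value close to 2.0
```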
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known to be a challenging problem to solve numerically because of the shock (discontinuity) that forms after sufficiently long time. The PINN, however, handles this shock without difficulty.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the form of the equation:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
We also assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints scattered across the entire spatio-temporal domain (a mere 2% of the available solution data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value, making the dataset noisy. The network is trained using the procedure outlined above, with a deep neural network of 9 layers with 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times, showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel are the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the correct parameter values in both noisy and noiseless cases. Figure 2 shows the error in the predicted parameter values for different amounts of known data and different noise levels.<br />
<br />
==== Figure 1 ====<br />
[[File:fig1_Cam.png]]<br />
<br />
==== Figure 2 ====<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math>. In the middle panel, the exact and predicted solutions are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network predicts the parameter values.<br />
<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting example of this is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (uu_x + vu_y) &= -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (uv_x + vv_y) &= -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Because the flow is incompressible, we can assume that there is a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
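<br />
Writing the velocity in terms of the stream function automatically enforces the incompressibility (continuity) condition, since mixed partial derivatives commute:<br />
<br />
\begin{align*}<br />
u_x + v_y = \psi_{yx} - \psi_{xy} = 0.<br />
\end{align*}<br />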
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. See the results of this in figure ?.<br />
<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of physical systems in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The approach allows for prediction of the full solution, incorporation of noise in the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have been widely cited, and the authors are in the process of patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent of the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivates of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
Notice that since <math display="inline"> f </math> is all of the PDE terms moved to one side of the equation, the closer that <math display="inline"> f </math> is to zero, the better that the neural network satisfies to PDE. The full loss function used in the optimization is then taken to be the sum of these two parts:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This effectively regularizes the optimization, allowing for the network to learn from a smaller number of datapoints than would otherwise be necessary. An example of this method can be seen in figure 1.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. These cases and can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete time models, we must leverage Runge-Kutta methods for numerical solutions of differential equations [?]. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the full time step. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and the general form includes both explicit and implicit time-stepping schemes. For more information of Runge-Kutta methods, see [?].<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> and outputs the values of the solution at the intermediate points of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. For an example of this, see figure ?.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and should therefore cover the full procedure.<br />
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve because of the shock (discontinuity) that forms after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we also assume that we are ignorant of the initial conditions and boundary conditions which generate the solution. Importantly, information form the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints across the entire spatio-temporal domain (representing a mere 2.0% of the known data). The correct values of the parameters which are used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution for each of the known datapoints is randomly perturbed by up to 1% of its value - making the dataset noisy. This problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer and using the L-BFGS optimizer.<br />
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution can be seen with the datapoints selected for training pointed out. In the middle panel, a comparison of the exact and predicted solutions can be seen for three different times showing the accuracy of the PINN prediction. In the bottom panel, a comparison of the exact and predicted parameter values can also be seen. Also included in this bottom panel is the parameter predictions for the noiseless data case for comparison. Notice the remarkable accuracy with which the PINN is able to predict the correct parameter values in both noisy and noiseless cases. In figure 2, a comparison of the error in the predicted parameter values for different amounts of known data and noise.<br />
<br />
[[File:fig1_Cam.png]]<br />
<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider the Burger's equation but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise is the same as in the continuous case and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimates for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burger's equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
[[File:fig3_Cam.png]]<br />
<br />
</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44081Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T15:22:21Z<p>Cfmeaney: /* Continuous-Time Models */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers face the challenge of generating results from partial or incomplete datasets. Regularization techniques, or methods which can artificially inflate the dataset, become particularly useful in these situations; however, such techniques are often highly dependent on the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a potentially nonlinear differential operator. This general form encompasses a wide array of PDEs used across the physical sciences, including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network, since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> from <math display="inline"> u(t,x) </math>, derivatives of the network for <math display="inline"> u(t,x) </math> must be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks are shared. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N_u} \sum_{i=1}^{N_u} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
The full loss function used in the optimization is then taken to be the sum of these two functions:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from the PDE) can be incorporated into the neural network. This allows the network to approximate the solution after training on only a small number of data points. An example of this method can be seen in figure ?.<br />
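As a concrete toy illustration of this combined loss - not the authors' implementation - the sketch below evaluates <math display="inline"> MSE = MSE_u + MSE_f </math> for the heat equation <math display="inline"> u_t - u_{xx} = 0 </math> (i.e. <math display="inline"> N[u] = -u_{xx} </math>), using finite differences in place of automatic differentiation. All function and variable names are illustrative.<br />

```python
import numpy as np

# Toy sketch of the combined PINN loss MSE_u + MSE_f for the heat
# equation u_t - u_xx = 0 (so N[u] = -u_xx). The candidate solution is
# a plain callable; derivatives are taken by finite differences here,
# whereas the paper uses automatic differentiation on a neural network.
def pinn_loss(u, data_pts, data_vals, colloc_pts, h=1e-3):
    # MSE_u: misfit against the known (possibly noisy) measurements
    mse_u = np.mean([(u(t, x) - v) ** 2
                     for (t, x), v in zip(data_pts, data_vals)])

    # f = u_t - u_xx, the PDE residual, via central differences
    def f(t, x):
        u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
        u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h ** 2
        return u_t - u_xx

    # MSE_f: squared residual evaluated at the collocation points
    mse_f = np.mean([f(t, x) ** 2 for (t, x) in colloc_pts])
    return mse_u + mse_f

# u(t,x) = exp(-t) sin(x) solves the heat equation exactly, so both
# terms of the loss vanish up to finite-difference error.
exact = lambda t, x: np.exp(-t) * np.sin(x)
pts = [(0.1, 0.5), (0.2, 1.0), (0.7, 2.0)]
vals = [exact(t, x) for (t, x) in pts]
loss = pinn_loss(exact, pts, vals, pts)
```

An exact solution drives both terms to (numerically) zero, while a candidate that fits the data but violates the PDE is penalized through <math display="inline"> MSE_f </math>.<br />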
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples, such as when dealing with discrete pictures or medical images with no data between them. These cases can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete-time models, we must leverage Runge-Kutta methods for numerical solutions of differential equations [?]. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the full time step. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and the general form includes both explicit and implicit time-stepping schemes. For more information on Runge-Kutta methods, see [?].<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network, so our approximation had two inputs and one output. In the discrete case, instead of creating a neural network which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which takes only <math display="inline"> x </math> and outputs the values of the solution at the intermediate points of the Runge-Kutta time-stepping scheme, <math display="inline"> u^{n+c_j} </math> for <math display="inline"> j=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one quantifying agreement with the data at the initial data snapshot and one quantifying agreement with the data at the final data snapshot. For an example of this, see figure ?.<br />
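The Runge-Kutta stage form quoted above can be made concrete with a small numerical sketch. The code below (illustrative, not from the paper) applies the scheme with the classical explicit RK4 tableau to the scalar test problem <math display="inline"> u_t + N[u] = 0 </math> with <math display="inline"> N[u] = u </math>, whose exact solution is <math display="inline"> u(t) = e^{-t} </math>.<br />

```python
import numpy as np

# One step of the quoted scheme: u^{n+c_i} = u^n - dt * sum_j a_ij N[u^{n+c_j}],
# then u^{n+1} = u^n - dt * sum_j b_j N[u^{n+c_j}]. For an explicit method,
# the matrix a is strictly lower-triangular, so the stages can be computed
# in order without solving a nonlinear system. (The nodes c_j only matter
# when N depends explicitly on time, so they are omitted here.)
def rk_step(u_n, dt, N, a, b):
    q = len(b)
    stages = np.zeros(q)              # stage values u^{n+c_j}
    for i in range(q):
        stages[i] = u_n - dt * sum(a[i][j] * N(stages[j]) for j in range(i))
    return u_n - dt * sum(b[j] * N(stages[j]) for j in range(q))

# Classical RK4 coefficients (q = 4 stages, explicit)
a = [[0, 0, 0, 0],
     [0.5, 0, 0, 0],
     [0, 0.5, 0, 0],
     [0, 0, 1, 0]]
b = [1 / 6, 1 / 3, 1 / 3, 1 / 6]

u, dt = 1.0, 0.1
for _ in range(10):                   # integrate u' = -u from t=0 to t=1
    u = rk_step(u, dt, lambda v: v, a, b)
# u is now close to exp(-1)
```

The PINN version replaces the stage values with the outputs of a network in <math display="inline"> x </math>, but the algebraic structure of the update is exactly this one.<br />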
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the one above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged, except that we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and therefore cover the full procedure.<br />
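A minimal sketch of this parameter-discovery idea, stripped of the neural network entirely: given measurements of <math display="inline"> u </math>, treat the unknown parameter as a trainable variable and minimize the mean squared PDE residual by gradient descent. The model, data, and learning rate below are all illustrative.<br />

```python
import numpy as np

# Toy "discovery" problem: recover lambda in u_t + lambda * u = 0 from
# measurements of u alone. Data come from the exact solution u = exp(-2 t),
# so the true value is lambda = 2. The time derivative is estimated here
# with finite differences; in a PINN it would come from differentiating
# the network via automatic differentiation.
t = np.linspace(0.0, 1.0, 201)
u = np.exp(-2.0 * t)            # "measurements" of the solution
u_t = np.gradient(u, t)         # approximate u_t from the data

lam, lr = 0.0, 0.5              # trainable parameter and learning rate
for _ in range(500):
    # gradient of MSE_f(lambda) = mean((u_t + lambda * u)^2)
    grad = np.mean(2.0 * (u_t + lam * u) * u)
    lam -= lr * grad
# lam converges to approximately 2
```

In the full method, the same gradient steps update the network weights and <math display="inline"> \vec{\lambda} </math> simultaneously against the combined loss.<br />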
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burgers' equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burgers' equation is known to be challenging to solve because of the shock (discontinuity) that forms after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burgers' equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burgers' equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we assume that we are ignorant of the initial and boundary conditions which generate the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network - hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math> as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
<br />
For this example, assume that we have 2000 datapoints scattered across the entire spatio-temporal domain (a mere 2% of the full solution data). The correct values of the parameters used to generate the datapoints are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the value of the solution at each of the known datapoints is randomly perturbed by up to 1% of its value, making the dataset noisy. The network is trained using the procedure outlined above, with 9 layers of 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
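The construction of such a noisy training set can be sketched as follows. Note that the field used below is a placeholder function, not the true solution of Burgers' equation (which would come from a high-accuracy numerical solver); only the random sampling and the 1% perturbation are the point of the sketch.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Scatter 2000 training points across the domain t in [0,1], x in [-1,1]
t = rng.uniform(0.0, 1.0, 2000)
x = rng.uniform(-1.0, 1.0, 2000)

# Placeholder field standing in for the exact Burgers' solution
u_exact = -np.sin(np.pi * x) * np.exp(-t)

# Perturb each value by up to 1% of its magnitude to make the data noisy
u_noisy = u_exact * (1.0 + 0.01 * rng.uniform(-1.0, 1.0, 2000))
```

The pairs <math display="inline"> (t_u^i, x_u^i, u^i) </math> entering <math display="inline"> MSE_u </math> are exactly these sampled points and perturbed values.<br />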
<br />
The results of this example can be seen in figure 1. In the top panel, the exact solution is shown with the datapoints selected for training highlighted. In the middle panel, the exact and predicted solutions are compared at three different times, showing the accuracy of the PINN prediction. The bottom panel compares the exact and predicted parameter values, alongside the parameter predictions for the noiseless data case. Notice the remarkable accuracy with which the PINN predicts the correct parameter values in both the noisy and noiseless cases. Figure 2 compares the error in the predicted parameter values across different amounts of known data and noise levels.<br />
<br />
[[File:fig1_Cam.png]]<br />
<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but only allow ourselves data at two time snapshots. Specifically, our known data consists of 199 points at time <math display="inline"> t=0.1 </math><br />
and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>) is far below machine precision.<br />
<br />
The results of this example can be seen in figure 3. In the figure, the top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math><br />
and <math display="inline"> t=0.9 </math>. In the middle panel, the exact solution and predicted solution are compared at the two time snapshots. In the bottom panel, the predicted parameter values are reported for noisy and noiseless data. Notice the accuracy with which the network can predict the parameter values.<br />
<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 (u u_x + v u_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}) \\<br />
v_t + \lambda_1 (u v_x + v v_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are there two unknown parameters, <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, but there is also an entire unknown pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem (incompressibility of the flow), we can assume that there is a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
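The value of this stream-function construction is that any velocity field derived from a scalar <math display="inline"> \psi </math> via <math display="inline"> u = \psi_y </math>, <math display="inline"> v = -\psi_x </math> automatically satisfies the incompressibility condition <math display="inline"> u_x + v_y = 0 </math>, so the network's predicted velocities are divergence-free by construction. A quick numerical check with an arbitrary (illustrative) <math display="inline"> \psi </math>:<br />

```python
import numpy as np

# Check that u = psi_y, v = -psi_x gives a divergence-free field,
# using central finite differences and an arbitrary smooth psi.
psi = lambda x, y: np.sin(x) * np.cos(y)
h = 1e-4

def u(x, y):                       # u = psi_y
    return (psi(x, y + h) - psi(x, y - h)) / (2 * h)

def v(x, y):                       # v = -psi_x
    return -(psi(x + h, y) - psi(x - h, y)) / (2 * h)

# divergence u_x + v_y at an arbitrary point; should vanish
x0, y0 = 0.3, 0.7
div = ((u(x0 + h, y0) - u(x0 - h, y0)) / (2 * h)
       + (v(x0, y0 + h) - v(x0, y0 - h)) / (2 * h))
```

In the PINN, the same identity holds with the derivatives of the <math display="inline"> \psi </math> network computed by automatic differentiation rather than finite differences.<br />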
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. See the results of this in figure ?.<br />
<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks, a novel type of function-approximating neural network that utilizes existing knowledge of a physical system in order to train using a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The approach allows for prediction of the full solution, incorporation of noise into the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions appearing in the PDE. A variation of the main technique allows for predictions originating from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by this group have received many citations, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767.</div>Cfmeaneyhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations&diff=44080Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations2020-11-14T15:18:47Z<p>Cfmeaney: /* Continuous-Time Models */</p>
<hr />
<div>== Presented by == <br />
Cameron Meaney<br />
<br />
== Introduction ==<br />
<br />
In recent years, there has been an enormous growth in the amount of data and computing power available to researchers. Unfortunately, for many real-world scenarios, the cost of data acquisition is simply too high to collect an amount of data sufficient to guarantee robustness or convergence of training algorithms. In such situations, researchers are faced with the challenge of trying to generate results based on partial or incomplete datasets. Regularization techniques or methods which can artificially inflate the dataset become particularly useful in these situations; however, such techniques are often highly dependent of the specifics of the problem.<br />
<br />
Luckily, in important real-world scenarios that we endeavor to analyze, there is often a wealth of existing information from which we can draw. This existing information commonly manifests in the form of a mathematical model, particularly a set of partial differential equations (PDEs). In this paper, the authors provide a technique for incorporating the information of a physical system contained in a PDE into the optimization of a deep neural network. This technique is most useful in situations where established PDE models exist, but where our amount of available data is too small to guarantee robustness or convergence in neural network training. In essence, the accompanying PDE model can be used as a regularization agent, constraining the space of acceptable solutions to help the optimization converge more quickly and more accurately.<br />
<br />
== Problem Setup ==<br />
<br />
Consider the following general PDE,<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> is the function we wish to find, subscripts denote partial derivatives, <math display="inline"> \vec{\lambda} </math> is the set of parameters on which the PDE depends, and <math display="inline"> N </math> is a differential, potentially nonlinear operator. This general form encompasses a wide array of PDEs used across the physical sciences including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. Suppose that we have noisy measurements of the PDE solution, <math display="inline"> u </math>, scattered across the spatio-temporal input domain. Then, we are interested in answering two questions about the physical system:<br />
<br />
(1) Given fixed model parameters, <math display="inline"> \vec{\lambda} </math>, what can be said about the unknown hidden state <math display="inline"> u(t,x) </math>?<br />
<br />
and<br />
<br />
(2) What set of parameters, <math display="inline"> \vec{\lambda} </math>, best describe the observed data for this PDE system?<br />
<br />
== Data-Driven Solutions of PDEs ==<br />
<br />
We will begin by attempting to answer the first of the questions above. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u] = 0,<br />
\end{align*}<br />
<br />
can we estimate the full solution, <math display="inline"> u(t,x) </math>, by approximating it with a deep neural network? Note that <math display="inline"> \vec{\lambda} </math> no longer appears in the operator because we assume those values to be known. Approximating the solution of the PDE with a neural network results in what the authors refer to as a 'Physics-Informed Neural Network' (PINN). Importantly, this technique is most useful when we are in the small-data regime - for if we had lots of data, it simply wouldn't be necessary to include information from the PDE because the data alone would be sufficient. In these examples, we are seeking to learn from a very small amount of data which makes information from the PDE necessary to include in order to generate meaningful results.<br />
<br />
The paper details two cases of data: continuous-time and discrete-time. Both cases are detailed individually below.<br />
<br />
<br />
=== Continuous-Time Models ===<br />
<br />
Consider the case where our noisy measurements of the solution are randomly scattered across the spatio-temporal input domain. This case is referred to as the continuous-time case. We define the function <br />
<br />
\begin{align*}<br />
f = u_t + N[u]<br />
\end{align*}<br />
<br />
as the left hand side of the PDE above. Now assume that the PDE solution, <math display="inline"> u(t,x) </math>, can be approximated by a deep neural network. Therefore, the function <math display="inline"> f(t,x) </math> can also be approximated by a neural network since it is simply a function of <math display="inline"> u(t,x) </math>. In order to calculate <math display="inline"> f(t,x) </math> as a function of <math display="inline"> u(t,x) </math>, derivates of the network <math display="inline"> u(t,x) </math> will need to be taken with respect to its inputs. This network differentiation is accomplished using a technique called automatic differentiation [2]. Importantly, the weights of the two neural networks will be shared, since <math display="inline"> f(t,x) </math> is simply a function of <math display="inline"> u(t,x) </math>. The key idea in finding this shared set of weights is to train the networks with a loss function that has two distinct parts. The first part quantifies how well the neural network satisfies the known data points and is given by:<br />
<br />
\begin{align*}<br />
MSE_u = \frac{1}{N} \sum_{i=1}^{N} [u(t_u^i,x_u^i) - u^i]^2<br />
\end{align*}<br />
<br />
where the summation is over the set of known data points. The second part of the loss function quantifies how well the neural network satisfies the PDE and is given by:<br />
<br />
\begin{align*}<br />
MSE_f = \frac{1}{N} \sum_{i=1}^{N} [f(t_u^i,x_u^i)]^2.<br />
\end{align*}<br />
<br />
The full loss function used in the optimization is then taken to be the sum of these two functions:<br />
<br />
\begin{align*}<br />
MSE = MSE_u + MSE_f.<br />
\end{align*}<br />
<br />
By using this loss function in the optimization, information from both the known data and the known physics (from PDE) can be incorporated into the neural network. This allows the network approximate the function by training on only a small number of data points. An example of this method can be seen in figure ?.<br />
<br />
=== Discrete-Time Models ===<br />
<br />
Now consider the case where our available data is not randomly scattered across the spatio-temporal domain, but rather only present at two particular times. This is known as the discrete-time case and occurs frequently in real-world examples such as when dealing with discrete pictures or medical images with no data between them. These cases and can be dealt with in the same manner as the continuous case with a few small adjustments. To adapt the PINN technique to discrete time models, we must leverage Runge-Kutta methods for numerical solutions of differential equations [?]. Runge-Kutta methods approximate the solution of a differential equation at the next numerical time step by first approximating the solution at a set of intermediate points between the time steps, then using these values to predict the full time step. The general form of a Runge-Kutta method with <math display="inline"> q </math> stages is given by:<br />
<br />
\begin{align*}<br />
u^{n+c_i} &= u^n - \Delta t \sum^q_{j=1} a_{ij} N[u^{n+c_j}], ~ i = 1,...,q \\<br />
u^{n+1} &= u^n - \Delta t \sum^q_{j=1} b_j N[u^{n+c_j}]<br />
\end{align*}<br />
<br />
where <math display="inline"> u^{n+c_j} = u(t^n + c_j \Delta t, x) </math> and the general form includes both explicit and implicit time-stepping schemes. For more information of Runge-Kutta methods, see [?].<br />
<br />
In the continuous-time case, we had approximated the function <math display="inline"> u(t,x) </math> by a neural network. Therefore, our neural network approximation for <math display="inline"> u(t,x) </math> had two inputs and one output. In the discrete case, instead of creating a neural netowrk which takes <math display="inline"> t </math> and <math display="inline"> x </math> as input and outputs the value of <math display="inline"> u(t,x) </math>, we create a neural network which only takes <math display="inline"> x </math> and outputs the values of the solution at the intermediate points of the Runge-Kutta time-stepping scheme, <math display="inline"> [u^{n+c_j}] </math> for <math display="inline"> i=1,...,q </math>. Therefore, the PINN that we create here has one input and <math display="inline"> q </math> outputs. Importantly, information from the PDE is now incorporated into the the Runge-Kutta time-stepping scheme, so we do not need to add a term to the loss function to include it. Instead, our discrete-time loss function consists of two parts - one to quantify agreement with the data at the time of the initial data snapshot and one to quantify the agreement with data at the final data snapshot. For an example of this, see figure ?.<br />
<br />
== Data-Driven Discovery of PDEs ==<br />
<br />
After having answered the first question, we can turn our focus to the second question. Specifically, if given a small amount of noisy measurements of the solution of the PDE<br />
<br />
\begin{align*}<br />
u_t + N[u;\vec{\lambda}] = 0,<br />
\end{align*}<br />
<br />
can we estimate the values of the parameters, <math display="inline"> \vec{\lambda} </math>, that best describe the observed data? The difference between this case and the above is that we no longer know the values of the parameters <math display="inline"> \vec{\lambda} </math> appearing in the PDE. The procedure is, in essence, unchanged other than we now treat the PDE parameters as trainable parameters of the neural network. Since the discovery case outlined here is an extension of the solution case outlined above, the examples given below include unknown parameters and should therefore cover the full procedure.<br />
<br />
<br />
== Examples ==<br />
<br />
While the paper gives many examples of the PINN method, three are outlined here to demonstrate the method's utility.<br />
<br />
=== Continuous-Time Example ===<br />
<br />
For an example of this method in action, consider a problem involving Burger's equation, given by:<br />
<br />
\begin{align*}<br />
&u_t + uu_x - (0.01/\pi)u_{xx} = 0, ~ x \in [-1,1], ~ t \in [0,1], \\<br />
&u(0,x) = -\sin(\pi x), \\<br />
&u(t, -1) = u(t,1) = 0.<br />
\end{align*}<br />
<br />
Notably, Burger's equation is known as a challenging problem to solve because of the shock (discontinuity) that forms after sufficiently large time. However, using PINNs, this shockwave is easily handled.<br />
<br />
So, assume that we are given noisy measurements of the solution of Burger's equation scattered across the spatio-temporal domain. Also assume that we do not know the values of the parameters in Burger's equation - we only know the equation form:<br />
<br />
\begin{align*}<br />
&u_t + \lambda_1 uu_x - \lambda_2 u_{xx} = 0.<br />
\end{align*}<br />
<br />
Additionally, we assume that we are ignorant of the initial and boundary conditions that generated the solution. Importantly, information from the initial and boundary conditions is contained in the known data points. We define the function <math display="inline"> f(t,x) </math> as:<br />
<br />
\begin{align*}<br />
f = u_t + \lambda_1 uu_x - \lambda_2 u_{xx}<br />
\end{align*}<br />
<br />
and assume that <math display="inline"> u(t,x) </math> is approximated by a deep neural network, hence creating a PINN. Then, the shared parameters of the neural networks for <math display="inline"> u(t,x) </math> and <math display="inline"> f(t,x) </math>, as well as the parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math>, are simultaneously learned by minimizing the combined loss function <math display="inline"> MSE = MSE_u + MSE_f </math> as defined above.<br />
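Computing <math display="inline"> f(t,x) </math> by automatic differentiation of the network for <math display="inline"> u(t,x) </math> is the crux of the method. A minimal PyTorch sketch (network size and all names are assumptions, not the authors' code):<br />

```python
import torch

# Sketch: the Burgers residual f = u_t + lambda_1*u*u_x - lambda_2*u_xx,
# computed by differentiating the network output w.r.t. its own inputs.
net = torch.nn.Sequential(torch.nn.Linear(2, 20), torch.nn.Tanh(),
                          torch.nn.Linear(20, 20), torch.nn.Tanh(),
                          torch.nn.Linear(20, 1))
lam1 = torch.nn.Parameter(torch.tensor(0.0))  # unknown PDE parameters,
lam2 = torch.nn.Parameter(torch.tensor(0.0))  # trained with the weights

def residual(t, x):
    t = t.requires_grad_(True)
    x = x.requires_grad_(True)
    u = net(torch.cat([t, x], dim=1))
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t + lam1 * u * u_x - lam2 * u_xx

t = torch.rand(8, 1)
x = torch.rand(8, 1)
f = residual(t, x)
mse_f = (f ** 2).mean()  # the PDE term of the combined loss
```

In training, `mse_f` would be added to the data-mismatch term <math display="inline"> MSE_u </math> to form the combined loss described above.<br />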
<br />
For this example, assume that we have 2000 data points scattered across the entire spatio-temporal domain (a mere 2.0% of the available solution data). The correct values of the parameters used to generate the data are <math display="inline"> \lambda_1 = 1.0 </math> and <math display="inline"> \lambda_2 = 0.01/\pi </math>. Also assume that the solution value at each known data point is randomly perturbed by up to 1% of its magnitude, making the dataset noisy. The problem is trained using the procedure outlined above with a deep neural network of 9 layers with 20 neurons per hidden layer, using the L-BFGS optimizer.<br />
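The training setup just described can be sketched as follows in PyTorch. The 9-layer, 20-neuron architecture comes from the text (read here as 9 linear layers with tanh activations, an assumption); the data below is random placeholder data standing in for the 2000 noisy measurements.<br />

```python
import torch

torch.manual_seed(0)

# Assumed reading of "9 layers with 20 neurons per hidden layer":
# 9 linear layers (8 hidden), tanh activations.
sizes = [2] + [20] * 8 + [1]
mods = []
for m, n in zip(sizes[:-2], sizes[1:-1]):
    mods += [torch.nn.Linear(m, n), torch.nn.Tanh()]
mods.append(torch.nn.Linear(sizes[-2], sizes[-1]))
net = torch.nn.Sequential(*mods)

# Placeholder (t, x, u) measurements.
t = torch.rand(32, 1)
x = torch.rand(32, 1)
u_data = torch.rand(32, 1)

opt = torch.optim.LBFGS(net.parameters(), max_iter=20)

def closure():
    # L-BFGS in PyTorch requires a closure that re-evaluates the loss.
    opt.zero_grad()
    loss = ((net(torch.cat([t, x], dim=1)) - u_data) ** 2).mean()
    loss.backward()
    return loss

final_loss = opt.step(closure)
```

In the real problem the closure would also include the PDE-residual term; only the data-fit term is shown here.<br />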
<br />
The results of this example can be seen in figure 1. The top panel shows the exact solution with the data points selected for training marked. The middle panel compares the exact and predicted solutions at three different times, showing the accuracy of the PINN prediction. The bottom panel compares the exact and predicted parameter values, and also includes the parameter predictions for the noiseless-data case for comparison. Notice the remarkable accuracy with which the PINN predicts the correct parameter values in both the noisy and noiseless cases. Figure 2 compares the error in the predicted parameter values for different amounts of known data and different noise levels.<br />
<br />
[[File:fig1_Cam.png]]<br />
<br />
[[File:fig2_Cam.png]]<br />
<br />
=== Discrete-Time Example ===<br />
<br />
For a discrete-time example, let us again consider Burgers' equation, but allow ourselves data at only two time snapshots. Specifically, our known data consist of 199 points at time <math display="inline"> t=0.1 </math> and 201 points at <math display="inline"> t=0.9 </math>. The correct parameter values and the dataset noise are the same as in the continuous case, and the procedure is as explained in the discrete-time section above. The neural network consists of four layers with 50 neurons per hidden layer. We choose the number of Runge-Kutta stages to be <math display="inline"> q=500 </math>, meaning that we approximate the solution at 500 intermediate time points. Note that the theoretical error estimate for a Runge-Kutta scheme with 500 stages is far below machine precision (truncation error of <math display="inline"> O(\Delta t^{2q}) </math>).<br />
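The <math display="inline"> O(\Delta t^{2q}) </math> estimate quoted above is the hallmark of a <math display="inline"> q </math>-stage implicit Runge-Kutta method based on Gauss-Legendre quadrature. As a sketch (assuming, consistently with that error estimate but not confirmed by the text, that the stage times <math display="inline"> c_j </math> are Gauss-Legendre nodes mapped to <math display="inline"> [0,1] </math>):<br />

```python
import numpy as np

# A q-stage Gauss-Legendre implicit RK scheme has order 2q; its stage times
# c_j are the Gauss-Legendre nodes rescaled from [-1, 1] to [0, 1].
# q = 5 here for readability; the example above uses q = 500.
q = 5
nodes, _ = np.polynomial.legendre.leggauss(q)  # nodes in (-1, 1)
c = 0.5 * (nodes + 1.0)                        # stage times c_j in (0, 1)
print(np.round(c, 4))
```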
<br />
The results of this example can be seen in figure 3. The top panel shows the exact solution of Burgers' equation with the known data at <math display="inline"> t=0.1 </math> and <math display="inline"> t=0.9 </math>. The middle panel compares the exact and predicted solutions at the two time snapshots. The bottom panel reports the predicted parameter values for noisy and noiseless data. Notice the accuracy with which the network predicts the parameter values.<br />
<br />
[[File:fig3_Cam.png]]<br />
<br />
=== Navier-Stokes with Pressure ===<br />
<br />
Naturally, there are many extensions to the base problem that the PINN method tackles. One particularly interesting extension is illustrated in the following example.<br />
<br />
Consider the Navier-Stokes equations in two dimensions, given by:<br />
<br />
\begin{align*}<br />
u_t + \lambda_1 ( uu_x + vu_y) = -p_x + \lambda_2 (u_{xx} + u_{yy}), \\<br />
v_t + \lambda_1 (uv_x + vv_y) = -p_y + \lambda_2 (v_{xx} + v_{yy})<br />
\end{align*}<br />
<br />
where <math display="inline"> u </math> and <math display="inline"> v </math> are the <math display="inline"> x </math> and <math display="inline"> y </math> components of the fluid velocity. In these equations, not only are the two parameters <math display="inline"> \lambda_1 </math> and <math display="inline"> \lambda_2 </math> unknown, but so is the entire pressure field, <math display="inline"> p(t,x,y) </math>. Based on the physics of the problem (mass conservation for an incompressible fluid), we can assume that there is a scalar stream function <math display="inline"> \psi </math> which satisfies <math display="inline"> \psi_y = u </math> and <math display="inline"> \psi_x = -v </math>. Assume that we have noisy measurements of the velocity field scattered across the spatio-temporal domain.<br />
<br />
We approximate <math display="inline"> \psi </math> with a PINN, proceeding as we did in the continuous case above with the addition of also approximating the pressure field with a neural network. With each training batch, the weights of both networks are updated. We can compute the components of the velocity by differentiating the network for <math display="inline"> \psi </math>. Our full loss function is defined as in the continuous case, but note that the term quantifying the satisfaction of the PDEs will depend on the pressure network.<br />
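Differentiating the <math display="inline"> \psi </math> network to obtain the velocities can be sketched as follows in PyTorch (`psi_net` and its size are assumed names, not the authors' code):<br />

```python
import torch

# Sketch: velocities are derivatives of the stream-function network
# psi(t, x, y), so u and v are never modeled directly.
psi_net = torch.nn.Sequential(torch.nn.Linear(3, 20), torch.nn.Tanh(),
                              torch.nn.Linear(20, 1))

def velocity(t, x, y):
    x = x.requires_grad_(True)
    y = y.requires_grad_(True)
    psi = psi_net(torch.cat([t, x, y], dim=1))
    ones = torch.ones_like(psi)
    u = torch.autograd.grad(psi, y, ones, create_graph=True)[0]   # u = psi_y
    v = -torch.autograd.grad(psi, x, ones, create_graph=True)[0]  # v = -psi_x
    return u, v

t = torch.rand(4, 1)
x = torch.rand(4, 1)
y = torch.rand(4, 1)
u, v = velocity(t, x, y)
```

Because <math display="inline"> u </math> and <math display="inline"> v </math> come from the same <math display="inline"> \psi </math>, the incompressibility condition <math display="inline"> u_x + v_y = 0 </math> holds automatically (mixed partials commute), which is the rationale for introducing the stream function.<br />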
<br />
We allow ourselves 1% of the total data and optimize the network as we did before. The network has 9 layers with 20 neurons per hidden layer. See the results of this in figure ?.<br />
<br />
[[File:fig4_Cam.png]]<br />
<br />
== Conclusion ==<br />
<br />
This paper introduces physics-informed neural networks (PINNs), a novel type of neural-network function approximator that exploits existing knowledge of a physical system in order to train on a small amount of data. It does this by incorporating information from a governing PDE model into the loss function. The approach allows prediction of the full solution, tolerance of noise in the measurements, estimation of model parameters appearing in the PDE, and approximation of auxiliary functions (such as the pressure field) appearing in the PDE. A variation of the main technique allows predictions from discrete-time data.<br />
<br />
PINNs are a powerful technique with many possible extensions. This paper and related papers by the same group have received many citations, and the authors are patenting the technique.<br />
<br />
== References ==<br />
<br />
[1] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 686–707.<br />
<br />
[2] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).</div>Cfmeaney