Time-series Generative Adversarial Networks
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A good time-series model should not only learn the overall distribution of temporal features at each time point, but also capture the dynamic relationships between those features across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data <sup>[1]</sup>. In this approach, the distribution of sequences is factored into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting, but it is not very promising in a generative setup. Conversely, applying the GAN framework directly to time-series simply tries to learn <math>p(X|t)</math> with a generator and discriminator, which fails to leverage the autoregressive conditional structure of the data the way autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to the difference between closed-loop training (conditioned on ground truth) and open-loop inference (conditioned on previous guesses), multi-step sampling in autoregressive recurrent networks can suffer from significant prediction error <sup>[2]</sup>. Several remedies have been proposed, including Scheduled Sampling <sup>[1]</sup>, where models are trained on a combination of ground truth and previous outputs; training an auxiliary discriminator that separates free-running from teacher-forced hidden states, accelerating convergence <sup>[3][4]</sup>; and actor-critic methods <sup>[5]</sup>, where a critic conditioned on target outputs estimates next-token values that nudge the actor's free-running predictions. While all these methods improve multi-step sampling, they remain inherently deterministic.<br />
<br />
Direct applications of the GAN architecture to time-series data, such as C-RNN-GAN or RCGAN <sup>[6]</sup>, generate the time-series recurrently, sometimes taking the generated output from the previous step as input along with the noise vector (as in RCGAN). Recently, conditioning on time-stamp information has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial (real/fake) feedback to learn, which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change with time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, an input to the model can be thought of as a tuple <math>(S, X_{1:T})</math> drawn from some joint distribution <math>p</math>. The first objective of the generative model is to learn from training data an approximation <math>\hat{p}(S, X)</math> of the original distribution <math>p(S, X)</math>. The second objective is to simultaneously learn the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T})\,||\,\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(X_t \mid S, X_{1:t-1})\,||\,\hat{p}(X_t \mid S, X_{1:t-1})\right)</math></div><br />
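To make the autoregressive decomposition concrete, here is a small numpy sketch (not from the paper; the AR(1) process and unit noise variance are illustrative choices) showing how a sequence's joint log-likelihood factorizes into a marginal term plus per-step conditional terms, mirroring <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>.<br />

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    # log density of a univariate Gaussian N(mean, var)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def ar1_loglik(x, phi):
    # Joint log-likelihood of a Gaussian AR(1) sequence
    # x_t = phi * x_{t-1} + n_t, n_t ~ N(0, 1), factored autoregressively.
    # Marginal term for x_1: stationary distribution N(0, 1 / (1 - phi^2)).
    ll = gaussian_logpdf(x[0], 0.0, 1.0 / (1.0 - phi ** 2))
    # Conditional terms p(x_t | x_{t-1}) = N(phi * x_{t-1}, 1).
    for t in range(1, len(x)):
        ll += gaussian_logpdf(x[t], phi * x[t - 1], 1.0)
    return ll
```

A generative model that matches both objectives above must get this whole sum right, not just the marginal distribution of individual time steps.<br />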
<br />
== Proposed Architecture ==<br />
Apart from the usual GAN components of a sequence generator and a sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how information flows between them during training.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representations, reducing the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of the static features <math>S</math> and the temporal features <math>X</math>. The embedding function then has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions are implemented using a recurrent network for '''e''' and a feedforward network for '''r''', but both can in principle be parameterized by any architecture of the appropriate form. <br />
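As a rough illustration of these two components (the sizes, the tanh RNN cell, and the linear readout are all illustrative assumptions, not the paper's exact implementation), a minimal numpy sketch might look like:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, latent_dim = 6, 3  # illustrative sizes, not from the paper

# Embedding function e: a single-layer tanh RNN mapping x_{1:T} -> h_{1:T}.
W_in = rng.normal(scale=0.1, size=(latent_dim, feat_dim))
W_rec = rng.normal(scale=0.1, size=(latent_dim, latent_dim))

def embed(x_seq):
    h = np.zeros(latent_dim)
    hs = []
    for x_t in x_seq:
        h = np.tanh(W_in @ x_t + W_rec @ h)  # recurrent update
        hs.append(h)
    return np.stack(hs)

# Recovery function r: a feedforward layer mapping each h_t back to features.
W_out = rng.normal(scale=0.1, size=(feat_dim, latent_dim))

def recover(h_seq):
    return h_seq @ W_out.T

x = rng.normal(size=(10, feat_dim))   # one sequence of length T = 10
x_tilde = recover(embed(x))           # reconstruction in feature space
```

The point of the sketch is only the shapes: the embedding compresses each step from <code>feat_dim</code> to <code>latent_dim</code>, and recovery maps back, so a reconstruction loss between <code>x</code> and <code>x_tilde</code> is well defined.<br />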
<br />
=== Sequence Generator and Discriminator ===<br />
Turning to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. These do not operate on the original feature space; instead, the generator uses random input noise to produce sequences in the latent space. That is, the generator takes the noise vectors <math>Z_s</math> and <math>Z_x</math> as input and turns them into latent representations <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate, reversible mapping between the feature space and the latent space. After the embedding function maps the original data <math>(s, x_{1:T})</math> into the latent space, i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:T}</math>, we get the first objective function, the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only receives the noise vector <math>Z</math> as input but also, in autoregressive fashion, its own previous outputs <math>h_s</math> and <math>h_{1:t-1}</math>. The generator uses these inputs to produce synthetic embeddings. The unsupervised gradients update the generator to decrease, and the discriminator to increase, the likelihood of correctly classifying the synthetic output. This gives the second objective function, the unsupervised loss:<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary feedback from the GAN's adversarial component, i.e. the discriminator. It also incorporates a supervised loss computed via the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previously produced synthetic embeddings. Maximizing the likelihood under this scheme produces the third objective, the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the supervised loss and the reconstruction loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, the paper proposes the following optimization problem for these two components:<br />
<div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda \geq 0</math> is used to balance the two losses. <br />
The other components of TimeGAN, i.e. the generator and the discriminator, are trained to minimize the supervised loss along with the unsupervised (adversarial) loss. This optimization problem is formulated as below:<br />
<div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta \geq 0 </math> is used to balance the two losses.<br />
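Putting the three losses and the two weighted objectives together, a schematic numpy sketch might look as follows (the specific loss forms here are simplified stand-ins for the paper's exact definitions):<br />

```python
import numpy as np

def recovery_loss(x, x_tilde):
    # Reconstruction loss: how well recover(embed(x)) matches the data.
    return np.mean((x - x_tilde) ** 2)

def supervised_loss(h_next, h_next_pred):
    # Supervised loss: one-step-ahead error of the generator on embeddings
    # of real data.
    return np.mean((h_next - h_next_pred) ** 2)

def unsupervised_loss(d_real, d_fake):
    # Unsupervised (adversarial) loss: binary cross-entropy on the
    # discriminator's real/synthetic outputs.
    eps = 1e-12
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1 - d_fake + eps))

def embedding_objective(l_s, l_r, lam=1.0):
    # Minimized over embedding/recovery parameters (lambda balances terms).
    return lam * l_s + l_r

def generator_objective(l_s, l_u, eta=1.0):
    # Minimized over generator parameters (eta balances terms).
    return eta * l_s + l_u
```

A perfect reconstruction drives the first objective's second term to zero, and a discriminator that outputs 1 on real and 0 on synthetic data drives the adversarial loss toward zero, which makes the role of <math>\lambda</math> and <math>\eta</math> as trade-off weights easy to see.<br />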
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most closely related GAN variants for time-series, RCGAN and C-RNN-GAN. For comparison with autoregressive approaches, they use RNNs trained with T-Forcing and P-Forcing. Performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (samples should cover the full distribution of the real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should serve the same predictive purposes as real data). <br />
<br />
The following methods are used for benchmarking and evaluation.<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
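To sketch how the discriminative score works, here is a toy numpy version (a trivial nearest-centroid classifier stands in for the post-hoc RNN used in the paper, and in-sample accuracy stands in for a proper train/test split):<br />

```python
import numpy as np

def discriminative_score(real, fake):
    # Score = |accuracy - 0.5| of a classifier trained to separate real
    # from synthetic samples; 0 means indistinguishable from real data.
    x = np.concatenate([real, fake])                     # flattened sequences
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    # "Train": per-class centroids. "Test": nearest-centroid labels.
    c_real, c_fake = real.mean(axis=0), fake.mean(axis=0)
    pred = (np.linalg.norm(x - c_real, axis=1)
            < np.linalg.norm(x - c_fake, axis=1))
    acc = np.mean(pred == y.astype(bool))
    return abs(acc - 0.5)
```

Two clearly separated sets give the maximal score of 0.5, while identical sets give 0, matching the intuition that good synthetic data should push this score toward zero.<br />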
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian model defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math> and <math>\mathbf{1}</math> is the all-ones matrix, so <math>\phi</math> controls temporal correlation and <math>\sigma</math> controls correlation across features. Table 1 shows the results of this experiment for the different models. The results clearly show that TimeGAN outperforms the other methods in terms of both discriminative and predictive scores. <br />
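A sampler for this synthetic process can be sketched as follows (reading the noise covariance <math>\sigma \mathbf{1} + (1-\sigma)I</math> with <math>\mathbf{1}</math> as the all-ones matrix; sequence length and dimensionality are arbitrary choices here):<br />

```python
import numpy as np

def sample_ar_gaussian(T, dim, phi, sigma, rng):
    # x_t = phi * x_{t-1} + n,  n ~ N(0, sigma * ones + (1 - sigma) * I).
    # sigma controls cross-dimension correlation of the noise.
    cov = sigma * np.ones((dim, dim)) + (1 - sigma) * np.eye(dim)
    x = np.zeros(dim)
    seq = []
    for _ in range(T):
        x = phi * x + rng.multivariate_normal(np.zeros(dim), cov)
        seq.append(x)
    return np.stack(seq)

rng = np.random.default_rng(0)
data = sample_ar_gaussian(T=100, dim=5, phi=0.8, sigma=0.8, rng=rng)
```

With <math>\phi = 0.8</math> the sampled sequences exhibit strong lag-1 autocorrelation, which is exactly the temporal structure the discriminative and predictive scores probe for.<br />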
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper experiments with different types of time-series data. Using sequences with varying properties, it evaluates TimeGAN's ability to generalize across time-series data. The paper applies the different methods to datasets such as Sines, Stocks, Energy, and Events and compares their performance. Figure 2 shows the t-SNE/PCA visualization for Sines and Stocks, and it is clear from the figure that among all the models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 compares the predictive and discriminative scores of the different methods across datasets. TimeGAN outperforms the other methods on both scores, indicating better quality of the generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce realistic time sequences with differential privacy guarantees.<br />
<br />
== Critique ==<br />
The method introduced in this paper is a genuinely novel one. The idea of enhancing the unsupervised components of a GAN with a supervised element yields significant gains in certain evaluations. The evaluation methods used in this paper, namely t-SNE/PCA analysis (visualization), the discriminative score, and the predictive score, are very appropriate for this sort of analysis, where the focus is on multiple objectives (generative accuracy and conditional dependence) both quantitatively and qualitatively. Other related works <sup>[9]</sup> have also used the same evaluation setup.<br />
<br />
The idea that the synthesized time-series should be useful in terms of its predictive ability is a good one, especially in practice. But since the authors set out to create a model that learns the temporal dynamics of time-series data, an additional metric could have better evaluated whether the underlying temporal relations were actually captured by the model. Some form of temporal correlation analysis would have added to the completeness of the paper.<br />
<br />
The enhancement of the traditional GAN by simply adding an extra loss function to the mix is quite elegant: TimeGAN uses a stepwise supervised loss. The authors have also used very common choices for the various components of the overall TimeGAN network. This leaves a lot of open possibilities, as many direct and indirect variations of TimeGAN, or other architectures inspired by it, can be developed simply by swapping out or re-parameterizing the building blocks.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. SSW, 125, 2016.<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.<br />
<br />
[9] Hao Ni, L. Szpruch, M. Wiese, S. Liao, and Baoren Xiao. Conditional Sig-Wasserstein GANs for Time Series Generation, 2020.</div>
Adversarial Fisher Vectors for Unsupervised Representation Learning
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models: a discriminator and a generator compete with each other to solve a minimax game. According to the original GAN paper, when training is finished and a Nash equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting the discriminator is merely a tool for training the generator. Furthermore, the generator in traditional GANs models the data density implicitly, while some applications require an explicit generative model of the data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GANs. An advantage of this EBM view is that, unlike in the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Based on this observation, the authors of this paper show that an energy-based model can be trained using a minimax formulation similar to that of GANs. After training the energy-based model, they use the Fisher Score and Fisher Information (calculated from the derivatives of the generative model w.r.t. its parameters) to evaluate the power of the discriminator in modeling the data distribution. More precisely, they calculate normalized Fisher Vectors and a Fisher Distance measure from the discriminator's derivatives to estimate similarities both between individual data samples and between sets of samples. They name the derived representations Adversarial Fisher Vectors (AFVs). The Fisher vector is a powerful representation that can be calculated from this EBM thanks to the fact that its discriminator is an explicit density model of the data. Fisher vectors can be used for set representation problems, which are challenging in general: as we will see, the Fisher kernel can be used to calculate the distance between two sets of images, which is not a trivial task. The authors find several applications and attractive characteristics of AFVs as pre-trained features, such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved GAN training through monitoring (via AFV metrics) and stabilization (via MCMC-style generator updates), which is difficult in general.<br />
* Using AFV to estimate the distance between sets, which allows monitoring of the training process: the Fisher Distance between the set of validation examples and the set of generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})] + E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
where <math> p_{data(\mathbf{x})} </math>, <math> D </math>, and <math> G </math> are the data distribution, the discriminator, and the generator respectively. To solve this problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math> is updated one step given <math> D </math>.<br />
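Numerically, this objective can be sketched as follows (a toy evaluation of the value function, assuming the discriminator outputs probabilities in <math>(0,1)</math>):<br />

```python
import numpy as np

def gan_value(d_real, d_fake):
    # E_x[-log D(x)] + E_z[-log(1 - D(G(z)))]: the discriminator drives
    # this down by scoring real data near 1 and generated data near 0;
    # the generator drives it up by pushing D(G(z)) toward 1.
    return np.mean(-np.log(d_real)) + np.mean(-np.log(1.0 - d_fake))
```

At the equilibrium described in the introduction, where the discriminator outputs 0.5 everywhere, the value is exactly <math>2\log 2</math>; a discriminator that separates the two distributions well attains a lower value.<br />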
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model that defines a density over data through an energy function: configurations with low energy are assigned high probability. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log likelihood (NLL) under <math> p_{E}(\mathbf{x}) </math> can be written and lower-bounded as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(\mathbf{x}) </math> is an auxiliary distribution called the variational distribution and <math>H(q) </math> is its entropy. Jensen’s inequality was used to obtain the variational lower bound on the NLL. This bound is tight if <math> q(\mathbf{x}) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x} </math>, which means <math> q(\mathbf{x}) = p_{E}(\mathbf{x}) </math>. In this case, if we set <math> D(\mathbf{x})= e^{-E(\mathbf{x})} </math> (so that <math> E(\mathbf{x}) = -\log D(\mathbf{x}) </math>) and <math> q(\mathbf{x})= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} turns into the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where in this problem the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model is then updated one step to decrease the NLL with the optimal <math> p_{G}</math> (see Figure 1). [[File:Fig1.png|centre]]<br />
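The tightness claim behind this bound is easy to check numerically on a discrete toy example (8 states with random energies; all choices here are illustrative): the bound <math>-E_q[E(\mathbf{x})]+H(q)</math> never exceeds the true log partition function <math>\log \sum_\mathbf{x} e^{-E(\mathbf{x})}</math>, and matches it exactly when <math>q \propto e^{-E}</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=8)                 # energies over 8 discrete states

log_Z = np.log(np.sum(np.exp(-E)))     # exact log partition function

def lower_bound(q):
    # Variational lower bound: -E_q[E(x)] + H(q).
    entropy = -np.sum(q * np.log(q))
    return -np.sum(q * E) + entropy

q_uniform = np.full(8, 1 / 8)          # an arbitrary variational q
q_optimal = np.exp(-E) / np.sum(np.exp(-E))   # q proportional to e^{-E}
```

Maximizing the bound over <math>q</math> therefore recovers the model distribution itself, which is exactly the role the generator plays in the minimax game above.<br />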
<br />
Equations \ref{3} and \ref{1} are similar in that both take the form of a minimax game between <math> D </math> and <math> G </math>. However, there are three major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As mentioned, one of the most important advantages of an EBM GAN compared with a traditional one is that the discriminator is a dual form of the generator. This means that the discriminator defines a distribution that matches the training data. There is generally a straightforward way to evaluate the quality of the generator: inspect the quality of the produced samples. However, it is not clear how to evaluate or use a discriminator trained in a minimax scheme. To evaluate and employ the discriminator of the GAN, the authors propose to use the theory of Fisher Information, which was developed to connect the two main types of models used in machine learning, generative and discriminative models. Given a density model <math> p_{\theta}(\mathbf{x})</math>, where <math> \mathbf{x} \in R^d </math> is the input and <math> \theta </math> the model parameters, the Fisher Score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps the example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the model manifold. Intuitively, <math> U_\mathbf{x} </math> gives the direction of steepest ascent in <math> \log p_{\theta}(\mathbf{x}) </math> for the example <math> \mathbf{x} </math> along the manifold; in other words, the Fisher Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The Fisher Information is defined as <math> I=E_{\mathbf{x} \sim p_{\theta}(\mathbf{x})} [U_\mathbf{x} U_\mathbf{x}^T]</math>. With the Fisher Information and Fisher Score, one can map an example <math> \mathbf{x} </math> from feature space to model space and measure the proximity between two examples <math> \mathbf{x} </math> and <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity, <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>, is called the Fisher Distance and easily generalizes to measure the distance between two sets. Finally, the Adversarial Fisher Vector (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, the Fisher Distance is equivalent to the Euclidean distance between AFVs. Classically, Fisher vector theory has been applied with simple generative models such as GMMs.<br />
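Before moving to EBMs, this machinery is easy to verify on a toy model (a univariate <math>N(\mu, 1)</math>, chosen purely for illustration), where the Fisher Score and Fisher Distance have closed forms:<br />

```python
import numpy as np

# For p(x) = N(x; mu, 1): the Fisher Score w.r.t. mu is
# U_x = d/dmu log p(x) = x - mu, and the Fisher Information is
# I = E[U_x^2] = 1, so the normalized vector V_x = I^{-1/2} U_x = x - mu.

def fisher_score(x, mu):
    return x - mu

def fisher_distance(x, y, mu, info=1.0):
    # (U_x - U_y)^T I^{-1} (U_x - U_y), scalar case.
    u_x, u_y = fisher_score(x, mu), fisher_score(y, mu)
    return (u_x - u_y) ** 2 / info
```

Here the distance depends only on <math>x - y</math>, matching the statement that the Fisher Distance is a Euclidean distance in the whitened gradient space.<br />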
In the domain of EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{D(\mathbf{x};\theta)}}{\int_{\mathbf{x}} e^{D(\mathbf{x};\theta)} \,d\mathbf{x}} </math> and <math> \theta </math> are the parameters of <math> D</math>, the Fisher Score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN the generator is updated during training to match the distribution <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq.\ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Score and Fisher Information in an EBM GAN as follows:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that the Fisher Score transforms data from feature space to parameter space, which means that the dimensionality of the vectors can easily reach millions. Replacing <math> I </math> with <math>\mbox{diag}(I) </math> is therefore an approximation that reduces the computational cost of calculating the final AFV.<br />
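The whole pipeline of Eq. (5) and Eq. (6) can be sketched in numpy with a stand-in energy model (a linear <math>D(\mathbf{x};\theta)=\theta^T\mathbf{x}</math>, whose gradient w.r.t. <math>\theta</math> is just <math>\mathbf{x}</math>; this choice, and the Gaussian stand-in for generator samples, are purely illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_gen = 4, 1000
gen_samples = rng.normal(size=(n_gen, dim))   # stand-in for G(z) samples

def grad_D(x):
    # Gradient of D(x; theta) = theta . x w.r.t. theta is simply x.
    return x

# Second term of Eq. (5): E_z[grad_theta D(G(z))], estimated by sampling G.
baseline = grad_D(gen_samples).mean(axis=0)

def fisher_score(x):
    return grad_D(x) - baseline               # Eq. (5)

# Diagonal Fisher Information, also estimated from generator samples.
U_gen = fisher_score(gen_samples)
diag_I = np.mean(U_gen ** 2, axis=0)

def afv(x):
    return fisher_score(x) / np.sqrt(diag_I)  # Eq. (6)

v = afv(rng.normal(size=dim))
```

Note that both expectations are taken over generator samples only, which is what makes AFVs computable from an unnormalized density model.<br />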
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they are used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest, as it utilizes the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients); it has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G approximate the true model distribution.<br />
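For reference, the stochastic gradient MCMC (Langevin) update described here looks as follows on a toy target (a standard normal, so <math>\nabla_x \log p(x) = -x</math>; the step size and chain length are arbitrary choices):<br />

```python
import numpy as np

# Langevin update: x <- x + (eps/2) * grad log p(x) + sqrt(eps) * noise.
rng = np.random.default_rng(0)
eps = 0.1
x = 5.0                                   # start far from the mode
chain = []
for _ in range(5000):
    x = x + 0.5 * eps * (-x) + np.sqrt(eps) * rng.normal()
    chain.append(x)

samples = np.array(chain[1000:])          # discard burn-in
```

After burn-in, the chain's samples approximately follow the target density, which is the behavior the generator is trained to mimic in place of running the chain explicitly.<br />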
<br />
== Related Work ==<br />
There are many GAN variants that use a discriminator as a critic to differentiate between given distributions, for example Wasserstein GAN, f-GAN, and MMD-GAN. There is a resemblance between the training procedure of a GAN and that of a deep EBM (with variational inference), but the work presented in this paper is different in that its discriminator directly learns the target distribution, with the generator serving as a directly learned, parameterized sampler. In some works, regularization (noise addition, gradient penalties, spectral normalization) has been introduced to make GAN training more stable, but these additions lack formal justification. This paper connects the MCMC-based G update rule with the gradient-penalty line of work. The following equations show that this method does not always sample from the generator: a small proportion of the samples (with probability p) come from real examples.<br />
<br />
<div align="center">[[File:related_work_equations.png]]</div><br />
<br />
Early works showed incorporation of Fisher Information to measure similarity and this was extended to use Fisher Vector representations in case of images. Recently, Fisher Information has been used for meta learning as well. This paper explores the possibility of using Fisher Information in deep learning generative models. By utilizing the generator as a sampler, Fisher Information can be computed even from an un-normalized density model.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As pointed out above, the main advantage of EBM GANs is their powerful discriminator, which can learn a density function that characterizes the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors propose to use Fisher Information theory: they trained models under different settings, employed the discriminator to extract AFVs, and then used these vectors for an unsupervised-pretraining classification task.<br />
Results in Table 1 suggest that AFVs achieve state-of-the-art performance among unsupervised pretraining methods and are comparable with supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
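The unsupervised-pretraining evaluation amounts to freezing the trained GAN, extracting AFVs, and fitting a linear classifier on top. A hedged numpy sketch with stand-in features (random class-shifted vectors, not real AFVs):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "AFV" features: two classes, linearly separable in a
# high-dimensional feature space (real AFVs come from the trained GAN).
n, d = 200, 50
labels = rng.integers(0, 2, size=n)
features = rng.standard_normal((n, d)) + 3.0 * labels[:, None]

# Linear probe: least-squares fit to one-hot targets on the frozen features.
onehot = np.eye(2)[labels]
X = np.hstack([features, np.ones((n, 1))])      # add a bias column
W, *_ = np.linalg.lstsq(X, onehot, rcond=None)

preds = (X @ W).argmax(axis=1)
accuracy = (preds == labels).mean()
```

The classifier never updates the feature extractor, so accuracy here measures how much class information the frozen representation carries.<br />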
<br />
AFVs can also be used to measure the distance between sets of data points. The authors took advantage of this to calculate the semantic distance between classes (all data points of each class) in CIFAR-10. As shown in Figure 2, although training was unsupervised, the semantic relations between classes are well estimated: for example, cars are similar to trucks and dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
Because AFVs transform data from feature space to the parameter space of the generative model, and thus carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, the authors ran experiments examining the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They used the AFV representation to calculate distances between image patches and compared with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on the 2AFC and Just Noticeable Difference (JND) metrics, training a GAN on ImageNet and then calculating AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV alongside a variety of existing benchmarks. AFV clearly exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting property of AFVs is their robustness to overfitting. AFV dimensionalities are 3 orders of magnitude higher than those of existing methods, which would typically bring a higher propensity to overfit. However, AFVs still show strong generalization, demonstrating that they indeed encode a meaningful low-dimensional subspace of the original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet and thus have limited applicability to drastically different datasets. In this paper, the authors use the Fisher Distance between sets of real and generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over {1000, 5000, 25000, 50000}. Figure 3 shows batch-wise estimates of the Inception Score and the "Fisher Similarity". It is clear that for larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to the Inception Score. Conversely, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFVs can only be applied if the generator approximates the EBM during training. The model is trained on ImageNet at 64×64 resolution, with the default architecture modified by adding residual blocks to the discriminator and generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. They also showed that in an EBM GAN, the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with a low error rate. This is in contrast to regular GANs, where the discriminator is reduced to a constant function once the Nash equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that this monitoring can be used to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest-neighbour classification, achieving state-of-the-art results among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator contains useful information for fine-grained perceptual similarity, suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of EBM GANs coupled with the Fisher Information framework for extracting useful representational features. <br />
As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the defined formulas and of why this representation performs well in classification tasks. An "Analysis" section would therefore make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International Journal of Computer Vision 105.3 (2013): 222-245.</div>
<hr />
<div>== BERTScore: Evaluating Text Generation with BERT ==<br />
== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used state-of-the-art metrics use either n-gram approaches or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, on the other hand, calculates the similarity using contextual embeddings, and addresses two common pitfalls of n-gram-based metrics. First, n-gram models fail to robustly match paraphrases, which leads to performance underestimation when semantically correct phrases are penalized for differing from the surface form of the reference; in BERTScore, similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Second, n-gram models fail to capture distant dependencies and penalize semantically critical ordering changes; in contrast, contextualized embeddings capture distant dependencies and ordering effectively. The authors carried out various experiments in machine translation and image captioning to show why BERTScore is more reliable and robust than previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness invoked by a bag of words (BoW) representation of text due to the high dimensional vocabularies. Both methods create embeddings of a dimensionality much lower than sparse BoW and aim to capture semantics and context. Word embeddings differ in that they will be deterministic as when given a word embedding model will always produce the same embedding, regardless of the surrounding words. However, contextual embeddings will create different embeddings for a word depending on the surrounding words in the given text.<br />
<br />
== Previous Work ==<br />
Previous Approaches for evaluating text generation can be broadly divided into various categories. The commonly used techniques for text evaluation are based on n-gram matching. The main objective here is to compare the n-grams in reference and candidate sentences and thus analyze the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further, BLEU is generally calculated for multiple <math>n</math>-grams and the results averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
METEOR (Banerjee & Lavie, 2005) computes Exact- <math> P_1 </math> and Exact- <math> R_1 </math> with the modification that when the exact unigram matching is not possible, matching to word stems, synonyms, and paraphrases are used instead. For example, ''running'' may be matched with ''run'' if no exact match was found. This non-exact matching is done using external tools such as a paraphrase table. In newer versions of METEOR, an external paraphrase resource is used and different weights are assigned to different matching types. <br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
<div align="center">Exact- <math> P_n = \frac{\sum_{w \in S^{n}_{ \hat{x} }} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} </math> </div><br />
<br />
<div align="center">Exact- <math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} </math> </div><br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
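These definitions can be sketched directly in code; the example reuses the reference/candidate sentences from the paper's motivating example (this is an illustrative implementation, not the official BLEU/METEOR code):<br />

```python
def ngrams(tokens, n):
    # All contiguous token n-grams of a sentence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_precision(reference, candidate, n=1):
    # Exact-P_n: fraction of candidate n-grams that appear in the reference.
    ref = set(ngrams(reference, n))
    cand = ngrams(candidate, n)
    return sum(g in ref for g in cand) / len(cand)

def exact_recall(reference, candidate, n=1):
    # Exact-R_n: fraction of reference n-grams that appear in the candidate.
    cand = set(ngrams(candidate, n))
    ref = ngrams(reference, n)
    return sum(g in cand for g in ref) / len(ref)

ref = "people like foreign cars".split()
c1 = "people like visiting places abroad".split()
c2 = "consumers prefer imported cars".split()

exact_precision(ref, c1)   # -> 0.4  ("people" and "like" match)
exact_precision(ref, c2)   # -> 0.25 (only "cars" matches)
```

Note how exact unigram matching scores the semantically wrong candidate 1 higher than the correct paraphrase, candidate 2.<br />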
<br />
Other categories include Edit-distance-based Metrics, Embedding-based metrics, and Learned Metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. It helps in detecting semantically correct paraphrased sentences. It also captures cause-and-effect relationships (A gives B versus B gives A) that the BLEU score does not detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle </math>. <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. Word embedding techniques inspire this but in contrast to word embeddings, the contextual embedding of a word depends upon the surrounding words in the sentence. These contextual embeddings are calculated using BERT and other similar models which utilize self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Since pre-normalized vectors are used, the pairwise similarity is given by <math> x_{i}^\top \hat{x}_{j} </math>.<br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
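Given token embeddings, the greedy matching and F1 combination can be sketched as follows; the embeddings here are random stand-ins (real BERTScore uses BERT contextual embeddings):<br />

```python
import numpy as np

def bert_score(ref_emb, cand_emb):
    # ref_emb: (k, d) reference token embeddings; cand_emb: (l, d) candidate.
    # Rows are L2-normalized so dot products are cosine similarities.
    ref_emb = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand_emb = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref_emb @ cand_emb.T                  # (k, l) pairwise cosine similarity
    recall = sim.max(axis=1).mean()             # each reference token -> best candidate match
    precision = sim.max(axis=0).mean()          # each candidate token -> best reference match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 8))               # 4 tokens, 8-dim embeddings
p, r, f = bert_score(ref, ref.copy())           # identical sentences -> perfect score
```

Each `max` implements the greedy, isolated matching described above: every token picks its single best match, independently of the other tokens.<br />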
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, Inverse Document Frequency (idf) can be used with the above equations of the BERTScore. This is optional and depending on the domain of the text and the available data it may or may not benefit the final results. Thus understanding more about Importance Weighing is an open area of research.<br />
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to increase the human readability of the score. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A value <math>b</math> computed using Common Crawl monolingual datasets is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
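The rescaling itself is a one-line linear map; the baseline value below is a placeholder, not the published Common Crawl estimate:<br />

```python
def rescale(score, baseline):
    # Map scores from [baseline, 1] onto [0, 1]; purely cosmetic,
    # the ranking of candidates is unchanged.
    return (score - baseline) / (1.0 - baseline)

b = 0.85                      # placeholder baseline value
rescale(1.0, b)               # -> 1.0 (a perfect score stays perfect)
rescale(b, b)                 # -> 0.0 (the baseline maps to zero)
```
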
<br />
== Experiment & Results ==<br />
The authors have experimented with different pre-trained contextual embedding models like BERT, RoBERTa, etc, and reported the best performing model results. The evaluation has been done on Machine Translation and Image Captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments: segment-level and system-level. The former assigns a score to each reference-candidate pair, while the latter associates a single score with the whole system. Segment-level outputs for BERTScore are calculated as explained in the architecture section, and system-level outputs are obtained by averaging BERTScore over all reference-candidate pairs. The absolute Pearson correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used to measure metric quality, with the Williams test <sup> [1] </sup> for the significance of <math> \lvert \rho \rvert </math> and the Graham & Baldwin <sup> [2] </sup> method for bootstrap resampling of <math> \tau </math>. The authors also created hybrid systems by randomly sampling one candidate sentence for each reference sentence from one of the systems, which increases the number of systems for system-level experiments. Further, they randomly selected 100 systems out of the 10k hybrid systems and ranked them using automatic metrics; repeating this process multiple times yields Hits@1, the percentage of the time the metric ranking agrees with the human ranking on the best system. <br />
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, we can see that BERTScore is consistently a top performer. It also gives a large improvement over the current state-of-the-art BLEU score. In to-English translation, RUSE shows competitive results but it is a learned metric technique and requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For Image Captioning, human judgment for 12 submission entries from the COCO 2015 Captioning Challenge is used. As per Cui et al. (2018) <sup> [3] </sup>, Pearson Correlation with two System-Level metrics is calculated. The metrics are the percentage of captions better or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). There are approximately 5 reference captions and the BERTScore is taken to be the maximum of all the BERTScores individually with each reference caption. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' The time taken to calculate BERTScore is not significantly higher than for BLEU. For example, on the same hardware, the Machine Translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Both times are small in absolute terms, so the difference is marginal.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. Conversely, BERTScore performs competitively on PAWS, which suggests BERTScore is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric BERTScore is proposed which outperforms the previous approaches because of its capacity to use contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches. This is shown by the experiments carried on the datasets consisting of paraphrased sentences. There are variants of BERTScore depending upon the contextual embedding model, use of importance weighting, and the evaluation metric (Precision, Recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings; the remaining architecture is straightforward in itself. Some word embedding models use more complex metrics for calculating similarity; using those metrics together with contextual embeddings instead of word embeddings might yield even more reliable performance than BERTScore.<br />
<br />
<br />
The paper is quite interesting, but the proposed approach lacks technical novelty: it is a natural application of BERT combined with traditional cosine-similarity measures, precision/recall/F1-based computations, and simple IDF-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>
<hr />
<div>== Orthogonal Gradient Descent for Continual Learning ==<br />
== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can't learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN on only one label of CIFAR-10 at a time results in poor performance on the initially trained labels (catastrophic forgetting), but that same CNN will perform well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real-world applications of machine learning. For example, a medical imaging classifier might classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases - local, rare, or new diseases (like Covid-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the task is presented all at once and iid. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> iid from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, thus resulting in the failure of SGD. OGD considers previously learned tasks by maintaining a space of previous gradients, such that incoming gradients can be projected onto an orthogonal basis of that space - minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject that can help one get acquainted with it ([4], [9], [10], for example). Previous work in continual learning can be summarized into three broad categories. First, there are expansion-based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations; one downside of this approach is that the model grows with the number of tasks. Second, there are regularization-based methods, which constrain weight updates according to some importance measure for previous tasks. Finally, there are repetition (rehearsal) based methods, which attempt to artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning; this can be done using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight to OGD is leveraging the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly found gradients onto an orthogonal basis for the space of previously optimal gradients. Such an orthogonal basis will exist because neural networks are typically overparameterized. Note that moving along the gradient direction results in the biggest change for parameter update, whereas moving orthogonal to the gradient results in the least change, which effectively prevents the predictions of the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient with respect to each logit (OGD-ALL), since the idea is to project new gradients onto a space that minimally impacts the previous task across all logits. The authors also ran experiments where only the gradient with respect to the ground-truth logit is kept (OGD-GTL), or the logits are averaged (OGD-AVE). OGD-ALL stores N*C gradient vectors, where N is the size of the previous task and C is the number of classes; OGD-AVE and OGD-GTL store only N gradient vectors, since the class logits are averaged or ignored, respectively. To further manage memory, the authors sample from the gradients of the old task, finding that storing 200 is sufficient, with diminishing returns beyond that.<br />
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt (or more numerically stable equivalent) iterative method. One such algorithm which can be utilized to improve numerical stability is the modified Gram-Schmidt Orthogonalisation. The issue with the simpler Gram-Schmidt algorithm can be seen in the following:<br />
<br />
Suppose we have a matrix <math>A</math> to be decomposed as <math>A=\hat{Q}\hat{R}</math> using the Gram-Schmidt algorithm. During the algorithm, the columns of <math>\hat{Q}</math> are solved sequentially, where <math>\hat{\vec{q}}_j</math> is the <math>j^{th}</math> column of <math>\hat{Q}</math>, and the entries <math>\hat{r}_{ij}</math> of <math>\hat{R}</math> are solved from left to right and top to bottom so that <math>\hat{R}</math> is upper triangular. Consider calculating the third column of <math>\hat{Q}</math> (up to normalization): <math>\hat{\vec{q}}_{3}=\vec{a}_3 - (\hat{\vec{q}}_1 \cdot \vec{a}_3)\hat{\vec{q}}_1 - (\hat{\vec{q}}_2 \cdot \vec{a}_3)\hat{\vec{q}}_2</math>. The partial result <math> \vec{z}_3=\vec{a}_3 - (\hat{\vec{q}}_1 \cdot \vec{a}_3)\hat{\vec{q}}_1 </math> should have no component in the direction <math> \hat{\vec{q}}_1</math>; however, due to floating-point rounding and catastrophic cancellation [11], this is not always true, and <math>\vec{z}_3</math> ends up with a component in this direction, leading to a loss of orthogonality in the columns of <math>\hat{Q}</math>. To remedy this, the modified Gram-Schmidt algorithm replaces <math>\vec{a}_3</math> with <math>\vec{z}_3</math> in the term <math>(\hat{\vec{q}}_2 \cdot \vec{a}_3)\hat{\vec{q}}_2</math>. This ensures that each subsequent projection is computed against the vector that already carries the loss of significance, which helps preserve the orthogonality of the columns of <math>\hat{Q}</math>.<br />
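A numpy sketch of the modified Gram-Schmidt procedure just described, where each subsequent projection uses the running residual rather than the original column:<br />

```python
import numpy as np

def modified_gram_schmidt(A):
    # Orthonormalize the columns of A. At step j the *current residual* v
    # is projected against each earlier q_i (not the original column),
    # which limits the loss of orthogonality from rounding errors.
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            v -= (Q[:, i] @ v) * Q[:, i]   # subtract projection of the residual onto q_i
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
Q = modified_gram_schmidt(A)               # Q.T @ Q is the 3x3 identity (to rounding error)
```
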
<br />
<br />
<br />
<br />
Algorithm 1 shows the precise algorithm for OGD.<br />
<br />
[[File:C--Users-p2torabi-Desktop-OGD.png|centre]]<br />
<br />
And perhaps the easiest way to understand this is pictorially. Here, Task A is the previously learned task and task B is the incoming task. The neural network <math>f</math> has parameters <math>w</math> and is indexed by the <math>j</math>th logit.<br />
<br />
[[File:Pictoral_OGD.PNG|500px|centre]]<br />
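A hedged numpy sketch of the core OGD mechanics: orthonormalized task-A gradients are stored, and each new task-B gradient is projected onto their orthogonal complement before the parameter update (variable names are illustrative, not from the paper's code):<br />

```python
import numpy as np

def ogd_project(g, basis):
    # Remove from g its components along each stored orthonormal direction,
    # so a step along the result minimally perturbs earlier tasks.
    for v in basis:
        g = g - (v @ g) * v
    return g

def ogd_store(g, basis, eps=1e-10):
    # Orthonormalize a new task gradient against the stored basis
    # (one Gram-Schmidt step) before adding it.
    for v in basis:
        g = g - (v @ g) * v
    norm = np.linalg.norm(g)
    if norm > eps:
        basis.append(g / norm)

rng = np.random.default_rng(0)
basis = []
ogd_store(rng.standard_normal(10), basis)   # gradient from task A
g_new = rng.standard_normal(10)             # incoming gradient for task B
g_proj = ogd_project(g_new, basis)
# g_proj is orthogonal to every stored direction: by the overparameterization
# argument above, a small step along it barely changes the loss on task A.
```
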
<br />
== Results ==<br />
Each task was trained for 5 epochs, with tasks derived from the MNIST dataset. The network is a three-layer MLP with 100 hidden units in each of two layers and 10 logit outputs. The results of OGD-AVE, OGD-GTL, and OGD-ALL are compared to SGD, EWC [2] (a regularization method using Fisher information for importance weights), A-GEM [3] (a state-of-the-art replay technique), and MTL (a ground-truth "cheat" model that has access to all data throughout training). The experiments were performed on three continual learning benchmarks: permuted MNIST, rotated MNIST, and split MNIST. <br />
<br />
In permuted MNIST [1], there are five tasks, where each task applies a fixed permutation to every MNIST digit. The figure below compares the accuracy of the different methods across 3 of the tasks. Training is done for 15 epochs (5 for each of the three permutations); the switches in permutation are indicated in the graph with vertical lines.<br />
<br />
[[File:PMNIST_perf.PNG|centre]]<br />
<br />
The following tables show classification performance for each task after sequentially training on all the tasks. Thus, if catastrophic forgetting has been solved, the accuracies should be constant across tasks; if not, there should be a significant decrease from task 5 through to task 1.<br />
<br />
[[File:PMNIST.PNG|centre]]<br />
<br />
Rotated MNIST is similar, except that fixed rotations replace the fixed permutations. There are five sequential tasks, with MNIST images rotated by 0, 10, 20, 30, and 40 degrees in each task. The following figure shows the accuracies of the different methods when trained on rotated MNIST with different degrees of rotation. Each method is trained for 10 epochs (5 on standard MNIST and 5 on rotated MNIST) and predictions are made on the original MNIST. Each accuracy bar is a mean over 10 runs.<br />
<br />
[[File:RMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:RMNIST.PNG|centre]]<br />
<br />
Split MNIST defines 5 tasks with mutually disjoint labels [4]. The following figure shows the accuracies of different methods when trained on Split MNIST.<br />
<br />
[[File:SMNIST_perf.PNG|centre]]<br />
<br />
The following table shows the classification performance for each sequential task.<br />
<br />
[[File:SMNIST.PNG|centre]]<br />
<br />
The table below shows the performance on Rotated MNIST and Permuted MNIST as a function of the number of gradients stored.<br />
<br />
[[File:ogd.png|centre]]<br />
<br />
Overall, OGD performs much better than EWC, A-GEM, and SGD. The primary metric to look for is decreasing performance on the earlier tasks. As we can see, MTL, which represents the ideal simultaneous learning scenario, shows no drop-off across tasks since all the data from previous tasks is available when training incoming tasks. For OGD, we see a decrease, but it is not nearly as severe as with naive SGD. OGD performs much better than EWC and slightly better than A-GEM.<br />
<br />
== Review ==<br />
This work presents an interesting and intuitive algorithm for continual learning. It is theoretically well-founded and shows higher performance than competing algorithms. One of the downsides is that the learning rate must be kept very small in order to respect the assumption that steps orthogonal to previous gradients do not affect the loss. Furthermore, this algorithm requires maintaining a set of gradients which grows with the number of tasks. The authors mention several directions for future studies based on this technique. First, finding a way to store more gradients, or to prioritize the important directions, could improve results. Second, all the proposed methods, including this one, fail when the tasks are dissimilar; finding ways to maintain performance under task dissimilarity is an interesting research direction. Third, addressing the learning-rate sensitivity would make this method more appealing when a large learning rate is desired. Finally, another interesting direction for future work is extending the current method to other types of optimizers such as Adam and Adagrad, or even to second-order or quasi-Newton methods.<br />
<br />
One interesting way to increase the learning rate would be to consider the gradient magnitude of the parameters on data from the former task. If, for some specific parameters, the gradient magnitude on task A's data is low, this intuitively suggests that they have not captured much information from task A. With this in mind, the learning rate for updating these weights could at least be increased so that they can be used for task B.<br />
<br />
A valuable resource for continual learning is the following GitHub page: [https://github.com/optimass/continual_learning_papers/blob/master/README.md#hybrid-methods link continual_learning_papers]<br />
<br />
== Critique == <br />
The authors proposed an interesting idea for mitigating the catastrophic forgetting that is likely to occur in the online learning setting. Although Orthogonal Gradient Descent achieves state-of-the-art results in practice for continual learning, the authors do not provide a theoretical guarantee. Bennani and Sugiyama [12] derived the first generalization guarantees for the OGD algorithm for continual learning with overparameterized neural networks. They also showed that OGD is only robust to catastrophic forgetting across a single task, and for an arbitrary number of tasks they propose OGD+.<br />
<br />
== References ==<br />
[1] Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211<br />
<br />
[2] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526.<br />
<br />
[3] Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2018). Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.<br />
<br />
[4] Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3987–3995. JMLR<br />
<br />
[5] Azizan, N. and Hassibi, B. (2018). Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. arXiv preprint arXiv:1806.00952<br />
<br />
[6] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.<br />
<br />
[7] Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via overparameterization. arXiv preprint arXiv:1811.03962.<br />
<br />
[8] Azizan, N., Lale, S., and Hassibi, B. (2019). Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization. arXiv preprint arXiv:1906.03830.<br />
<br />
[9] Nagy, D. G., & Orban, G. (2017). Episodic memory for continual model learning. ArXiv, Nips.<br />
<br />
[10] Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2017). Variational continual learning. ArXiv, Vi, 1–18.<br />
<br />
[11] Wikipedia: https://en.wikipedia.org/wiki/Loss_of_significance<br />
<br />
[12] Bennani, Mehdi Abbana, and Masashi Sugiyama. "Generalisation guarantees for continual learning with orthogonal gradient descent." arXiv preprint arXiv:2006.11942 (2020).</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning&diff=48952orthogonal gradient descent for continual learning2020-12-02T23:17:25Z<p>G45sharm: </p>
<hr />
<div>== Authors == <br />
Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li<br />
<br />
== Introduction == <br />
Neural networks suffer from <i>catastrophic forgetting</i>: forgetting previously learned tasks when trained to do new ones. Most neural networks can’t learn tasks sequentially despite having the capacity to learn them simultaneously. For example, training a CNN on only one label of CIFAR10 at a time results in poor performance on the initially trained labels (catastrophic forgetting), but that same CNN performs well if all the labels are trained simultaneously (as is standard). The ability to learn tasks sequentially is called continual learning, and it is crucially important for real-world applications of machine learning. For example, a medical imaging classifier might classify a set of base diseases very well, but its utility is limited if it cannot be adapted to learn novel diseases, such as local, rare, or newly emerging diseases (like COVID-19).<br />
<br />
This work introduces a new learning algorithm called Orthogonal Gradient Descent (OGD) that replaces Stochastic Gradient Descent (SGD). In standard SGD, the optimization takes no care to retain performance on any previously learned tasks, which works well when the data is presented all at once and i.i.d. However, in a continual learning setting, when tasks/labels are presented sequentially, SGD fails to retain performance on earlier tasks. This is because when data is presented simultaneously, our goal is to model the underlying joint data distribution <math>P(X_1,X_2,\ldots, X_n)</math>, and we can sample batches like <math>(X_1,X_2,\ldots, X_m)</math> i.i.d. from this distribution, which is assumed to be "fixed" during training. In continual learning, this distribution typically shifts over time, causing SGD to fail. OGD accounts for previously learned tasks by maintaining a set of their gradients, so that incoming gradients can be projected onto the orthogonal complement of the space those gradients span, minimally impacting previously attained performance.<br />
<br />
== Previous Work == <br />
<br />
Continual learning is not a new concept in machine learning, and there are many previous research articles on the subject that can help one get acquainted with it ([4], [9], [10] for example). These previous works in continual learning can be summarized into three broad categories. There are expansion-based techniques, which add neurons/modules to an existing model to accommodate incoming tasks while leveraging previously learned representations; one downside of this approach is that the model grows with the number of tasks. There are also regularization-based methods, which constrain weight updates according to some importance measure for previous tasks. Finally, there are replay-based methods, which artificially interlace data from previous tasks into the training scheme of incoming tasks, mimicking traditional simultaneous learning; this can be done using memory modules or generative networks.<br />
<br />
== Orthogonal Gradient Descent == <br />
The key insight of OGD is to leverage the overparameterization of neural networks, meaning they have more parameters than data points. In order to learn new things without forgetting old ones, OGD proposes the intuitive notion of projecting newly computed gradients onto the orthogonal complement of the space spanned by previously stored gradients. A nontrivial orthogonal complement will exist because neural networks are typically overparameterized. Note that moving along the gradient direction produces the biggest change per parameter update, whereas moving orthogonal to the gradient produces the least change, which effectively prevents the predictions on the previous task from changing too much. A <i>small</i> step orthogonal to the gradient of a task should result in little change to the loss for that task, owing again to the overparameterization of the network [5, 6, 7, 8]. <br />
<br />
More specifically, OGD keeps track of the gradient of each logit (OGD-ALL), since the idea is to project new gradients onto a space which minimally impacts the previous task across all logits. The authors also run experiments where only the gradient of the ground-truth logit is tracked (OGD-GTL) and where the logits are averaged (OGD-AVE). OGD-ALL stores N*C gradients, where N is the size of the previous task and C is the number of classes; OGD-AVE and OGD-GTL store only N gradients, since the class logits are either averaged or restricted to the ground truth, respectively. To further manage memory, the authors subsample the gradients of the old task, and they find that storing 200 is sufficient, with diminishing returns when using more.<br />
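The core update can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code; `basis` holds orthonormal directions built from the stored previous-task gradients):<br />

```python
import numpy as np

def project_orthogonal(g, basis):
    """Remove from gradient g its components along the stored directions,
    leaving an update that (to first order) preserves previous-task outputs."""
    for u in basis:
        g = g - (u @ g) * u
    return g

def extend_basis(basis, g, eps=1e-10):
    """Gram-Schmidt step: append the normalized component of g that is
    orthogonal to the current basis (skipped if g is already in the span)."""
    r = project_orthogonal(g, basis)
    n = np.linalg.norm(r)
    if n > eps:
        basis.append(r / n)
    return basis
```

After finishing a task, per-example logit gradients are fed through `extend_basis`; while training the next task, each SGD gradient is replaced by `project_orthogonal(g, basis)` before the parameter step.<br />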
<br />
The orthogonal basis for the span of previously attained gradients can be obtained using a simple Gram-Schmidt iterative method, or a more numerically stable equivalent such as the modified Gram-Schmidt orthogonalisation, which orthogonalises each new vector against the already-computed partial result rather than the original column and thereby limits the loss of orthogonality caused by catastrophic cancellation [11].<br />
</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:rel_wor_comparison.png&diff=48951File:rel wor comparison.png2020-12-02T23:12:11Z<p>G45sharm: </p>
<hr />
<div></div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders&diff=48950From Variational to Deterministic Autoencoders2020-12-02T23:11:53Z<p>G45sharm: /* Related Work */</p>
<hr />
<div>== Presented by == <br />
John Landon Edwards<br />
<br />
== Introduction ==<br />
This paper presents a deterministic alternative to stochastic Variational Autoencoders (VAEs) for generative modeling, named Regularized Autoencoders (RAEs). The goal of VAEs is to learn, from a large collection of high-dimensional samples, to draw new samples from the inferred population distribution. RAEs aim to achieve the same goal without the practical drawbacks of VAEs; their advantages over VAEs are that they are simpler and easier to train. The paper investigates how forcing an arbitrary prior <math>p(z) </math> within VAEs could be substituted with a regularization scheme added to the loss function. Furthermore, a generative mechanism for RAEs is proposed, utilizing an ex-post density estimation step that can also be applied to existing VAEs. Finally, the authors conduct an empirical comparison between VAEs and RAEs to demonstrate that the latter are able to generate samples that are comparable or better when applied to the domains of images and structured objects.<br />
<br />
== Motivation ==<br />
The authors point to several drawbacks currently associated with VAE's including:<br />
* the compromise between sample quality and reconstruction quality is poor<br />
* over-regularisation induced by the KL divergence term within the objective [5]<br />
* posterior collapse in conjunction with powerful decoders [1]<br />
* increased variance of gradients caused by approximating expectations through sampling [3][7]<br />
* learned posterior distribution doesn't match the latent assumption [8]<br />
<br />
These issues motivate their consideration of alternatives to the variational framework adopted by VAE's. <br />
<br />
Furthermore, the authors note that the VAE's injection of random noise in the reparameterization <math> z = \mu(x) +\sigma(x)\epsilon </math> has a regularization effect because it promotes the learning of a smoother latent space. This motivates their exploration of regularization schemes within an autoencoder's loss function which could substitute for the VAE's random noise injection. This would allow for the elimination of the variational framework and circumvent its associated drawbacks.<br />
<br />
Due to the deterministic nature of RAEs, there is no prior <math>p(z)</math> to sample from when producing generated samples. The authors solve this problem by fitting a density estimate to the latent space post-training and sampling from it to generate new samples.<br />
<br />
== Related Work ==<br />
<br />
The authors point to similarities between their framework and Wasserstein Autoencoders (WAEs) [5], of which a deterministic version can be trained. However, RAEs utilize a different loss function and differ in their implementation of the ex-post density estimation. Additionally, the authors suggest that Vector Quantized-Variational AutoEncoders (VQ-VAEs) [1] can be viewed as deterministic. VQ-VAEs also adopt ex-post density estimation but implement it through a discrete auto-regressive method. Furthermore, VQ-VAEs utilize a different training loss that is non-differentiable.<br />
<br />
The following figure summarizes the qualitative performance of the various methods in related work. Comparing the sample quality of VAEs, WAEs, 2sVAEs, and RAEs on CelebA, RAEs can be seen to have a slight edge in terms of the sharpness of samples and their reconstructions, with smooth interpolation in the latent space.<br />
<br />
[[File:rel_wor_comparison.png|center]]<br />
<br />
== Framework Architecture ==<br />
=== Overview ===<br />
The Regularized Autoencoder proposes three modifications to the existing VAE framework. Firstly, it eliminates the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z </math>. Secondly, it proposes a redesigned loss function <math>\mathcal{L}_{RAE}</math>. Finally, it proposes an ex-post density estimation procedure for generating samples from the RAE.<br />
<br />
<br />
=== Eliminating Random Noise ===<br />
The authors propose eliminating the injection of random noise <math>\epsilon</math> from the reparameterization of the latent variable <math> z = \mu(x) +\sigma(x)\epsilon </math>, resulting in an encoder <math>E_{\phi} </math> that deterministically maps a data point <math> x </math> to a latent variable <math> z </math>.<br />
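The change can be stated in two lines (illustrative NumPy, not the paper's code; `mu` and `sigma` stand for the encoder's outputs):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_latent(mu, sigma):
    """Stochastic VAE reparameterization: z = mu(x) + sigma(x) * eps."""
    return mu + sigma * rng.standard_normal(mu.shape)

def rae_latent(mu):
    """RAE: the noise injection is dropped; the encoder output is z itself."""
    return mu
```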
<br />
The current variational framework of VAEs enforces regularization on the encoder posterior through the KL-divergence term of its training loss function:<br />
\begin{align}<br />
\mathcal{L}_{ELBO} = -\mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] + \mathbb{KL}(q_{\phi}(z|x) \| p(z))<br />
\end{align}<br />
<br />
In eliminating the random noise within <math>z</math>, the authors suggest substituting the loss's KL-divergence term with a form of explicit regularization. This makes sense because <math>z</math> is no longer a distribution, and <math>p(x|z)</math> would be zero almost everywhere. Moreover, since the KL-divergence term previously enforced regularization on the encoder posterior, it is plausible that an alternative regularization scheme could impact the quality of sampled results. This substitution of the KL-divergence term leads to redesigning the training loss function used by RAEs.<br />
<br />
=== Redesigned Training Loss Function ===<br />
The redesigned loss function <math>\mathcal{L}_{RAE}</math> is defined as:<br />
\begin{align}<br />
\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_Z + \lambda \mathcal{L}_{REG}\\<br />
\end{align}<br />
where <math>\lambda</math> and <math>\beta</math> are hyperparameters.<br />
<br />
The first term <math>\mathcal{L}_{REC}</math> is the reconstruction loss, defined as the mean squared error between input samples and their mean reconstructions <math>\mu_{\theta}</math> by a decoder that is deterministic. In the paper it is formally defined as:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - \mathbf{\mu_{\theta}}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
However, as the decoder <math>D_{\theta}</math> is deterministic the reconstruction loss is equivalent to:<br />
\begin{align}<br />
\mathcal{L}_{REC} = ||\mathbf{x} - D_{\theta}(E_{\phi}(\mathbf{x}))||_2^2<br />
\end{align}<br />
<br />
The second term <math>\mathcal{L}^{RAE}_Z</math> is defined as :<br />
\begin{align}<br />
\mathcal{L}^{RAE}_Z = \frac{1}{2}||\mathbf{z}||_2^2<br />
\end{align}<br />
This is equivalent to constraining the size of the learned latent space, which prevents unbounded optimization.<br />
<br />
The third term <math>\mathcal{L}_{REG}</math> acts as the explicit regularizer to the decoder. The authors consider the following potential formulations for <math>\mathcal{L}_{REG}</math><br />
<br />
;'''Tikhonov regularization'''(Tikhonov & Arsenin, 1977):<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\theta||_2^2<br />
\end{align} <br />
<br />
;''' Gradient Penalty: '''<br />
\begin{align}<br />
\mathcal{L}_{REG} = ||\nabla_{x} D_{\theta}(E_\phi(x)) ||_2^2<br />
\end{align}<br />
<br />
;'''Spectral Normalization:'''<br />
:The authors also consider using Spectral Normalization in place of <math>\mathcal{L}_{REG}</math>, whereby each weight matrix <math>\theta_{\ell}</math> in the decoder network is normalized by an estimate of its largest singular value <math>s(\theta_{\ell})</math>. Formally this is defined as:<br />
\begin{align}<br />
\theta_{\ell}^{SN} = \theta_{\ell} / s(\theta_{\ell})\\<br />
\end{align}<br />
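Putting the three terms together, the full objective can be sketched as follows (a NumPy sketch using the Tikhonov variant; the batch-mean reductions are an assumption for illustration, not taken from the paper):<br />

```python
import numpy as np

def rae_loss(x, z, x_rec, decoder_weights, beta=1e-3, lam=1e-6):
    """L_RAE = L_REC + beta * L_Z + lambda * L_REG with Tikhonov (L2)
    regularization on the decoder parameters.

    Shapes: x and x_rec are (batch, features); z is (batch, latent_dim);
    decoder_weights is a list of the decoder's parameter arrays (theta).
    """
    l_rec = np.sum((x - x_rec) ** 2, axis=1).mean()       # reconstruction error
    l_z = 0.5 * np.sum(z ** 2, axis=1).mean()             # latent norm penalty
    l_reg = sum(np.sum(w ** 2) for w in decoder_weights)  # ||theta||_2^2
    return l_rec + beta * l_z + lam * l_reg
```

The gradient-penalty and spectral-normalization variants swap only the `l_reg` term (or the weight parameterization) while `l_rec` and `l_z` stay the same.<br />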
<br />
=== Ex-Post Density Estimation ===<br />
Recall that since the autoencoder is no longer stochastic, sampling from the latent space to generate new examples becomes a challenge. The authors propose to fit a density estimator <math>q_{\delta}(\mathbf{z})</math> over the latent points of the trained model, <math>\{\mathbf{z}=E_{\phi}(\mathbf{x})|\mathbf{x} \in \chi\} </math>, to solve this problem. They can then sample from the estimated density and decode the samples. The authors note that the choice of density estimator needs to balance a trade-off between expressiveness and simplicity, producing a good fit of the latent points while still allowing generalization to untrained points. It is noteworthy that even in a VAE, where one samples from the prespecified <math>p(z)</math>, the generative mechanism is not perfect either: the posterior <math>q_{\phi}(z)</math> can often depart substantially from <math>p(z)</math>, so the sampled <math>z</math> might fall into regions the decoder has not seen. Intuitively, then, using an estimated density is unlikely to be more compromising than <math>p(z)</math> already is in a VAE.<br />
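As a sketch of the procedure (illustrative NumPy; the paper's experiments fit either a single Gaussian or a 10-component GMM as <math>q_{\delta}(\mathbf{z})</math>), one can fit a full-covariance Gaussian to the encoded training set and sample latent codes from it:<br />

```python
import numpy as np

def fit_gaussian(latents):
    """Ex-post density estimate: a single full-covariance Gaussian fit
    to the encoded training data, latents of shape (n_points, latent_dim)."""
    return latents.mean(axis=0), np.cov(latents, rowvar=False)

def sample_latents(mu, cov, n, seed=None):
    """Draw n latent codes from the fitted density; decoding them
    yields new generated samples."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, cov, size=n)

# After training (pseudocode): x_new = decoder(sample_latents(*fit_gaussian(encoder(x_train)), n))
```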
<br />
== Empirical Evaluations ==<br />
===Image Modeling:===<br />
===== Models Evaluated:=====<br />
The authors evaluate regularization schemes using Tikhonov regularization, gradient penalty, and spectral normalization. These correspond to the models (RAE-L2), (RAE-GP), and (RAE-SN) respectively, as seen in '''figure 1'''. Additionally, they consider a model (RAE) where <math>\mathcal{L}_{REG} </math> is excluded from the loss and a model (AE) where both <math>\mathcal{L}_{REG} </math> and <math>\mathcal{L}^{RAE}_{Z} </math> are excluded from the loss. For a baseline comparison they evaluate a regular Gaussian VAE (VAE), a constant-variance Gaussian VAE (CV-VAE), a Wasserstein Auto-Encoder (WAE) with MMD loss, and a 2-stage VAE [2] (2sVAE).<br />
<br />
==== Metrics of Evaluation: ====<br />
Each model was evaluated on the following metrics:<br />
* '''Rec''': Test-sample reconstruction, where the Fréchet Inception Distance (FID) is computed between a held-out test sample and the network's outputted reconstruction.<br />
* <math>\mathcal{N}</math>: FID calculated between test data and random samples from a single Gaussian that is either <math>p(z)</math> fixed for VAEs and WAEs, a learned second stage VAE for 2sVAEs, or a single Gaussian fit to <math>q_{\delta}(z)</math> for CV-VAEs and RAEs.<br />
*'''GMM:''' FID is calculated between test data and random samples generated by fitting a mixture of 10 Gaussians in the latent space for each of the models.<br />
*'''Interp:''' Mid-point interpolation between random pairs of test reconstructions.<br />
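The Fréchet distance underlying FID compares the Gaussian statistics (mean and covariance) of two sets of feature vectors. A NumPy-only sketch (assuming the Inception-network features are precomputed, which is an external step not shown here):<br />

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)       # clip tiny negatives from rounding
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(feats1, feats2):
    """Frechet distance between Gaussians fit to two feature sets:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2})."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    c1 = np.cov(feats1, rowvar=False)
    c2 = np.cov(feats2, rowvar=False)
    s1 = _sqrtm_psd(c1)
    covmean = _sqrtm_psd(s1 @ c2 @ s1)    # Tr((c1 c2)^{1/2}) via symmetric form
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))
```

Identical feature sets give a distance near zero, and the distance grows with the gap between the two Gaussians' means and covariances.<br />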
<br />
==== Qualitative evaluation for sample quality on MNIST ====<br />
The following figure shows the qualitative evaluation for sample quality for VAEs, WAEs, and RAEs on MNIST. The first figure in the extreme left depicts the reconstructed samples (top row is ground truth) followed by randomly generated samples in the middle and spherical interpolations between two images at the extreme right. <br />
<br />
[[File:Paper4_ImageModeling.png|Paper4_ImageModeling.png|center]]<br />
<br />
These are remarkable results, showing that the lack of an explicitly fixed structure on the latent space of the RAE does not impede interpolation quality. <br />
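The spherical interpolations shown on the right can be computed with the standard slerp formula; a minimal sketch on a pair of latent vectors (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors at fraction t in [0, 1]."""
    omega = np.arccos(np.clip(
        np.dot(z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)), -1.0, 1.0))
    if np.isclose(omega, 0.0):          # (nearly) parallel vectors: fall back to lerp
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)
```

Unlike linear interpolation, slerp between two equal-norm latent codes keeps the interpolant at that same norm, which matters when most latent mass lies on a shell around the origin.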
<br />
==== Results:====<br />
Each model was trained and evaluated on the MNIST, CIFAR, and CELEBA datasets. Their performance across each metric and dataset can be seen in '''figure 1'''. For the GMM metric, on every dataset all regularized RAE variants outperform the baseline models. Furthermore, for <math>\mathcal{N}</math> the regularized RAE variants outperform the baselines on CIFAR and CELEBA. This suggests RAEs can achieve competitive generated-image quality compared to existing VAE architectures.<br />
<br />
[[File:Image Gen Res.png|Image Gen Res.png|center]]<br />
<div align="center">'''Figure 1:''' Image Generation Results </div><br />
<br />
=== Modelling Structured Objects ===<br />
====Overview====<br />
The authors evaluate the RAE's ability to model the complex structured objects of molecules and arithmetic expressions. They adopt the exact architecture and experimental setting of the GrammarVAE (GVAE) [6] and replace its variational framework with that of an RAE with Tikhonov regularization (GRAE).<br />
<br />
==== Metrics of Evaluation ====<br />
In this experiment, they are interested in traversing the learned latent space to generate samples of drug molecules and arithmetic expressions. To evaluate performance on expressions, they consider <math>\log(1 + MSE)</math> between generated expressions and the true data. To evaluate performance on molecules, they use the water-octanol partition coefficient <math>\log(P)</math>, where a higher value corresponds to a generated molecule whose structure is more drug-like. They compare the GRAE's performance on these metrics to those of the GVAE, the constant-variance GVAE (GCVVAE), and the CharacterVAE (CVAE) [4], as seen in '''figure 2'''. Additionally, to assess the behaviour within the latent space, they report the percentage of generated expressions and molecules with valid syntax.<br />
<br />
==== Results ====<br />
Their results, displayed in '''figure 2''', show that the GRAE is competitive in its ability to generate samples of structured objects and even outperforms the other models in average score for generated expressions. It is notable that for molecule generation, although the GRAE ranks second in average score, it produces the highest percentage of syntactically valid molecules.<br />
[[File:complex obj res.png|center]]<br />
<div align="center">'''Figure 2:''' Complex Object Generation Results </div><br />
<br />
== Conclusion ==<br />
The authors provide empirical evidence that deterministic autoencoders are capable of learning a smooth latent space without the requirement of a prior distribution. This allows for the circumvention of drawbacks associated with the variational framework.<br />
By comparing the performance of VAEs and RAEs across the tasks of image and structured-object sample generation, the authors have demonstrated that RAEs are capable of producing comparable or better samples.<br />
<br />
== Critiques ==<br />
There is empirical evidence that the sample quality of RAEs is comparable to that of VAEs. The authors are inconclusive in determining how the different regularization schemes affect the RAE's performance, as there was considerable variation between them across datasets. They do note that they opted for the L2 variant in the structured-objects experiment because it was the simplest to implement.<br />
There is also empirical evidence that ex-post density estimation improves sample quality when applied to existing VAE frameworks, as seen in the image generation experiment; this offers a plausible way to improve existing VAE architectures. My overall impression of the paper is that it provides substantial evidence that a deterministic autoencoder can learn a latent space of comparable or better quality than that of a VAE. Although the authors observe favorable results for their RAE framework, it is still far from conclusive that RAEs will perform better in all data domains. A future comparison I would be interested in seeing is with VQ-VAEs in the domain of sound generation.<br />
<br />
== Repository ==<br />
<br />
The official repository for this paper is available at <span class="plainlinks">[https://github.com/ParthaEth/Regularized_autoencoders-RAE- "official repository"]</span><br />
<br />
== References ==<br />
<br />
<br />
[1] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017<br />
<br />
[2] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019<br />
<br />
[3] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR:low-variance, unbiased gradient estimates for discrete latent variable models. In NeurIPS, 2017<br />
<br />
[4] Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.<br />
<br />
[5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Scholkopf. Wasserstein autoencoders. In ICLR, 2017<br />
<br />
[6] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In ICML, 2017.<br />
<br />
[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.<br />
<br />
[8] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=48949a fair comparison of graph neural networks for graph classification2020-12-02T23:01:36Z<p>G45sharm: /* GIN */</p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
<br />
Experimental reproducibility in machine learning has been known to be an issue for some time. Researchers attempting to reproduce the results of old algorithms have come up short, raising concerns that the lack of reproducibility hurts the quality of the field. The scarcity of open-source AI code has only exacerbated this, leading some to go so far as to say that "AI faces a reproducibility crisis" [1]. It has been argued that the ability to reproduce existing AI code, and making both existing and new code open source, is a key step in lowering the socio-economic barriers to entry into data science and computing. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce <br />
the results from such experiments to tackle the problem of ambiguity in experimental procedures <br />
and the impossibility of reproducing results. They also standardized the experimental environment<br />
so that results can be reproduced within it.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models that take graph-structured data as input and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbours. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/node in the graph. One example of a real application where GNNs are used is social network prediction and recommendation, where the input data is naturally structural.<br />
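The neighbour-aggregation idea can be sketched as a single mean-aggregation layer (an illustrative GCN-style update on a toy graph, not the update rule of any specific model evaluated in the paper):

```python
import numpy as np

def mean_aggregation_layer(A, H, W):
    """One message-passing step: each node's new representation combines its own
    features with its neighbours' (via a self-loop), followed by a linear map
    and a ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize: mean aggregation
    return np.maximum(0.0, A_norm @ H @ W)

# Toy graph: a triangle with 4-dimensional node features mapped to 2 dimensions.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = np.ones((3, 4))
W = np.ones((4, 2))
H_next = mean_aggregation_layer(A, H, W)
```

Stacking such layers lets each node's representation depend on progressively larger neighbourhoods of the graph.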
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and as previously mentioned are comprised of two building blocks, vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can also have a direction associated with them lending the name '''directed graph''' or they can be an '''undirected graph''' if an edge is shared by two vertices and there is no sense of direction. Vertices and edges of a graph can also have weights to them or really any amount of features imaginable. <br />
<br />
Now going one level of abstraction higher, graphs can be categorized by structural patterns; we will refer to these as types of graphs, and this will not be an exhaustive list. A '''Bipartite graph''' (a) is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> such that there do not exist <math> v_i,v_j \in V_k </math>, <math>k=1,2</math>, where <math>v_i</math> and <math>v_j </math> share an edge; however, there exist <math>v_i \in V_1, v_j \in V_2</math> where <math>v_i</math> and <math>v_j </math> share an edge. A '''Path graph''' (b) is a graph where <math>|V| \geq 2</math> and all vertices are connected sequentially, meaning each vertex except the first and last has 2 edges, one coming from the previous vertex and one going to the next. A '''Cycle graph''' (c) is similar to a path graph except that every node has 2 edges and the nodes are connected in a loop: if you start at any vertex and follow each node's edge in one direction, you will eventually return to the starting node. These are just three examples of graph types; in reality there are many more, and it can be beneficial to be able to connect the structure of one's data to an appropriate graph type.<br />
<br />
<gallery mode="packed"><br />
Image:bipartite.png| (a) Bipartite Graph<br />
Image:path.gif| (b) Path Graph<br />
Image:cycle.png| (c) Cycle Graph<br />
</gallery><br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, failing to keep these phases well separated can lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assessment'''<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross-Validation.<br />
As the model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection'''<br />
<br />
The goal of model selection, or hyperparameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set. It is also important to acknowledge selection bias: because the model is chosen from a pool of candidates based on validation accuracy, that validation accuracy is a biased estimate of the test accuracy.<br />
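As a toy illustration of model selection on a held-out validation set, and of why the winning configuration's validation score is optimistically biased (the hyperparameter here is polynomial degree, chosen purely for illustration; it has nothing to do with the GNNs in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=60)
x_tr, y_tr, x_val, y_val = x[:40], y[:40], x[40:], y[40:]

def val_mse(degree):
    """Train on the training split, score on the validation split."""
    coef = np.polyfit(x_tr, y_tr, degree)
    return float(np.mean((np.polyval(coef, x_val) - y_val) ** 2))

candidates = [1, 3, 5, 9]                 # the hyperparameter grid
best = min(candidates, key=val_mse)       # model selection
# Note: val_mse(best) is now an optimistic estimate of generalization error,
# precisely because `best` was chosen to minimize it; an honest estimate
# requires a separate test set (model assessment).
```

This is exactly why the two phases (selection on validation data, assessment on test data) must stay separated.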
<br />
==Overview of Reproducibility Issues==<br />
The paper explores five different GNN models exploring issues with their experimental setup and potential reproducibility. <br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criteria is met or not), (-) indicates lack of information (i.e. no details are provided about the criteria).<br />
<br />
===Issues with DGCNN (Deep Graph Convolutional Neural Network)===<br />
The authors of DGCNN use a faulty method of tuning the learning rate and number of epochs: they used only a single fold for tuning hyperparameters despite evaluating the model with 10-fold CV, which potentially leads to suboptimal performance. They have not released the code for these experiments. Lastly, they average the single-fold results across the 10 folds and report those numbers, which also reduces the reported variance.<br />
<br />
=== Issues with DiffPool === <br />
It is not clearly stated in the paper whether the results come from a test set or a validation set. Moreover, the standard deviation over the 10-fold CV is not reported. Because no random seed is applied, the data splits differ across runs of the multi-fold splitting (which is performed without stratification).<br />
<br />
=== Issues with ECC ===<br />
The paper does not report the standard deviation obtained during the 10-fold cross-validation. As with DGCNN, the model selection procedure is not made clear due to pre-determined hyper-parameters. The code repository is not available either.<br />
<br />
=== Issues with GIN === <br />
Instead of reporting the test accuracy, the authors report the validation accuracy over the 10-fold CV. The given results are therefore not suitable for evaluating the model. The code for model selection is not available.<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in the literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
for the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets except ENZYMES, they follow Ralaivola et al. (2005) and Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs.<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
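The patience-based early stopping in step 4 can be sketched as a simple loop over per-epoch validation scores (illustrative only; the authors' actual training code is in their repository):

```python
def early_stop_epoch(val_scores, patience):
    """Return the epoch of the best validation score, stopping once `patience`
    consecutive epochs pass without improvement over the best score so far."""
    best, best_epoch, wait = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:   # n epochs with no improvement: stop
                break
    return best_epoch

# Validation accuracy peaks at epoch 1; with patience 2, training halts at epoch 3
# and the model from epoch 1 is kept.
stop_at = early_stop_epoch([0.10, 0.20, 0.15, 0.14, 0.13], patience=2)
```

In practice the best model's weights are checkpointed at `best_epoch` and restored after stopping.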
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization Of the Evaluation Framework </div><br />
In order to better understand the Model Selection and Model Assessment phases in the above figure, one can also take a look at the pseudo-code below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research involved a large number of training-testing cycles, they constrained the computational cost as follows:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
===Effect of Node Degree on Layering===<br />
[[File:Paper11_NodeDegree.png]]<br />
<br />
The above table displays the median number of selected layers in relation to the addition of node<br />
degrees as input features on all social datasets. 1 indicates that an uninformative feature is used as<br />
a node label.<br />
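A sketch of how such degree features might be computed from an adjacency matrix (illustrative; the exact featurization used for each social dataset is described in the paper):

```python
import numpy as np

def degree_features(A, one_hot_max=None):
    """Node degrees from an adjacency matrix, either as a scalar feature per node
    or one-hot encoded up to `one_hot_max` (larger degrees are clamped)."""
    deg = A.sum(axis=1).astype(int)
    if one_hot_max is None:
        return deg[:, None].astype(float)        # shape (num_nodes, 1)
    feats = np.zeros((len(deg), one_hot_max + 1))
    feats[np.arange(len(deg)), np.minimum(deg, one_hot_max)] = 1.0
    return feats

# Triangle graph: every node has degree 2.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
```

When nodes carry no labels at all, an uninformative constant feature (the "1" setting in the table above) can be used instead.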
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|900px|Image: 900 pixels]]<br />
<br />
<br />
In the above figure, we can see the comparison between the average test results obtained by the authors of this paper and those reported in the literature. The plots show that the test accuracies calculated in this paper differ in most cases from those reported in the literature, and the gap between the two estimates is often considerable.<br />
== Source Codes ==<br />
The data and scripts to reproduce the experiments reported in the paper are available at https://github.com/diningphil/gnn-comparison .<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found out that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of five important graph neural network models on 9 datasets. Reproducibility and replicability are very important topics for science in general, and even more so for fast-growing fields like machine learning. The authors propose a unified scheme for evaluating reproducibility in graph classification papers. This unified approach can be used in future graph classification papers so that comparisons between proposed methods become clearer. The results of the paper are interesting, as in some cases the baseline methods outperform the other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, it would have been good to discuss in more depth why the baseline models perform better, why the results across different datasets are not consistent, and whether we should choose the best GNN based on the type of data; if so, what are the guidelines?<br />
<br />
Also, as is well known in the GNN literature, GNNs are designed to solve non-Euclidean problems on graph-structured data. These kinds of problems can hardly be handled by general deep learning techniques, and there are different graph designs that handle various mechanisms, e.g. heat diffusion. In my opinion, a better way would be to categorize existing GNN models into spatial and spectral domains and reveal the connections among subcategories in each domain. As the number of GNN models grows, further analysis is needed to establish strong links between the spatial and spectral domains so that the models become more interpretable and transparent in applications.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.<br />
<br />
- Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5165–5175. Curran Associates, Inc., 2018.<br />
<br />
[1] Hutson, M. (2018). Artificial intelligence faces a reproducibility crisis. Science, 359(6377), 725–726.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification&diff=48948a fair comparison of graph neural networks for graph classification2020-12-02T23:01:22Z<p>G45sharm: /* Overview of Reproducibility Issues */</p>
<hr />
<div>== Presented By ==<br />
Jaskirat Singh Bhatia<br />
<br />
==Background==<br />
<br />
Experimental reproducibility in machine learning has been known to be an issue for some time. Researchers attempting to reproduce the results of old algorithms have some up short, raising concerns that lack of reproducibility hurts the quality of the field. Lack of open source AI code has only exacerbated this, leading some to go so far as to say that "AI faces a reproducibility crisis" [1]. It has been argued that the ability to reproduce existing AI code, and making these codes and new ones open source is a key step in lowering the socio-economic barriers of entry into data science and computing. Recently, the graph representation learning<br />
field has attracted the attention of a wide research community, which resulted in<br />
a large stream of works. As such, several Graph Neural Network models have<br />
been developed to effectively tackle graph classification. However, experimental<br />
procedures often lack rigorousness and are hardly reproducible. The authors tried to reproduce <br />
the results from such experiments to tackle the problem of ambiguity in experimental procedures <br />
and the impossibility of reproducing results. They also Standardized the experimental environment <br />
so that the results could be reproduced while using this environment.<br />
<br />
==Graph Neural Networks==<br />
A graph is a data structure consisting of nodes and edges. Graph neural networks are models that take graph-structured data as input and capture information of the input graph, such as relation and interaction between nodes. In graph neural networks, nodes aggregate information from their neighbours. The key idea is to generate representations of nodes depending on the graph structure. <br />
<br />
Graph neural networks can perform various tasks and have been used in many applications. Some simple and typical tasks include classifying the input graph or finding a missing edge/ node in the graph. One example of real applications where GNNs are used is social network prediction and recommendation, where the input data is naturally structural.<br />
<br />
====Graph basics====<br />
<br />
Graphs come from discrete mathematics and as previously mentioned are comprised of two building blocks, vertices (nodes), <math>v_i \in V</math>, and edges, <math>e_j \in E</math>. The edges in a graph can also have a direction associated with them lending the name '''directed graph''' or they can be an '''undirected graph''' if an edge is shared by two vertices and there is no sense of direction. Vertices and edges of a graph can also have weights to them or really any amount of features imaginable. <br />
<br />
Now going one level of abstraction higher graphs can be categorized by structural patterns, we will refer to these as the types of graphs and this will not be an exhaustive list. A '''Bipartite graph''' (a) is one in which there are two sets of vertices <math>V_1</math> and <math>V_2</math> and there does not exist, <math> v_i,v_j \in V_k </math> where <math>k=1,2</math> s.t. <math>v_i</math> and <math>v_j </math> share an edge, however, <math>\exists v_i \in V_1, v_j \in V_2</math> where <math>v_i</math> and <math>v_j </math> share an edge. A '''Path graph''' (b) is a graph where, <math>|V| \geq 2</math> and all vertices are connected sequentially meaning each vertex except the first and last have 2 edges, one coming from the previous vertex and one going to the next vertex. A '''Cycle graph''' (c) is similar to a path graph except each node has 2 edges and are connected in a loop, meaning if you start at any vertex and follow an edge of each node going in one direction it will eventually lead back to the starting node. These are just three examples of graph types in reality there are many more and it can beneficial to be able to connect the structure of ones data to an appropriate graph type.<br />
<br />
<gallery mode="packed"><br />
Image:bipartite.png| (a) Bipartite Graph<br />
Image:path.gif| (b) Path Graph<br />
Image:cycle.png| (c) Cycle Graph<br />
</gallery><br />
<br />
==Problems in Papers==<br />
Some of the most common reproducibility problems encountered in this field concern hyperparameters<br />
selection and the correct usage of data splits for model selection versus model assessment.<br />
Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not<br />
standardized across different works in terms of node and edge features.<br />
<br />
These issues easily generate doubts and confusion among practitioners that need a fully transparent<br />
and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through<br />
two different phases, namely model selection on the validation set and model assessment on the<br />
test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and<br />
biased estimates of the true performance of a model, making it hard for other researchers to present<br />
competitive results without following the same ambiguous evaluation procedures.<br />
<br />
==Risk Assessment and Model Selection==<br />
'''Risk Assessment<br />
<br />
The goal of risk assessment is to provide an estimate of the performance of a class of models.<br />
When a test set is not explicitly given, a common way to proceed is to use k-fold Cross-Validation.<br />
As the model selection is performed independently for<br />
each training/test split, they obtain different “best” hyper-parameter configurations; this is why they<br />
refer to the performance of a class of models. <br />
<br />
'''Model Selection<br />
<br />
The goal of model selection, or hyperparameter tuning, is to choose among a set of candidate hyperparameter<br />
configurations the one that works best on a specific validation set. It also important to acknowledge the selection bias when selecting a model as this makes the validation accuracy of a selected model from a pool of candidates models a biased test accuracy.<br />
<br />
==Overview of Reproducibility Issues==<br />
The paper explores five different GNN models exploring issues with their experimental setup and potential reproducibility. <br />
===The GNN's were selected based on the following criteria===<br />
<br />
1. Performances obtained with 10-fold CV<br />
<br />
2. Peer reviews<br />
<br />
3. Strong architectural differences<br />
<br />
4. Popularity<br />
<br />
===Criteria to assess the quality of evaluation and reproducibility was as follows===<br />
<br />
1. Code for data pre-processing<br />
<br />
2. Code for model selection<br />
<br />
3. Data splits are provided<br />
<br />
4. Data is split by means of a stratification technique<br />
<br />
5. Results of the 10-fold CV are reported correctly using standard deviations<br />
<br />
Using the following criteria, 4 different papers were selected and their assessment on the quality of evaluation and reproducibility is as follows:<br />
<br />
[[File:table_3.png|700px|Image: 700 pixels|]]<br />
<br />
Where (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A)<br />
indicates ambiguity (i.e. it is unclear whether the criteria is met or not), (-) indicates lack of information (i.e. no details are provided about the criteria).<br />
<br />
===Issues with DGCNN (Deep Graph Convolutional Neural Network)===<br />
The authors of DGCNN use a faulty method of tuning learning rate and epoch. They used only a single fold for tuning hyperparameters despite evaluating the model on 10-fold CV. This potentially leads to suboptimal performance. They haven't released the code for the experiments. Lastly, they average the one-fold CV across 10 folds and then report the numbers. This also reduces variance.<br />
<br />
=== Issues with DiffPoll === <br />
It has not been clearly stated in the paper whether the results come from a test set or if they come from a validation set. Moreover, standard deviation over the 10-fold CV has also not been reported. Due to no random seeds, different data splits are there while performing multi-fold splits (without stratification).<br />
<br />
=== Issue with ECC ===<br />
The results of the paper do not report the standard deviation obtained during the 10-fold Cross Validation. Like in the case of GDCNN, the model selection procedure is not made clear due to pre-determined hyper-parameters. Code repository is not available as well.<br />
<br />
=== GIN === <br />
Instead of reporting the test accuracy, the authors have given the validation accuracy over the 10-fold CV. Therefore, the given results are not suitable for evaluating the model. Code repository is not available for selecting the model.<br />
<br />
==Experiments==<br />
They re-evaluate the above-mentioned models<br />
on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely<br />
follows the rigorous practices as described earlier.<br />
In addition, they implemented two baselines<br />
whose purpose is to understand the extent to which GNNs are able to exploit structural information.<br />
<br />
===Datasets===<br />
<br />
All graph datasets used are publicly available (Kersting et al., 2016) and represent a relevant<br />
subset of those most frequently used in the literature to compare GNNs.<br />
<br />
===Features===<br />
<br />
In GNN literature, it is common practice to augment node descriptors with structural<br />
features. In general, good experimental practices suggest that all models should be consistently compared to<br />
the same input representations. This is why they re-evaluate all models using the same node features.<br />
In particular, they use one common setting for the chemical domain and two alternative settings<br />
for the social domain.<br />
<br />
===Baseline Model===<br />
<br />
They adopted two distinct baselines, one for chemical and one for social datasets. On all<br />
chemical datasets except ENZYMES, they follow Ralaivola et al. (2005); Luzhnica et al. (2019)<br />
and implement the Molecular Fingerprint technique. On social domains<br />
and ENZYMES (due to the presence of additional features), they take inspiration from the work of<br />
Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes.<br />
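As a sketch of what such a structure-agnostic baseline looks like, the following assumes DeepSets-style aggregation in the spirit of Zaheer et al. (2017); the weights, dimensions, and single-layer choice are illustrative only:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.standard_normal((7, 16))   # shared per-node transform: 7 features -> 16
W_rho = rng.standard_normal((16, 2))   # graph-level readout: 16 -> 2 classes

def baseline_logits(node_feats):
    """Structure-agnostic graph classifier: the edge set is never consulted."""
    h = np.maximum(node_feats @ W_phi, 0.0)    # phi applied node-wise, with ReLU
    g = h.sum(axis=0)                          # permutation-invariant sum pooling
    return g @ W_rho                           # rho on the pooled representation

graph = rng.standard_normal((5, 7))            # a toy graph with 5 nodes
logits = baseline_logits(graph)
perm = rng.permutation(5)
assert np.allclose(logits, baseline_logits(graph[perm]))  # node order irrelevant
```

Because the sum over nodes discards all edge information, any accuracy this baseline reaches is achievable without exploiting graph structure, which is exactly what makes it a useful reference point.<br />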
<br />
===Experimental Setting===<br />
<br />
1. Used a 10-fold CV for model assessment<br />
and an inner holdout technique with a 90%/10% training/validation split for model selection.<br />
<br />
2. After each model selection, they train three times on the whole training fold, holding out a random fraction<br />
(10%) of the data to perform early stopping.<br />
<br />
3. The final test fold score is<br />
obtained as the mean of these three runs<br />
<br />
4. To be consistent with the literature, they implemented early stopping with patience parameter<br />
n, where training stops if n epochs have passed without improvement on the validation set.<br />
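Under stated assumptions (hypothetical `train`/`evaluate` interfaces supplied by the caller, and a simplified splitting helper), the protocol in steps 1-4 can be sketched as:<br />

```python
import numpy as np

def holdout(idx, frac, rng):
    """Split an index array into (train, held-out) parts."""
    idx = rng.permutation(idx)
    n_val = max(1, int(len(idx) * frac))
    return idx[n_val:], idx[:n_val]

def assess(X, y, configs, train, evaluate, k=10, n_runs=3, seed=0):
    """Outer k-fold risk assessment with inner holdout model selection."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        rest = np.concatenate(folds[:i] + folds[i + 1:])
        tr, val = holdout(rest, 0.1, rng)          # inner 90%/10% selection split
        best = max(configs,
                   key=lambda c: evaluate(train(X[tr], y[tr], c), X[val], y[val]))
        runs = []                                  # three final trainings, each
        for _ in range(n_runs):                    # early-stopped on a fresh 10%
            tr2, es = holdout(rest, 0.1, rng)
            model = train(X[tr2], y[tr2], best, early_stop=(X[es], y[es]))
            runs.append(evaluate(model, X[test], y[test]))
        scores.append(np.mean(runs))               # fold score = mean of the runs
    return float(np.mean(scores)), float(np.std(scores))
```

The returned mean and standard deviation over the k outer folds are exactly the numbers the authors argue every paper should report.<br />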
<br />
<br />
[[File:image_1.png|900px|center|Image: 900 pixels]]<br />
<div align="center">'''Figure 2:''' Visualization Of the Evaluation Framework </div><br />
In order to better understand the Model Selection and Model Assessment steps in the above figure, one can also take a look at the pseudocode below.<br />
[[File:pseudo-code_paper11.png|900px|center|Image: 900 pixels]]<br />
<br />
===Hyper-Parameters===<br />
<br />
1. Hyper-parameter tuning was performed via grid search.<br />
<br />
2. They always included the hyper-parameters used by<br />
other authors in their respective papers.<br />
<br />
===Computational Considerations===<br />
<br />
As their research included a large number of training-testing cycles, they constrained the experiments as follows:<br />
<br />
1. For all models, grid sizes ranged from 32 to 72 possible configurations, depending on the number of<br />
hyper-parameters to choose from.<br />
<br />
2. Limited the time to complete a single training to 72 hours.<br />
<br />
[[File:table_1.png|900px|Image: 900 pixels]]<br />
[[File:table_2.png|900px|Image: 900 pixels]]<br />
<br />
===Effect of Node Degree on Layering===<br />
[[File:Paper11_NodeDegree.png]]<br />
<br />
The above table displays the median number of selected layers in relation to the addition of node<br />
degrees as input features on all social datasets. 1 indicates that an uninformative feature is used as<br />
a node label.<br />
<br />
<br />
===Comparison with Published Results===<br />
[[File:paper11.png|900px|Image: 900 pixels]]<br />
<br />
<br />
In the above figure, we can see the comparison between the average test results obtained by the authors and those reported in the literature. The plots show that the test accuracies computed in this paper differ in most cases from those reported in the literature, and the gap between the two estimates is often substantial.<br />
== Source Codes ==<br />
The data and scripts to reproduce the experiments reported in the paper are available at https://github.com/diningphil/gnn-comparison .<br />
==Conclusion==<br />
<br />
1. Highlighted ambiguities in the experimental settings of different papers<br />
<br />
2. Proposed a clear and reproducible procedure for future comparisons<br />
<br />
3. Provided a complete re-evaluation of four GNNs<br />
<br />
4. Found that structure-agnostic baselines outperform GNNs on some chemical datasets, suggesting that structural properties have not yet been fully exploited.<br />
<br />
<br />
==Critique==<br />
This paper raises an important issue about the reproducibility of five important graph neural network models on 9 datasets. Reproducibility and replicability are very important topics for science in general, and even more so for fast-growing fields like machine learning. The authors propose a unified scheme for evaluating reproducibility in graph classification papers; this unified approach can be used in future graph classification work so that comparisons between proposed methods become clearer. The results of the paper are interesting in that, in some cases, the baseline methods outperform the other proposed algorithms. Finally, I believe one of the main limitations of the paper is the lack of technical discussion. For example, it would have been valuable to discuss in more depth why the baseline models perform better, why the results are not consistent across datasets, and whether the best GNN should be chosen based on the type of data; if so, what are the guidelines?<br />
<br />
It is also well known in the GNN literature that these models are designed to solve non-Euclidean problems on graph-structured data. Such problems can hardly be handled by general deep learning techniques, and there are different graph designs that implement various mechanisms, e.g. heat-diffusion mechanisms. In my opinion, a better way to organize the field would be to categorize existing GNN models into spatial and spectral domains and to reveal connections among the subcategories within each domain. As the number of GNN models grows, further analysis is needed to establish a strong link between the spatial and spectral domains, making the models more interpretable and transparent in applications.<br />
<br />
==References==<br />
<br />
- Davide Bacciu, Federico Errica, and Alessio Micheli. Contextual graph Markov model: A deep<br />
and generative approach to graph processing. In Proceedings of the International Conference<br />
on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp.<br />
294–303. PMLR, 2018.<br />
<br />
- Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without<br />
alignments. Journal of molecular biology, 330(4):771–783, 2003.<br />
<br />
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In<br />
Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034. Curran Associates,<br />
Inc., 2017.<br />
<br />
- Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.<br />
<br />
- Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5165–5175. Curran Associates, Inc., 2018.<br />
<br />
[1] Hutson, M. (2018). Artificial intelligence faces a reproducibility crisis. Science, 359(6377), 725–726.</div>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach to time-series or sequence analysis generally focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting, but it is not very promising in a generative setup. The GAN approach, when applied to time-series directly, simply tries to learn <math>p(X|t)</math> using the generator and discriminator setup, but this fails to leverage the prior probabilities the way autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive models), allowing a generative model to preserve temporal dynamics while learning the overall distribution. This mechanism has been termed the '''Time-series Generative Adversarial Network''', or '''TimeGAN'''. To incorporate supervised learning of the data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (conditioned on ground truth) and open-loop inference (conditioned on previous guesses), multi-step sampling in autoregressive recurrent networks can suffer significant prediction error <sup>[2]</sup>. Different remedies have been proposed, including Scheduled Sampling <sup>[1]</sup>, where models are trained to output based on a combination of ground truth and previous outputs; training an auxiliary discriminator that separates free-running from teacher-forced hidden states, accelerating convergence <sup>[3][4]</sup>; and actor-critic methods <sup>[5]</sup> that condition on target outputs, estimating a next-token value that nudges the actor's free-running predictions. While all these methods improve multi-step sampling, they remain inherently deterministic.<br />
<br />
Direct applications of the GAN architecture to time-series data, such as C-RNN-GAN and RCGAN <sup>[6]</sup>, generate the time-series recurrently, sometimes taking the generated output from the previous step as input (as in the case of RCGAN) along with the noise vector. Recently, conditioning on time-stamp information has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and rely only on the traditional adversarial (real/fake) feedback for learning, which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change frequently with time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, input to the model can be thought of as a tuple <math>(S, X_{1:T})</math> with some joint distribution <math>p</math>. The objective of a generative model is, of course, to learn from training data an approximation <math>\hat{p}(S, X)</math> of the original distribution <math>p(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T})\,\|\,\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(X_t \mid S, X_{1:t-1})\,\|\,\hat{p}(X_t \mid S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of a sequence generator and a sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how the four components are arranged and how information flows between them during training.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representations, reducing the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of the static features <math>S</math> and the temporal features <math>X</math>. The embedding function then has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions are implemented with a recurrent network for '''e''' and a feedforward network for '''r'''. These choices can, of course, be replaced by any suitable parameterized architecture. <br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. These do not operate on the original feature space; rather, the generator maps the random noise vectors <math>Z_s</math> and <math>Z_x</math> into the latent representations <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only receives the noise vector <math>Z</math> as input but also, in autoregressive fashion, receives its own previous outputs <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients are used to decrease the likelihood at the generator and to increase it at the discriminator, so that the discriminator correctly classifies the produced synthetic output. This yields the second objective function, in the form of the unsupervised loss.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary feedback from the GAN's adversarial component, i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previously produced synthetic embeddings. Maximizing the likelihood in this setting produces the third objective, i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize the Supervised loss and Recovery loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, then the paper proposes the following as the optimization problem for these two components:<br />
<div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda \ge 0</math> is used to regularize (or balance) the two losses. <br />
The other components of TimeGAN i.e. generator and discriminator are trained to minimize the Supervised loss along with Unsupervised loss. This optimization problem is formulated as below:<br />
<div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math>\eta \ge 0</math> is used to regularize the two losses.<br />
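As a toy numerical illustration of the first weighted objective, the loss terms below are computed on random stand-in tensors rather than the paper's actual network outputs:<br />

```python
import numpy as np

def reconstruction_loss(x, x_tilde):
    """L_R: expected squared error between real and recovered sequences."""
    return float(np.mean((x - x_tilde) ** 2))

def supervised_loss(h_next_true, h_next_pred):
    """L_S: one-step-ahead prediction error of the generator in latent space."""
    return float(np.mean((h_next_true - h_next_pred) ** 2))

rng = np.random.default_rng(0)
x, x_tilde = rng.standard_normal((2, 24, 5))       # a sequence and its recovery
h_true, h_pred = rng.standard_normal((2, 24, 8))   # latent codes and predictions

lam = 1.0                                          # lambda >= 0 balances the terms
embedding_objective = lam * supervised_loss(h_true, h_pred) \
                      + reconstruction_loss(x, x_tilde)
assert embedding_objective >= 0.0
```

Setting <math>\lambda = 0</math> recovers a plain autoencoder objective for the embedding and recovery networks; larger values force the latent space to also support one-step-ahead prediction.<br />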
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and related variations of traditional GANs applied to time-series i.e. RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (healthy distribution of sample covering real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should have the same predictive purposes as real data). <br />
<br />
The following methods are used for benchmarking and evaluation.<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
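The discriminative score in particular reduces to a simple recipe: train any post-hoc classifier to tell real from synthetic samples and report <math>|\text{accuracy} - 0.5|</math>. A hedged sketch, where a nearest-centroid classifier stands in for the paper's off-the-shelf RNN:<br />

```python
import numpy as np

def discriminative_score(real, synth, seed=0):
    """|test accuracy - 0.5| of a classifier separating real vs. synthetic."""
    rng = np.random.default_rng(seed)
    X = np.vstack([real, synth])
    y = np.r_[np.ones(len(real)), np.zeros(len(synth))]
    idx = rng.permutation(len(y))
    cut = len(y) // 2
    tr, te = idx[:cut], idx[cut:]
    # stand-in classifier: nearest class centroid fit on the training half
    c1 = X[tr][y[tr] == 1].mean(axis=0)
    c0 = X[tr][y[tr] == 0].mean(axis=0)
    pred = (np.linalg.norm(X[te] - c1, axis=1)
            < np.linalg.norm(X[te] - c0, axis=1)).astype(float)
    return abs(float((pred == y[te]).mean()) - 0.5)

rng = np.random.default_rng(1)
real = rng.standard_normal((200, 10))
good = rng.standard_normal((200, 10))        # same distribution: score near 0
bad = rng.standard_normal((200, 10)) + 3.0   # shifted: score near 0.5
assert discriminative_score(real, good) < discriminative_score(real, bad)
```

A score near 0 means the classifier cannot beat chance, i.e. the synthetic data is indistinguishable from the real data.<br />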
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian model defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math> and <math>\mathbf{1}</math> is the all-ones matrix. Table 1 reports the results of this experiment for the different models, and they clearly show that TimeGAN outperforms the other methods in terms of both discriminative and predictive scores. <br />
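This benchmark data is straightforward to generate; a sketch with arbitrary dimensions (not the paper's exact settings):<br />

```python
import numpy as np

def ar_gaussian(T=100, d=5, phi=0.8, sigma=0.5, seed=0):
    """Multivariate AR(1): x_t = phi * x_{t-1} + n, n ~ N(0, sigma*1 + (1-sigma)*I).

    Here 1 is the all-ones matrix, so sigma controls cross-feature correlation.
    """
    rng = np.random.default_rng(seed)
    cov = sigma * np.ones((d, d)) + (1 - sigma) * np.eye(d)
    x = np.zeros((T, d))
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.multivariate_normal(np.zeros(d), cov)
    return x

series = ar_gaussian()
assert series.shape == (100, 5)
```

Varying <math>\phi</math> sweeps the strength of the temporal correlation and <math>\sigma</math> the correlation across features, which is what makes this a controlled testbed for temporal dynamics.<br />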
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper experiments on different types of time-series data. Using sequences of varying properties from datasets such as Sines, Stocks, Energy, and Events, the paper evaluates how well TimeGAN generalizes across time-series data relative to the other methods. Figure 2 shows the t-SNE/PCA visualization comparison for Sines and Stocks, and it is clear from the figure that, among all the models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for the different methods across the datasets. TimeGAN outperforms the other methods on both scores, indicating better quality of the generated synthetic data across the different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce realistic time sequences with differential privacy guarantees.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.</div>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling improving the temporal dynamics of data <sup>[1]</sup>. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but it is not very promising in a generative setup. The GAN approach when applied on time-series directly simply tries to learn <math>p(X|t)</math> using generator and discriminator setup but this fails to leverage the prior probabilities like in the case of the autoregressive models.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (ground truth conditioned) and open-loop inference (the previous guess conditioned), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks <sup>[2]</sup>. Different methods have been proposed to remedy this including Scheduled Sampling <sup>[1]</sup> where models are trained to output based on a combination of ground truth and previous outputs, training and an auxiliary discriminator that helps separate free-running and teacher-forced hidden states accelerating convergence<sup>[3][4]</sup>, and Actor-critic methods <sup>[5]</sup> that condition on target outputs estimating the next-token value that nudges the actor’s free-running predictions. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN <sup>[6]</sup> try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change frequently with time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, input to the model can be thought of as a tuple of <math>(S, X_{1:t})</math> that has a joint distribution say <math>p</math>. The objective of a generative model is of course to learn from training data, an approximation of the original distribution <math>p(S, X)</math> i.e. <math>\hat{p}(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition of <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math> as well. This gives the following two objective functions.<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how does information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of <math>S</math> and <math>X</math> features in the original space. Therefore, the embedding function has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions have been implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These implementation choices are of course subject to parametrization using any architecture. <br />
<br />
=== Sequence Generator and Discriminator ===<br />
Coming to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. But these do not work on the original space, rather the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into a latent representation <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN not only gets the noise vector Z as input but it also gets in autoregressive fashion, its previous output i.e. <math>h_s</math> and <math>h_{1:t}</math> as input as well. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients when computed are used to decreasing the likelihood at the generator and increasing it at the discriminator to provide the correct classification of the produced synthetic output. This is the second objective function in the unsupervised loss form.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely on only the binary feedback from GANs adversarial component i.e. the discriminator. It also incorporates the supervised loss from the embedding and recovery functions into the fold. To ensure that the two segments of TimeGAN interact with each other, the generator is alternatively fed embeddings of actual data instead of its own previous synthetical produced embedding. Maximizing the likelihood of this produces the third objective i.e. the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
<br />
=== Optimization ===<br />
The embedding and recovery components of TimeGAN are trained to minimize both the supervised loss and the reconstruction loss. If <math> \theta_{e} </math> and <math> \theta_{r} </math> denote their parameters, the paper proposes the following optimization problem for these two components:<br />
<div align="center"> [[File:Paper27_eq1.PNG]] </div><br />
Here <math>\lambda \geq 0</math> is used to regularize (or balance) the two losses. <br />
The other components of TimeGAN, i.e. the generator and discriminator, are trained to minimize the supervised loss along with the unsupervised loss. This optimization problem is formulated as below:<br />
<div align="center"> [[File:Paper27_eq2.PNG]] </div> Here <math> \eta \geq 0 </math> is used to regularize the two losses.<br />
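The two weighted objectives can be sketched as simple weighted sums; this ignores the alternating min-max updates used in actual adversarial training:<br />

```python
def embedding_objective(l_supervised, l_reconstruction, lam=1.0):
    """Objective minimized by the embedding/recovery pair (sketch):
    lam (lambda >= 0) balances the supervised and reconstruction terms."""
    assert lam >= 0
    return lam * l_supervised + l_reconstruction

def generator_objective(l_supervised, l_unsupervised, eta=1.0):
    """Objective for the generator/discriminator pair (sketch):
    eta >= 0 balances the supervised and adversarial terms."""
    assert eta >= 0
    return eta * l_supervised + l_unsupervised
```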
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most closely related GAN variants for time series, RCGAN and C-RNN-GAN. To make a comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are also made with WaveNet <sup>[7]</sup> and its GAN alternative WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (samples should cover the distribution of the real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should serve the same predictive purposes as real data). <br />
<br />
The following methods are used for benchmarking and evaluation.<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model on the generated data and evaluating its forecasts on the real data (train on synthetic, test on real).<br />
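For instance, the discriminative score can be sketched as the absolute deviation of the post-hoc classifier's held-out accuracy from chance level (0.5); this is a sketch of the idea, not the paper's exact evaluation code:<br />

```python
def discriminative_score(predictions, labels):
    """Sketch of the discriminative score: |0.5 - accuracy| of a
    post-hoc real-vs-synthetic classifier. A score near 0 means the
    classifier cannot distinguish generated from real sequences."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return abs(0.5 - correct / len(labels))
```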
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian model defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math> and <math>\mathbf{1}</math> denotes the all-ones matrix. Table 1 has the results of this experiment performed by different models. The results clearly show how TimeGAN outperforms other methods in terms of both discriminative and predictive scores. <br />
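This synthetic benchmark can be reproduced in a few lines; a sketch assuming NumPy, with illustrative parameter values:<br />

```python
import numpy as np

def autoregressive_gaussian(T=100, dim=5, phi=0.8, sigma=0.5, seed=0):
    """Sketch of the first experiment's data generator:
    x_t = phi * x_{t-1} + n_t, with noise covariance
    sigma * (all-ones matrix) + (1 - sigma) * I, so that the dim
    coordinates are correlated within each time step."""
    rng = np.random.default_rng(seed)
    cov = sigma * np.ones((dim, dim)) + (1.0 - sigma) * np.eye(dim)
    x = np.zeros(dim)
    steps = []
    for _ in range(T):
        x = phi * x + rng.multivariate_normal(np.zeros(dim), cov)
        steps.append(x)
    return np.stack(steps)
```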
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper experiments on different types of time-series data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to test its ability to generalize over time-series data. The paper uses datasets like Sines, Stocks, Energy, and Events with different methods to compare their performance. Figure 2 shows a t-SNE/PCA visualization comparison for Sines and Stocks, and it is clear from the figure that among all the models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for different methods across different datasets. TimeGAN outperforms the other methods on both scores, indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
== Source Code ==<br />
<br />
The GitHub repository for the paper is https://github.com/jsyoon0823/TimeGAN .<br />
<br />
== Conclusion ==<br />
Combining the flexibility of GANs and control over conditional temporal dynamics of autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across different varieties of datasets. <br />
<br />
The authors indicated the potential incorporation of Differential Privacy Frameworks into TimeGAN in the future in order to produce realistic time sequences with differential privacy guarantees.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016.<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=48945Functional regularisation for continual learning with gaussian processes2020-12-02T22:39:52Z<p>G45sharm: /* Selection of the Inducing Points */</p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks. This paper proposes a new framework to regularize Continual Learning (CL) so that it doesn't forget previously learned tasks. This method, referred to as functional regularization for Continual Learning, leverages the Gaussian process to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then utilized in optimization as a regularizer to prevent the model from completely deviating from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimization of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularization-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope of regularizing the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularizing weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in an increased forgetting of earlier tasks with long sequences of tasks.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimized using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularization-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularizes the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach as it can be viewed as an infinite-dimensional generalization of multivariate normal distributions. In a very informal sense, it can be thought of as a distribution over continuous functions - this is why we make use of the GP to perform optimization in the function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> can be completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterized by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many <math>\boldsymbol{x}</math>, the function's value at these input points is a multivariate normal distribution with covariance matrix parametrized by the kernel function.<br />
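This finite-dimensional view is easy to demonstrate: build a kernel matrix at a handful of inputs and draw from the corresponding multivariate normal. The sketch below uses a squared-exponential kernel purely for illustration; the paper's kernel instead comes from neural-network features:<br />

```python
import numpy as np

def rbf_kernel(xs, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs (illustrative
    choice; any positive semi-definite kernel works)."""
    xs = np.asarray(xs, dtype=float)
    sq_dist = (xs[:, None] - xs[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def sample_gp(xs, seed=0):
    """One GP function sample at finitely many inputs: a draw from
    N(0, K) with K the kernel matrix (small jitter for stability)."""
    K = rbf_kernel(xs) + 1e-8 * np.eye(len(xs))
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(xs)), K)
```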
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers to the distribution of functions, and <math>\mathcal{N}</math> refers to the distribution of finite random variables.<br />
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are the observation taken from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to the English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i).\]<br />
Instead of having fixed model weight <math>w_i</math>, we now have a distribution for it, which depends on the input data. Then we can summarize information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we are facing the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vector <math>X_i</math>. To see this, note that the <math>\Sigma_i</math> is a <math>N_i \times N_i</math> matrix. Hence, the authors tackle this problem by using the Sparse Gaussian process with inducing points, which is introduced as follows.<br />
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be a lot smaller than the number of training data. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we are also assuming the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
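The kernel used in this paper is induced by the shared feature extractor. A minimal sketch, assuming a feature matrix whose rows are the vectors <math>\phi(x_i;\theta)</math>:<br />

```python
import numpy as np

def feature_kernel(Phi, sigma_w=1.0):
    """Kernel matrix induced by the shared feature extractor:
    K[i, j] = sigma_w^2 * phi(x_i)^T phi(x_j), where row i of Phi is
    the feature vector phi(x_i; theta). With N training inputs the
    full posterior costs O(N^2); inducing points shrink N to M << N."""
    Phi = np.asarray(Phi, dtype=float)
    return sigma_w ** 2 * (Phi @ Phi.T)
```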
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to generate the first posterior belief given this task: <math>p(\boldsymbol{u}_1|\text{Data})</math>. We learn this distribution by approximating it by a variational distribution: <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrize <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is the lower triangular Cholesky factor. I.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently, <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrized by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimizing the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximizing the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_1, j)}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularize the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximized is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularization from previous tasks}}\]<br />
<br />
As a result, we regularize the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimization computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularization term.<br />
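Since both <math>q(\boldsymbol{u}_i)</math> and <math>p_\theta(\boldsymbol{u}_i)</math> are Gaussian, each regularization term has a closed form. A sketch with NumPy, assuming a zero-mean prior for simplicity:<br />

```python
import numpy as np

def gaussian_kl(mu_q, cov_q, cov_p):
    """Closed-form KL( N(mu_q, cov_q) || N(0, cov_p) ) -- the shape
    of one per-task regularization term KL(q(u_i) || p_theta(u_i))."""
    k = len(mu_q)
    cov_p_inv = np.linalg.inv(cov_p)
    trace_term = np.trace(cov_p_inv @ np.asarray(cov_q, dtype=float))
    quad_term = np.asarray(mu_q, dtype=float) @ cov_p_inv @ np.asarray(mu_q, dtype=float)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term + quad_term - k + logdet_p - logdet_q)
```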
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors proposed a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>: performing inference over the current task in the weight space. Due to the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_k, j)}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i)))}_{\text{regularization from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over their function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to select a random set <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors proposed a structured way to select them. The proposed method is an unsupervised criterion that only depends on the training inputs, which quantifies how well the full kernel matrix <math>K_{X_k}</math> can be constructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrized by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure, where the finally selected inducing points are spread out over different clusters of data. On the right side of the image, the round dots represent the data points and each colour corresponds to a different label. The left part of the image shows how optimized inducing points cover examples from all classes, as opposed to randomly selected inducing points, where each set could have a skewed number of points from the same class.<br />
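The trace criterion is a few lines of linear algebra once the kernel blocks are computed; a sketch assuming NumPy:<br />

```python
import numpy as np

def trace_criterion(K_XX, K_XZ, K_ZZ, jitter=1e-6):
    """T(Z) = tr( K_XX - K_XZ K_ZZ^{-1} K_ZX ): the part of the full
    kernel matrix the inducing points fail to reconstruct. Candidate
    sets Z with a smaller value are preferred."""
    K_XX, K_XZ, K_ZZ = (np.asarray(a, dtype=float) for a in (K_XX, K_XZ, K_ZZ))
    K_ZZ_inv = np.linalg.inv(K_ZZ + jitter * np.eye(K_ZZ.shape[0]))
    return float(np.trace(K_XX - K_XZ @ K_ZZ_inv @ K_XZ.T))
```

Taking the inducing set equal to the full training set drives the criterion to (numerically) zero.<br />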
<br />
[[File:inducing-points-extended.png|centre]]<br />
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^T K_{Z_i}^{-1} k_{Z_i x_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_i x_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_i x_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasize that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes over time as more tasks are learned, so the overall prediction will not be out of date. In comparison, if we use a distribution of weights <math>w_i</math>, the mean of the distribution will remain unchanged over time, thus resulting in obsolete prediction.<br />
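The predictive formulas above can be sketched directly; kernel quantities are assumed precomputed and names are illustrative:<br />

```python
import numpy as np

def gp_predict(k_star, k_Zstar, K_ZZ, mu_u, L_u):
    """Sketch of q_theta(f_*): predictive mean and variance at x_*.
    k_star = k(x_*, x_*), k_Zstar = vector K(Z_i, x_*),
    mu_u / L_u parametrize q(u_i) = N(mu_u, L_u L_u^T)."""
    k_Zstar, K_ZZ, mu_u, L_u = (np.asarray(a, dtype=float)
                                for a in (k_Zstar, K_ZZ, mu_u, L_u))
    K_inv = np.linalg.inv(K_ZZ)
    mean = mu_u @ K_inv @ k_Zstar
    S = L_u @ L_u.T
    var = k_star + k_Zstar @ K_inv @ (S - K_ZZ) @ K_inv @ k_Zstar
    return float(mean), float(var)
```

When <math>q(\boldsymbol{u}_i)</math> equals the prior (zero mean, <math>L L^T = K_{Z_i}</math>), the prediction reverts to the prior mean 0 and variance <math>k(x_*, x_*)</math>, which is the behaviour exploited for boundary detection below.<br />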
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in the practical setting. Therefore, the authors introduced a way to detect task boundaries using GP predictive uncertainty. This is done by measuring the distance between the GP posterior density of new task data and the prior GP density using the symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math>: the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameters, we can perform a statistical test between the values <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variance.<br />
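Welch's t statistic itself is straightforward to compute; a self-contained sketch (in practice one would also obtain a p-value, e.g. via scipy.stats.ttest_ind with equal_var=False):<br />

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two samples with (possibly) unequal
    variances -- here, 'surprise' scores of the current batch vs.
    those of the previous batch."""
    def mean_and_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
        return m, v
    m_a, v_a = mean_and_var(a)
    m_b, v_b = mean_and_var(b)
    return (m_a - m_b) / math.sqrt(v_a / len(a) + v_b / len(b))
```

A large negative statistic (current scores much lower than the previous batch's) signals a likely task switch.<br />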
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimized using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and in particular, this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes increasingly important as the number of inducing points decreases.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularization-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way for remembering previous tasks by reducing the KL divergence of variational distribution: <math>q(\boldsymbol{u}_1)</math> and <math>p_\theta(u_1)</math>. The ideas in the paper are interesting and experiments support the effectiveness of this approach. After reading the summary, some points came to my mind as follows:<br />
<br />
The main problem with the Gaussian process is its test-time computational load, where a Gaussian process needs a data matrix and a kernel for each prediction. This seems natural, since the Gaussian process is non-parametric and, apart from the data, has no other source of knowledge; however, it comes with computational and memory costs that make it difficult to employ in practice. In this paper, the authors propose to employ a subset of the training data, namely "Inducing Points", to mitigate these challenges. They propose to choose inducing points either at random or based on an optimization scheme where the inducing points should approximate the kernel. Although the authors formulate the problem of inducing points within their own setting, this is not a new approach in the field and was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper employs the kernel matrix, whereas the mentioned paper employs dissimilarities, to find exemplars or inducing points.<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice, covariance matrices are positive semi-definite in general, while, to the best of my knowledge, the Cholesky decomposition can only be used for positive definite matrices. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to such a covariance matrix.<br />
<br />
Finally, the number of regularization terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, and I am not sure how this algorithm behaves when the number of tasks increases. Clearly, apart from the computational cost, having many regularization terms can make the optimization more difficult.<br />
<br />
The provided experiments are interesting and fairly comprehensive, and they do a good job of highlighting different facets of the paper, but it would be better if these two additional results could be provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:inducing-points-extended.png&diff=48944File:inducing-points-extended.png2020-12-02T22:35:39Z<p>G45sharm: </p>
<hr />
<div></div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes&diff=48943Functional regularisation for continual learning with gaussian processes2020-12-02T22:35:06Z<p>G45sharm: /* Selection of the Inducing Points */</p>
<hr />
<div>== Presented by == <br />
Meixi Chen<br />
<br />
== Introduction ==<br />
<br />
Continual Learning (CL) refers to the problem where different tasks are fed to a model sequentially, such as training a natural language processing model on different languages over time. A major challenge in CL is that a model forgets how to solve earlier tasks, a phenomenon often called catastrophic forgetting. This paper proposes a new framework to regularize CL so that the model does not forget previously learned tasks. The method, referred to as functional regularisation for Continual Learning, leverages Gaussian processes to construct an approximate posterior belief over the underlying task-specific function. The posterior belief is then used in optimization as a regularizer to prevent the model from deviating completely from the earlier tasks. The estimation of the posterior functions is carried out under the framework of approximate Bayesian inference.<br />
<br />
== Previous Work ==<br />
<br />
There are two types of methods that have been widely used in Continual Learning.<br />
<br />
===Replay/Rehearsal Methods===<br />
<br />
This type of method stores the data or its compressed form from earlier tasks. The stored data is replayed when learning a new task to mitigate forgetting. It can be used for constraining the optimization of new tasks or joint training of both previous and current tasks. However, it has two disadvantages: 1) Deciding which data to store often remains heuristic; 2) It requires a large quantity of stored data to achieve good performance.<br />
<br />
===Regularization-based Methods===<br />
<br />
These methods leverage sequential Bayesian inference by putting a prior distribution over the model parameters in the hope to regularize the learning of new tasks. Elastic Weight Consolidation (EWC) and Variational Continual Learning (VCL) are two important methods, both of which make model parameters adaptive to new tasks while regularizing weights by prior knowledge from the earlier tasks. Nonetheless, this might still result in an increased forgetting of earlier tasks with long sequences of tasks.<br />
<br />
== Comparison between the Proposed Method and Previous Methods ==<br />
<br />
===Comparison to replay/rehearsal methods===<br />
<br />
'''Similarity''': It also stores data from earlier tasks.<br />
<br />
'''Difference''': Instead of storing a subset of data, it stores a set of ''inducing points'', which can be optimized using criteria from GP literature [2] [3] [4].<br />
<br />
===Comparison to regularization-based methods===<br />
<br />
'''Similarity''': It is also based on approximate Bayesian inference by using a prior distribution that regularizes the model updates.<br />
<br />
'''Difference''': It constrains the neural network on the space of functions rather than weights by making use of ''Gaussian processes'' (GP).<br />
<br />
== Recap of the Gaussian Process ==<br />
<br />
'''Definition''': A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [1].<br />
<br />
The Gaussian process is a non-parametric approach, as it can be viewed as an infinite-dimensional generalization of the multivariate normal distribution. In an informal sense, it can be thought of as a distribution over continuous functions - this is why we use a GP to perform optimization in function space. A Gaussian process over a prediction function <math>f(\boldsymbol{x})</math> is completely specified by its mean function and covariance function (or kernel function), <br />
\[\text{Gaussian process: } f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}),K(\boldsymbol{x},\boldsymbol{x}'))\]<br />
Note that in practice the mean function is typically taken to be 0 because we can always write <math>f(\boldsymbol{x})=m(\boldsymbol{x}) + g(\boldsymbol{x})</math> where <math>g(\boldsymbol{x})</math> follows a GP with 0 mean. Hence, the GP is characterized by its kernel function.<br />
<br />
In fact, we can connect a GP to a multivariate normal (MVN) distribution with 0 mean, which is given by<br />
\[\text{Multivariate normal distribution: } \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}).\]<br />
When we only observe finitely many inputs <math>\boldsymbol{x}</math>, the function's values at these input points follow a multivariate normal distribution with covariance matrix parametrized by the kernel function.<br />
<br />
Note: Throughout this summary, <math>\mathcal{GP}</math> refers to the distribution of functions, and <math>\mathcal{N}</math> refers to the distribution of finite random variables.<br />
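This finite-dimensional view can be illustrated with a short NumPy sketch. The RBF kernel and the specific inputs are illustrative choices for this sketch only; they are not the kernel used in the paper, which is induced by neural-network features.<br />

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 l^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-sq_dists / (2.0 * lengthscale ** 2))

# Evaluating the zero-mean GP at finitely many inputs gives an MVN: f ~ N(0, K).
X = np.linspace(-3, 3, 50)[:, None]
K = rbf_kernel(X, X)

rng = np.random.default_rng(0)
# Small jitter on the diagonal for numerical stability before sampling.
f_samples = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50), size=3)
```

Each row of <code>f_samples</code> is one draw of the function's values at the 50 input points, i.e., one sample path of the GP restricted to those points.<br />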
<br />
''' A One-dimensional Example of the Gaussian Process '''<br />
<br />
In the figure below, the red dashed line represents the underlying true function <math>f(x)</math> and the red dots are observations drawn from this function. The blue solid line indicates the predicted function <math>\hat{f}(x)</math> given the observations, and the blue shaded area corresponds to the uncertainty of the prediction.<br />
<br />
[[File:FRCL-GP-example.jpg|500px|center]]<br />
<br />
== Methods ==<br />
<br />
Consider a deep neural network in which the final hidden layer provides the feature vector <math>\phi(x;\theta)\in \mathbb{R}^K</math>, where <math>x</math> is the input data and <math>\theta</math> are the task-shared model parameters. Importantly, let's assume the task boundaries are known. That is, we know when the input data is switched to a new task. Taking the NLP model as an example, this is equivalent to assuming we know whether each batch of data belongs to English, French, or German dataset. This assumption is important because it allows us to know when to update the task-shared parameter <math>\theta</math>. The authors also discussed how to detect task boundaries when they are not given, which will be presented later in this summary.<br />
<br />
For each specific task <math>i</math>, an output layer is constructed as <math>f_i(x;w_i) = w_i^T\phi(x;\theta)</math>, where <math>w_i</math> is the task-specific weight. By assuming that the weight <math>w_i</math> follows a normal distribution <math>w_i\sim \mathcal{N}(0, \sigma_w^2I)</math>, we obtain a distribution over functions:<br />
\[f_i(x) \sim \mathcal{GP}(0, k(x,x')), \]<br />
where <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>. We can express our posterior belief over <math>f_i(x)</math> instead of <math>w_i</math>. Namely, we are interested in estimating<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \sim p(\boldsymbol{f}_i|\boldsymbol{y}_i, X_i),\]<br />
where <math>X_i = \{x_{i,j}\}_{j=1}^{N_i}</math> are input vectors and <math>\boldsymbol{y}_i = \{y_{i,j}\}_{j=1}^{N_i}</math> are output targets, so that each <math> y_{i,j} </math> is assigned to the input <math>x_{i,j} \in \mathbb{R}^D</math>. However, in practice the following approximation is used:<br />
<br />
\[\boldsymbol{f}_i|\text{Data} \overset{approx.}{\sim} \mathcal{N}(\boldsymbol{f}_i|\mu_i, \Sigma_i).\]<br />
Instead of having a fixed model weight <math>w_i</math>, we now have a distribution for it that depends on the input data. We can then summarize the information acquired from a task by the estimated distribution of the weights, or equivalently, the distribution of the output functions that depend on the weights. However, we face the computational challenge of storing <math>\mathcal{O}(N_i^2)</math> parameters and keeping in memory the full set of input vectors <math>X_i</math>. To see this, note that <math>\Sigma_i</math> is a <math>N_i \times N_i</math> matrix. The authors tackle this problem using a sparse Gaussian process with inducing points, introduced as follows.<br />
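As a minimal sketch of the feature-induced kernel <math>k(x,x') = \sigma_w^2 \phi(x;\theta)^T\phi(x';\theta)</math>, with random stand-in features used in place of a trained network:<br />

```python
import numpy as np

def feature_kernel(Phi_X, Phi_Y, sigma_w=1.0):
    """Kernel induced by w ~ N(0, sigma_w^2 I) on the last layer:
    k(x, x') = sigma_w^2 * phi(x)^T phi(x')."""
    return sigma_w ** 2 * Phi_X @ Phi_Y.T

# Random stand-in for the last-hidden-layer features phi(x; theta).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 16))  # N = 100 inputs, K = 16 features

K = feature_kernel(Phi, Phi)
# The full posterior covariance over f is N x N, i.e. O(N^2) storage --
# the motivation for replacing X by a much smaller set of inducing points Z.
```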
<br />
'''Inducing Points''': <math>Z_i = \{z_{i,j}\}_{j=1}^{M_i}</math>, which can be a subset of <math>X_i</math> (the <math>i</math>-th training inputs) or points lying between the training inputs.<br />
<br />
'''Auxiliary function''': <math>\boldsymbol{u}_i</math>, where <math>u_{i,j} = f(z_{i,j})</math>. <br />
<br />
We typically choose the number of inducing points to be much smaller than the number of training data points. The idea behind the inducing point method is to replace <math>\boldsymbol{f}_i</math> by the auxiliary function <math>\boldsymbol{u}_i</math> evaluated at the inducing inputs <math>Z_i</math>. Intuitively, we assume the inducing inputs <math>Z_i</math> contain enough information to make inference about the "true" <math>\boldsymbol{f}_i</math>, so we can replace <math>X_i</math> by <math>Z_i</math>. <br />
<br />
Now we can introduce how to learn the first task when no previous knowledge has been acquired.<br />
<br />
=== Learning the First Task ===<br />
<br />
In learning the first task, the goal is to obtain the posterior belief given this task, <math>p(\boldsymbol{u}_1|\text{Data})</math>, which we approximate with a variational distribution <math>q(\boldsymbol{u}_1)</math>. That is, <math>p(\boldsymbol{u}_1|\text{Data}) \approx q(\boldsymbol{u}_1)</math>. We can parametrize <math>q(\boldsymbol{u}_1)</math> as <math>\mathcal{N}(\boldsymbol{u}_1 | \mu_{u_1}, L_{u_1}L_{u_1}^T)</math>, where <math>L_{u_1}</math> is a lower-triangular Cholesky factor, i.e., <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math>. Next, we introduce how to estimate <math>q(\boldsymbol{u}_1)</math>, or equivalently <math>\mu_{u_1}</math> and <math>L_{u_1}</math>, using variational inference.<br />
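The Cholesky parametrization <math>\Sigma_{u_1}=L_{u_1}L_{u_1}^T</math> can be sketched as follows. The softplus transform on the diagonal is a common implementation choice assumed here, not something specified in the paper:<br />

```python
import numpy as np

def chol_param_to_cov(raw, M):
    """Map an unconstrained vector of length M(M+1)/2 to Sigma = L L^T,
    where L is lower triangular with a softplus-transformed (positive) diagonal."""
    L = np.zeros((M, M))
    L[np.tril_indices(M)] = raw
    d = np.arange(M)
    L[d, d] = np.log1p(np.exp(L[d, d]))  # softplus keeps the diagonal positive
    return L, L @ L.T

rng = np.random.default_rng(1)
M = 4
L, Sigma = chol_param_to_cov(rng.standard_normal(M * (M + 1) // 2), M)
```

Because the diagonal of <code>L</code> is strictly positive, the resulting <code>Sigma</code> is a valid (positive definite) covariance for any unconstrained parameter vector, which is why this parametrization is convenient for gradient-based optimization.<br />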
<br />
Given the first task with data <math>(X_1, \boldsymbol{y}_1)</math>, we can use a variational distribution <math>q(\boldsymbol{f}_1, \boldsymbol{u}_1)</math> that approximates the exact posterior distribution <math>p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1)</math>, where<br />
\[q(\boldsymbol{f}_1, \boldsymbol{u}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1)\]<br />
\[p(\boldsymbol{f}_1, \boldsymbol{u}_1| \boldsymbol{y}_1) = p_\theta(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_\theta(\boldsymbol{u}_1|\boldsymbol{y}_1).\]<br />
Note that we use <math>p_\theta(\cdot)</math> to denote the Gaussian distribution form with kernel parametrized by a common <math>\theta</math>.<br />
<br />
Hence, we can jointly learn <math>q(\boldsymbol{u}_1)</math> and <math>\theta</math> by minimizing the KL divergence <br />
\[\text{KL}(p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1)q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{f}_1|\boldsymbol{u}_1, \boldsymbol{y}_1)p_{\theta}(\boldsymbol{u}_1|\boldsymbol{y}_1)),\]<br />
which is equivalent to maximizing the Evidence Lower Bound (ELBO)<br />
\[\mathcal{F}({\theta}, q(\boldsymbol{u}_1)) = \sum_{j=1}^{N_1} \mathbb{E}_{q(f_{1,j})}[\log p(y_{1,j}|f_{1,j})]-\text{KL}(q(\boldsymbol{u}_1) \ || \ p_{\theta}(\boldsymbol{u}_1)).\]<br />
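The KL term between the two Gaussians in this bound has a closed form. A minimal NumPy sketch of the standard multivariate-Gaussian KL, using a determinant-based implementation that assumes well-conditioned covariances:<br />

```python
import numpy as np

def gauss_kl(mu_q, Sigma_q, mu_p, Sigma_p):
    """Closed-form KL( N(mu_q, Sigma_q) || N(mu_p, Sigma_p) )."""
    M = mu_q.shape[0]
    Sp_inv = np.linalg.inv(Sigma_p)
    diff = mu_p - mu_q
    return 0.5 * (np.trace(Sp_inv @ Sigma_q)
                  + diff @ Sp_inv @ diff
                  - M
                  + np.log(np.linalg.det(Sigma_p) / np.linalg.det(Sigma_q)))
```

The KL is zero when the two distributions coincide and grows as the variational posterior moves away from the prior, which is exactly the regularizing behaviour used in the ELBO above.<br />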
<br />
=== Learning the Subsequent Tasks ===<br />
<br />
After learning the first task, we only keep the inducing points <math>Z_1</math> and the parameters of <math>q(\boldsymbol{u}_1)</math>, both of which act as a task summary of the first task. Note that <math>\theta</math> also has been updated based on the first task. In learning the <math>k</math>-th task, we can use the posterior belief <math>q(\boldsymbol{u}_1), q(\boldsymbol{u}_2), \ldots, q(\boldsymbol{u}_{k-1})</math> obtained from the preceding tasks to regularize the learning, and produce a new task summary <math>(Z_k, q(\boldsymbol{u}_k))</math>. Similar to the first task, now the objective function to be maximized is<br />
\[\mathcal{F}(\theta, q(\boldsymbol{u}_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|f_{k,j})]-<br />
\text{KL}(q(\boldsymbol{u}_k) \ || \ p_{\theta}(\boldsymbol{u}_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularization from previous tasks}}\]<br />
<br />
As a result, we regularize the learning of a new task by the sum of KL divergence terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math>, where each <math>q(\boldsymbol{u}_i)</math> encodes the knowledge about an earlier task <math>i < k</math>. To make the optimization computationally efficient, we can sub-sample the KL terms in the sum and perform stochastic approximation over the regularization term.<br />
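The sub-sampling of KL terms can be sketched as an unbiased stochastic estimate. The rescaling factor is the standard trick for unbiasedness; the concrete scheme here is an assumption, since the paper does not spell it out:<br />

```python
import numpy as np

def subsampled_kl_penalty(kl_terms, n_sample, rng):
    """Unbiased stochastic estimate of sum_i KL_i: draw n_sample of the k-1
    previous-task KL terms without replacement and rescale by (k-1)/n_sample."""
    kl_terms = np.asarray(kl_terms, dtype=float)
    k_minus_1 = len(kl_terms)
    idx = rng.choice(k_minus_1, size=n_sample, replace=False)
    return kl_terms[idx].sum() * (k_minus_1 / n_sample)
```

In expectation over the random subset, the rescaled sum equals the full regularizer, so stochastic gradient steps remain unbiased while only a few task summaries are touched per iteration.<br />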
<br />
=== Alternative Inference for the Current Task ===<br />
<br />
Given this framework of sparse GP inference, the authors propose a further improvement to obtain more accurate estimates of the posterior belief <math>q(\boldsymbol{u}_k)</math>: performing inference over the current task in the weight space. Given the trade-off between accuracy and scalability imposed by the number of inducing points, we can use a full Gaussian variational approximation <br />
\[q(w_k) = \mathcal{N}(w_k|\mu_{w_k}, \Sigma_{w_k})\]<br />
by letting <math>f_k(x; w_k) = w_k^T \phi(x; \theta)</math>, <math>w_k \sim \mathcal{N}(0, \sigma_w^2 I)</math>. <br />
The objective becomes<br />
\[\mathcal{F}(\theta, q(w_k)) = \underbrace{\sum_{j=1}^{N_k} \mathbb{E}_{q(f_{k,j})}[\log p(y_{k,j}|w_k^T \phi(x_{k,j}; \theta))]-<br />
\text{KL}(q(w_k) \ || \ p(w_k))}_{\text{objective for the current task}} - \underbrace{\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_{\theta}(\boldsymbol{u}_i))}_{\text{regularization from previous tasks}}\]<br />
<br />
After learning <math>\mu_{w_k}</math> and <math>\Sigma_{w_k}</math>, we can also compute the posterior distribution over the function values <math>\boldsymbol{u}_k</math> according to <math>q(\boldsymbol{u}_k) = \mathcal{N}(\boldsymbol{u}_k|\mu_{u_k}, L_{u_k}L_{u_k}^T)</math>, where <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k} </math>, and <math>\Phi_{Z_k}</math> stores as rows the feature vectors evaluated at <math>Z_k</math>.<br />
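The weight-space-to-function-space mapping <math>\mu_{u_k} = \Phi_{Z_k}\mu_{w_k}</math>, <math>L_{u_k}=\Phi_{Z_k}L_{w_k}</math> is a pair of matrix products; a sketch with random stand-in quantities:<br />

```python
import numpy as np

# Map the weight-space posterior q(w_k) = N(mu_w, L_w L_w^T) to the posterior
# over function values u_k = Phi_Z w_k at the inducing points:
#   mu_u = Phi_Z mu_w,   L_u = Phi_Z L_w.
rng = np.random.default_rng(0)
K_feat, M = 8, 5                              # feature dimension, inducing points
Phi_Z = rng.standard_normal((M, K_feat))      # rows: feature vectors at Z_k
mu_w = rng.standard_normal(K_feat)
L_w = np.tril(rng.standard_normal((K_feat, K_feat)))

mu_u = Phi_Z @ mu_w
L_u = Phi_Z @ L_w
Sigma_u = L_u @ L_u.T  # = Phi_Z (L_w L_w^T) Phi_Z^T, the covariance of q(u_k)
```

This follows directly from <math>\boldsymbol{u}_k = \Phi_{Z_k} w_k</math> being a linear transformation of a Gaussian variable.<br />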
<br />
The figure below is a depiction of the proposed method.<br />
<br />
[[File:FRCL-depiction-approach.jpg|1000px]]<br />
<br />
=== Selection of the Inducing Points ===<br />
<br />
In practice, a simple but effective way to select inducing points is to take a random subset <math>Z_k</math> of the training inputs <math>X_k</math>. In this paper, the authors propose a more structured way to select them: an unsupervised criterion that depends only on the training inputs and quantifies how well the full kernel matrix <math>K_{X_k}</math> can be reconstructed from the inducing inputs. This is done by minimizing the trace of the covariance matrix of the prior GP conditional <math>p(\boldsymbol{f}_k|\boldsymbol{u}_k)</math>:<br />
\[\mathcal{T}(Z_k)=\text{tr}(K_{X_k} - K_{X_kZ_k}K_{Z_k}^{-1}K_{Z_kX_k}),\]<br />
where <math>K_{X_k} = K(X_k, X_k), K_{X_kZ_k} = K(X_k, Z_k), K_{Z_k} = K(Z_k, Z_k)</math>, and <math>K(\cdot, \cdot)</math> is the kernel function parametrized by <math>\theta</math>. This method promotes finding inducing points <math>Z_k</math> that are spread evenly in the input space. As an example, see the following figure, where the final selected inducing points are spread out across different clusters of data. The round dots represent the data points and each color corresponds to a different label.<br />
<br />
[[File:inducing-points-extended.png|centre]]<br />
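A greedy sketch of minimizing the trace criterion <math>\mathcal{T}(Z_k)</math> follows. The greedy strategy and the jitter term are implementation assumptions, since the paper only defines the criterion itself:<br />

```python
import numpy as np

def trace_criterion(K, Z_idx):
    """T(Z) = tr(K_XX - K_XZ K_ZZ^{-1} K_ZX), using indices Z_idx into the
    training inputs; a small jitter keeps K_ZZ invertible."""
    K_XZ = K[:, Z_idx]
    K_ZZ = K[np.ix_(Z_idx, Z_idx)] + 1e-8 * np.eye(len(Z_idx))
    return np.trace(K - K_XZ @ np.linalg.solve(K_ZZ, K_XZ.T))

def greedy_select(K, M):
    """Greedily add the training point that most reduces the trace criterion."""
    selected = []
    for _ in range(M):
        candidates = [j for j in range(K.shape[0]) if j not in selected]
        best = min(candidates, key=lambda j: trace_criterion(K, selected + [j]))
        selected.append(best)
    return selected
```

Because the criterion is the residual of a Nystrom-style reconstruction of <math>K_{X_k}</math>, it decreases as points are added and reaches (approximately) zero when all training inputs are selected; the greedy choice naturally spreads the points across the input space.<br />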
<br />
=== Prediction ===<br />
<br />
Given a test data point <math>x_{i,*}</math>, we can obtain the predictive density function of its output <math>y_{i,*}</math> given by<br />
\begin{align*}<br />
p(y_{i,*}) &= \int p(y_{i,*}|f_{i,*}) p_\theta(f_{i,*}|\boldsymbol{u}_i)q(\boldsymbol{u}_i) d\boldsymbol{u}_i df_{i,*}\\<br />
&= \int p(y_{i,*}|f_{i,*}) q_\theta(f_{i,*}) df_{i,*},<br />
\end{align*}<br />
where <math>q_\theta(f_{i,*})=\mathcal{N}(f_{i,*}| \mu_{i,*}, \sigma_{i,*}^2)</math> with known mean and variance<br />
\begin{align*}<br />
\mu_{i,*} &= \mu_{u_i}^T K_{Z_i}^{-1} k_{Z_i x_{i,*}}\\<br />
\sigma_{i,*}^2 &= k(x_{i,*}, x_{i,*}) + k_{Z_i x_{i,*}}^T K_{Z_i}^{-1}[L_{u_i}L_{u_i}^T - K_{Z_i}] K_{Z_i}^{-1} k_{Z_i x_{i,*}}<br />
\end{align*}<br />
Note that all the terms in <math>\mu_{i,*}</math> and <math>\sigma_{i,*}^2</math> are either already estimated or depend on some estimated parameters.<br />
<br />
It is important to emphasize that the mean <math>\mu_{i,*}</math> can be further rewritten as <math>\mu_{u_i}^TK_{Z_i}^{-1}\Phi_{Z_i}\phi(x_{i,*};\theta)</math>, which, notably, depends on <math>\theta</math>. This means that the expectation of <math>f_{i,*}</math> changes as more tasks are learned, so the overall prediction does not go out of date. In comparison, if we used a stored distribution over the weights <math>w_i</math>, the mean of that distribution would remain unchanged over time, resulting in obsolete predictions.<br />
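The predictive mean and variance above can be sketched directly from a stored task summary; the variable names here are illustrative:<br />

```python
import numpy as np

def sparse_gp_predict(k_ss, k_Zs, K_Z, mu_u, L_u):
    """Predictive mean and variance at a test input x* from a task summary:
       mu*      = mu_u^T K_Z^{-1} k_Z*
       sigma*^2 = k(x*, x*) + k_Z*^T K_Z^{-1} (L_u L_u^T - K_Z) K_Z^{-1} k_Z*
    """
    A = np.linalg.solve(K_Z, k_Zs)  # K_Z^{-1} k_Z*
    mean = mu_u @ A
    var = k_ss + A @ (L_u @ L_u.T - K_Z) @ A
    return mean, var
```

As a sanity check, when <math>q(\boldsymbol{u}_i)</math> equals the prior (zero mean, covariance <math>K_{Z_i}</math>), the bracketed term vanishes and the prediction reverts to the prior mean and variance.<br />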
<br />
== Detecting Task Boundaries ==<br />
<br />
In the previous discussion, we have assumed the task boundaries are known, but this assumption is often unrealistic in practical settings. Therefore, the authors introduce a way to detect task boundaries using GP predictive uncertainty: measure the distance between the GP posterior density of data from a new task and the prior GP density using the symmetric KL divergence. We denote this score by <math>\ell_i</math>, which can be interpreted as a degree of surprise about <math>x_i</math> - the smaller <math>\ell_i</math> is, the more surprising <math>x_i</math> is. Before making any updates to the parameters, we can perform a statistical test between the scores <math>\{\ell_i\}_{i=1}^b</math> for the current batch and those from the previous batch, <math>\{\ell_i^{old}\}_{i=1}^b</math>. A natural choice is Welch's t-test, which is commonly used to compare two groups of data with unequal variances.<br />
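A minimal sketch of the boundary test, assuming a hand-picked threshold on the Welch t-statistic (the paper uses a statistical test but the threshold value here is an illustrative assumption):<br />

```python
import numpy as np

def welch_t(a, b):
    """Welch's t-statistic for two samples with unequal variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)

def boundary_detected(scores_new, scores_old, threshold=-6.0):
    """Flag a task switch when the surprise scores of the current batch drop
    sharply below those of the previous batch (large negative t-statistic).
    The threshold value is an illustrative assumption."""
    return welch_t(scores_new, scores_old) < threshold
```

Since a small <math>\ell_i</math> means a surprising input, a sharp drop in the current batch's scores relative to the previous batch yields a large negative t-statistic and signals a task switch.<br />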
<br />
The figure below illustrates the intuition behind this method. With red dots indicating a new task, we can see the GP posterior (green line) reverts back to the prior (purple line) when it encounters the new task. Hence, this explains why a small <math>\ell_i</math> corresponds to a task switch.<br />
<br />
[[File:detecting-boundaries.jpg|700px]]<br />
<br />
== Algorithm ==<br />
<br />
[[File:FRCL-algorithm.jpg|700px]]<br />
<br />
== Experiments ==<br />
<br />
The authors aimed to answer three questions:<br />
<br />
# How does FRCL compare to state-of-the-art algorithms for Continual Learning?<br />
# How does the criterion for inducing point selection affect accuracy?<br />
# When ground truth task boundaries are not given, does the detection method mentioned above succeed in detecting task changes?<br />
<br />
=== Comparison to state-of-the-art algorithms ===<br />
<br />
The proposed method was applied to two MNIST-variation datasets (in Table 1) and the more challenging Omniglot benchmark (in Table 2). They compared the method with randomly selected inducing points, denoted by FRCL(RANDOM), and the method with inducing points optimized using trace criterion, denoted by FRCL(TRACE). The baseline method corresponds to a simple replay-buffer method described in the appendix of the paper. Both tables show that the proposed method gives strong results, setting a new state-of-the-art result on both Permuted-MNIST and Omniglot.<br />
<br />
[[File:FRCL-table1.jpg|700px]]<br />
[[File:FRCL-table2.jpg|750px]]<br />
<br />
=== Comparison of different criteria for inducing points selection ===<br />
<br />
As can be seen from the figure below, the purple box corresponding to FRCL(TRACE) is consistently higher than the others, and this difference is larger when the number of inducing points is smaller. Hence, a structured selection criterion becomes increasingly important as the number of inducing points decreases.<br />
<br />
[[File:FRCL-compare-inducing-points.jpg|700px]]<br />
<br />
=== Efficacy in detecting task boundaries ===<br />
<br />
From the following figure, we can observe that both the mean symmetric KL divergence and the t-test statistic always drop when a new task is introduced. Hence, the proposed method for detecting task boundaries indeed works.<br />
<br />
[[File:FRCL-test-boundary.jpg|700px]]<br />
<br />
== Conclusions ==<br />
<br />
The proposed method unifies both the regularization-based method and the replay/rehearsal method in Continual Learning. It was able to overcome the disadvantages of both methods. Moreover, the Bayesian framework allows a probabilistic interpretation of deep neural networks. From the experiments we can make the following conclusions:<br />
* The proposed method sets new state-of-the-art results on Permuted-MNIST and Omniglot, and is comparable to the existing results on Split-MNIST.<br />
* A structured criterion for selecting inducing points becomes increasingly important with a decreasing number of inducing points.<br />
* The method is able to detect task boundary changes when they are not given.<br />
<br />
Future work can include enforcing a fixed memory buffer where the summary of all previously seen tasks is compressed into one summary. It would also be interesting to investigate the application of the proposed method to other domains such as reinforcement learning.<br />
<br />
== Critiques ==<br />
This paper presents a new way of remembering previous tasks by reducing the KL divergence between the variational distributions <math>q(\boldsymbol{u}_i)</math> and the priors <math>p_\theta(\boldsymbol{u}_i)</math>. The ideas in the paper are interesting and the experiments support the effectiveness of this approach. After reading the summary, the following points came to my mind:<br />
<br />
The main problem with Gaussian processes is their test-time computational load: a Gaussian process needs the data matrix and kernel evaluations for each prediction. This is natural, since a Gaussian process is non-parametric and has no source of knowledge other than the data, but the resulting computational and memory costs make it difficult to employ in practice. In this paper, the authors propose to use a subset of the training data, namely "Inducing Points", to mitigate these challenges. They propose to choose the inducing points either at random or via an optimization scheme in which the inducing points should approximate the kernel. Although the authors formulate the problem of inducing points within their own setting, this is not a new approach in the field; it was previously known as the Finding Exemplars problem. In fact, their formulation is very similar to the ideas in the following paper:<br />
<br />
Elhamifar, Ehsan, Guillermo Sapiro, and Rene Vidal. '''Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery.''' Advances in Neural Information Processing Systems. 2012.<br />
<br />
More precisely, the main difference is that the current paper employs the kernel matrix, whereas the mentioned paper employs pairwise dissimilarities, to find exemplars or inducing points.<br />
<br />
Moreover, one unanswered question is how to determine the number of exemplars, as they play an important role in this algorithm.<br />
<br />
Besides, one practical point is replacing the covariance matrix with its Cholesky decomposition. In practice, covariance matrices are in general only positive semi-definite, while, to the best of my knowledge, the Cholesky decomposition requires a positive definite matrix. Considering this, I am not sure what happens if the Cholesky decomposition is explicitly applied to a covariance matrix that is not strictly positive definite.<br />
<br />
Finally, the number of regularization terms <math>\sum_{i=1}^{k-1}\text{KL}(q(\boldsymbol{u}_i) \ || \ p_\theta(\boldsymbol{u}_i))</math> grows linearly with the number of tasks, so I am not sure how this algorithm behaves as the number of tasks increases. Clearly, apart from the computational cost, having many regularization terms can make optimization more difficult.<br />
<br />
The provided experiments are interesting and fairly comprehensive, and they do a good job of highlighting different facets of the paper, but it would be better if these two additional results were provided as well: (1) How well-calibrated are FRCL-based classifiers? (2) How impactful is the hybrid representation for test-time performance?<br />
<br />
== Source Code ==<br />
<br />
https://github.com/AndreevP/FRCL<br />
<br />
== References ==<br />
<br />
[1] Rasmussen, Carl Edward and Williams, Christopher K. I., 2006, Gaussian Processes for Machine Learning, The MIT Press.<br />
<br />
[2] Quinonero-Candela, Joaquin and Rasmussen, Carl Edward, 2005, A Unifying View of Sparse Approximate Gaussian Process Regression, Journal of Machine Learning Research, Volume 6, P1939-1959.<br />
<br />
[3] Snelson, Edward and Ghahramani, Zoubin, 2006, Sparse Gaussian Processes using Pseudo-inputs, Advances in Neural Information Processing Systems 18, P1257-1264.<br />
<br />
[4] Michalis K. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes, 2009, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Volume 5, P567-574. <br />
<br />
[5] Michalis K. Titsias, Jonathan Schwarz, Alexander G. de G. Matthews, Razvan Pascanu, Yee Whye Teh, 2020, Functional Regularisation for Continual Learning with Gaussian Processes, ArXiv abs/1901.11356.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization&diff=48942Meta-Learning For Domain Generalization2020-12-02T22:25:13Z<p>G45sharm: /* Final Objective Function */</p>
<hr />
<div>== Presented by ==<br />
Parsa Ashrafi Fashi<br />
<br />
== Introduction ==<br />
<br />
This paper proposes a novel meta-learning method for domain generalization. The domain shift problem arises when a model trained on one data distribution cannot perform well when tested on a domain with a different distribution. Domain Generalization (DG) tries to tackle this problem by producing models that can perform well on unseen target domains. Several approaches have been adopted for the problem, such as training a model for each source domain, extracting a domain-agnostic representation, and semantic feature learning. Meta-learning, and specifically Model-Agnostic Meta-Learning (MAML), which has been widely adopted recently, produces models capable of adapting or generalizing to new tasks and new environments that have never been encountered during training time. Meta-learning is also known as "learning to learn": it aims to enable intelligent agents to take the principles they learned in one domain and apply them to other domains. One concrete meta-learning task is to create a game bot that can quickly master a new game. By treating domains as tasks, the paper tries to overcome the domain shift problem in a model-agnostic way.<br />
<br />
== Previous Work ==<br />
There were 3 common approaches to Domain Generalization. The simplest way is to train a model for each source domain and estimate which model performs better on a new unseen target domain [1]. A second approach is to presume that any domain is composed of a domain-agnostic and a domain-specific component. By factoring out the domain-specific and domain-agnostic components during training on source domains, the domain-agnostic component can be extracted and transferred as a model that is likely to work on a new source domain [2]. Finally, a domain-invariant feature representation is learned to minimize the gap between multiple source domains and it should provide a domain-independent representation that performs well on a new target domain [3][4][5].<br />
<br />
== Method ==<br />
Let <math> S </math> and <math>T</math> be the source and target domains in the DG setting, respectively. We define a single model parametrized by <math> \theta </math> to solve the specified task. DG aims to train <math> \theta </math> on the source domains such that it generalizes to the target domains. At each learning iteration we split the <math>S</math> original source domains into <math>S-V</math> meta-train domains <math> \bar{S} </math> and <math>V</math> meta-test domains <math> \breve{S} </math> (virtual test domains). This mimics real train-test domain shifts so that, over many iterations, we can train a model to achieve good generalization in the final test evaluated on the target domains <math>T</math>. <br />
<br />
The paper explains the method based on two approaches; Supervised Learning and Reinforcement Learning.<br />
<br />
=== Supervised Learning ===<br />
<br />
First, <math> l(\hat{y},y) </math> is defined as a cross-entropy loss function ( <math> l(\hat{y},y) = -y\log(\hat{y}) </math>). The process is as follows.<br />
<br />
==== Meta-Train ====<br />
The model is updated on S-V domains <math> \bar{S} </math> and the loss function is defined as: <math> F(.) = \frac{1}{S-V} \sum\limits_{i=1}^{S-V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
In this step the model is optimized by gradient descent as follows: <math> \theta^{\prime} = \theta - \alpha \nabla_{\theta}F(\theta) </math><br />
<br />
==== Meta-Test ====<br />
<br />
In each mini-batch the model is also virtually evaluated on the V meta-test domains <math>\breve{S}</math>. This meta-test evaluation simulates testing on new domains with different statistics, in order to allow learning to generalize across domains. The loss for the adapted parameters calculated on the meta-test domains is as follows: <math> G(.) = \frac{1}{V} \sum\limits_{i=1}^{V} \frac {1}{N_i} \sum\limits_{j=1}^{N_i} l_{\theta^{\prime}}(\hat{y}_j^{(i)}, y_j^{(i)})</math><br />
<br />
The loss on the meta-test domain is calculated using the updated parameters <math>\theta' </math> from meta-train. This means that for optimization with respect to <math>G </math> we will need the second derivative with respect to <math>\theta </math>. <br />
<br />
==== Final Objective Function ====<br />
<br />
Combining the two loss functions, the final objective function is: <math> argmin_{\theta} \; F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta)) </math>, where <math>\beta</math> weights the relative importance of the meta-test loss. Algorithm 1 illustrates the supervised learning approach. <br />
<br />
[[File:ashraf1.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 1: MLDG Supervised Learning Approach.</div><br />
<br />
This supervised learning methodology is shown schematically in Figure 1.<br />
<br />
[[File:supervisedd.png |center]]<br />
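The final objective <math>F(\theta) + \beta G(\theta - \alpha F^{\prime}(\theta))</math> can be sketched with toy quadratic losses standing in for the meta-train and meta-test domain losses:<br />

```python
import numpy as np

def mldg_objective(theta, F, G, grad_F, alpha=0.01, beta=1.0):
    """MLDG meta-objective: F(theta) + beta * G(theta - alpha * grad_F(theta)).
    F is the meta-train loss, G the meta-test loss; the inner gradient step
    simulates adapting before being evaluated on the virtual test domains."""
    theta_prime = theta - alpha * grad_F(theta)
    return F(theta) + beta * G(theta_prime)

# Toy quadratic losses standing in for the domain losses.
F = lambda th: 0.5 * np.sum((th - 1.0) ** 2)
grad_F = lambda th: th - 1.0
G = lambda th: 0.5 * np.sum((th + 1.0) ** 2)

obj = mldg_objective(np.zeros(2), F, G, grad_F)
```

Optimizing this objective with respect to <math>\theta</math> requires differentiating through the inner update, i.e., second derivatives of <math>F</math>, which is why the evaluation of <math>G</math> uses the adapted parameters <math>\theta'</math>.<br />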
<br />
=== Reinforcement Learning ===<br />
<br />
In application to the reinforcement learning (RL) setting, we now assume an agent with a policy <math> \pi </math> that inputs states <math> s </math> and produces actions <math> a </math> in a sequential decision-making task: <math>a_t = \pi_{\theta}(s_t)</math>. The agent operates in an environment and its goal is to maximize its discounted return, <math> R = \sum\limits_{t} \delta^t R_t(s_t, a_t) </math>, where <math> R_t </math> is the reward obtained at timestep <math> t </math> under policy <math> \pi </math> and <math> \delta </math> is the discount factor. What supervised learning treats as tasks correspond to reward functions here, and domains correspond to solving the same task (reward function) in different environments. Therefore, domain generalization yields an agent that is able to perform well even in new environments without any additional learning.<br />
==== Meta-Train ==== <br />
In meta-training, the loss function <math> F(\cdot) </math> now corresponds to the negative discounted return <math> -R </math> of policy <math> \pi_{\theta} </math>, averaged over all the meta-training environments in <math> \bar{S} </math>. That is, <br />
\begin{align}<br />
F = \frac{1}{|\bar{S}|} \sum_{s \in \bar{S}} -R_s<br />
\end{align}<br />
<br />
Then the optimal policy is obtained by minimizing <math> F </math>.<br />
<br />
==== Meta-Test ====<br />
This step is analogous to the meta-test step of the supervised setting, and the loss is again the negative return. For RL, calculating this loss requires rolling out the meta-train updated policy <math> \theta' </math> in the meta-test domains to collect new trajectories and rewards. The reinforcement learning approach is illustrated in Algorithm 2.<br />
[[File:ashraf2.jpg |center|600px]]<br />
<br />
<div align="center">Algorithm 2: MLDG Reinforcement Learning Approach.</div><br />
<br />
==== Alternative Variants of MLDG ====<br />
The authors propose several variants of the MLDG objective function. For example, the so-called MLDG-GC variant normalizes the meta-train and meta-test gradients and rewards their cosine similarity, encouraging updates that improve both losses. It is given by:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) + \beta G(\theta) - \beta \alpha \frac{F'(\theta) \cdot G'(\theta)}{||F'(\theta)||_2 ||G'(\theta)||_2}.<br />
\end{equation}<br />
<br />
Another variant assumes that the meta-train optimization has converged, and then asks that the meta-test gradient, evaluated at the updated parameters, also be small in norm. This intuition gives the following objective function, called MLDG-GN:<br />
\begin{equation}<br />
\text{argmin}_\theta F(\theta) + \beta ||G'(\theta - \alpha F'(\theta))||_2^2.<br />
\end{equation}<br />
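The alignment term in MLDG-GC is an ordinary cosine similarity between the meta-train and meta-test gradients, here sketched for gradients represented as flat lists (the function name is ours):<br />

```python
import math

def grad_cosine(f_grad, g_grad, eps=1e-12):
    """Cosine similarity F'(theta) . G'(theta) / (||F'(theta)|| * ||G'(theta)||)."""
    dot = sum(a * b for a, b in zip(f_grad, g_grad))
    norm_f = math.sqrt(sum(a * a for a in f_grad))
    norm_g = math.sqrt(sum(b * b for b in g_grad))
    return dot / (norm_f * norm_g + eps)

# Minimizing -beta * alpha * grad_cosine(...) rewards meta-train and meta-test
# gradients that point in the same direction.
```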
<br />
== Experiments ==<br />
<br />
The proposed method is evaluated in four experiments (two supervised learning and two reinforcement learning). <br />
<br />
=== Illustrative Synthetic Experiment ===<br />
<br />
In this experiment, nine domains are synthesized by sampling curved deviations from a diagonal linear decision boundary. Eight of these are treated as sources for meta-learning and the last is held out for the final test. Fig. 1 shows the nine synthetic domains, which are related in form but differ in the details of their decision boundary. The results show that MLDG performs near-perfectly, while the baseline model, which ignores domains, overfits in the bottom-left corner. The methods compared in Fig. 1 are MLP-All (the baseline), MLDG, MLDG-GC, and MLDG-GN.<br />
<br />
[[File:ashraf3.jpg |center|600px]]<br />
<br />
<div align="center">Figure 1: Synthetic experiment illustrating MLDG.</div><br />
<br />
=== Object Recognition === <br />
The PACS multi-domain recognition benchmark, a dataset designed for cross-domain recognition problems, is used for the object recognition task. This dataset has 7 categories (‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘house’, ‘horse’ and ‘person’) and 4 domains of different stylistic depictions (‘Photo’, ‘Art painting’, ‘Cartoon’ and ‘Sketch’). The diverse depiction styles create a significant domain gap. The results of the proposed approach compared to other approaches are presented in Table 1. The baseline models are D-MTAE [5], Deep-All (vanilla AlexNet) [2], DSN [6] and AlexNet+TF [2]. On average, the proposed method outperforms the other methods. <br />
<br />
[[File:ashraf4.jpg |center|800px]]<br />
<br />
<div align="center">Table 1: Cross-domain recognition accuracy (Multi-class accuracy) on the PACS dataset. Best performance in bold. </div><br />
<br />
=== Cartpole ===<br />
<br />
The objective is to balance a pole upright by moving a cart. The action space is discrete: left or right. The state has four elements: the position and velocity of the cart and the angular position and velocity of the pole. Two sub-experiments are designed. In the first, the domain factor is varied by changing the pole length, simulating 9 domains with different pole lengths. In the second, multiple domain factors are varied: pole length and cart mass. In both experiments, 6 source domains are randomly chosen for training and 3 domains are held out for (true) testing. Since the game can last forever if the pole does not fall, the maximum number of steps is capped at 200. The results of both experiments are presented in Tables 2 and 3. The baseline methods are RL-All (trains a single policy by aggregating the reward from all six source domains), RL-Random-Source (trains on a single randomly selected source domain) and RL-undo-bias (an adaptation of the linear undo-bias model of [7]). The proposed MLDG outperforms the baselines.<br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 2: Cart-Pole RL. Domain generalisation performance across pole length. Average reward testing on 3 held out domains with random lengths. Upper bound: 200. </div><br />
<br />
[[File:ashraf5.jpg |center|800px]]<br />
<br />
<div align="center">Table 3: Cart-Pole RL. Generalization performance across both pole length and cart mass. Return testing on 3 held out domains with random length and mass. Upper bound: 200. </div><br />
<br />
=== Mountain Car ===<br />
<br />
In this classic RL problem, a car is positioned between two mountains, and the agent needs to drive the car to the peak of the right mountain. The difficulty is that the car's engine is not strong enough to drive up the right mountain directly; the agent has to first drive up the left mountain to build momentum before driving up the right one. The state observation consists of two elements: the position and velocity of the car. There are three available actions: drive left, do nothing, and drive right. The baselines are the same as in the Cartpole experiment. The model does not outperform RL-undo-bias but achieves a close return value. The results are shown in Table 4.<br />
<br />
[[File:ashraf7.jpg |center|800px]]<br />
<br />
<div align="center">Table 4: Domain generalisation performance for mountain car. Failure rate (↓) and reward (↑) on held-out testing domains with random mountain heights. </div><br />
<br />
== Conclusion ==<br />
<br />
This paper proposed a model-agnostic approach to domain generalization. Unlike prior model-based domain generalization approaches, it scales well with the number of domains and can be applied to different neural network architectures. Experimental evaluation shows state-of-the-art results on a recent challenging visual recognition benchmark and promising results on multiple classic RL problems.<br />
<br />
== Source Code ==<br />
<br />
Four different implementations of this paper are publicly available at [https://paperswithcode.com/paper/learning-to-generalize-meta-learning-for#code Papers with Code (MLDG)].<br />
<br />
== Critiques ==<br />
<br />
I believe that the meta-learning-based approach (MLDG), which extends MAML to the domain generalization problem, might have some limitations. The MAML objective is better suited to fast task adaptation, as the tasks presented in the original MAML paper suggest. Also, in domain generalization we do not have access to samples from the new domain, so a MAML-like objective might lead to sub-optimal solutions, as it is highly abstracted from the feature representations. In addition, it is hard to scale MLDG to deep architectures like ResNet, since it requires differentiating through k iterations of optimization updates; it is therefore likely to be more effective on task networks, which are much shallower than feature networks.<br />
<br />
<br />
Why does meta-learning make domain generalization domain-agnostic? <br />
<br />
In the case that we have four domains, do we randomly pick two domains for meta-train and one for meta-test? If so, since two of the three source domains are selected for meta-train, are the meta-train splits likely to be similar between episodes?<br />
<br />
The paper would have benefited from demonstrating the strength of MLDG in low-dimensional embedding spaces (t-SNE, UMAP) for PACS and other datasets. It is unclear how well the algorithm would have performed domain-agnostically on these datasets.<br />
<br />
== References ==<br />
<br />
[1]: [Xu et al. 2014] Xu, Z.; Li, W.; Niu, L.; and Xu, D. 2014. Exploiting low-rank structure from latent domains for domain generalization. In ECCV.<br />
<br />
[2]: [Li et al. 2017] Li, D.; Yang, Y.; Song, Y.-Z.; and Hospedales, T. 2017. Deeper, broader, and artier domain generalization. In ICCV.<br />
<br />
[3]: [Muandet, Balduzzi, and Schölkopf 2013] Muandet, K.; Balduzzi, D.; and Schölkopf, B. 2013. Domain generalization via invariant feature representation. In ICML.<br />
<br />
[4]: [Ganin and Lempitsky 2015] Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML.<br />
<br />
[5]: [Ghifary et al. 2015] Ghifary, M.; Bastiaan Kleijn, W.; Zhang, M.; and Balduzzi, D. 2015. Domain generalization for object recognition with multi-task autoencoders. In ICCV.<br />
<br />
[6]: [Bousmalis et al. 2016] Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; and Erhan, D. 2016. Domain separation networks. In NIPS.<br />
<br />
[7]: [Khosla et al. 2012] Khosla, A.; Zhou, T.; Malisiewicz, T.; Efros, A. A.; and Torralba, A. 2012. Undoing the damage of dataset bias. In ECCV.</div>
Adversarial Fisher Vectors for Unsupervised Representation Learning (revision of 2020-12-02)
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models; in a GAN, a discriminator and a generator compete with each other in a minimax game. Based on the original GAN paper, when training is finished and a Nash equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. This means that in this setting the discriminator is nothing more than a tool to train the generator. Furthermore, the generator in a traditional GAN models the data density implicitly, while some applications require an explicit generative model of the data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GANs. Although the two games are similar, an advantage of the EBM view is that, unlike in the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Building on these observations, the authors show that an energy-based model can be trained using a minimax formulation similar to that of GANs. After training the energy-based model, they use the Fisher Score and Fisher Information (calculated from the derivative of the generative model's log-likelihood w.r.t. its parameters) to evaluate how well the discriminator models the data distribution. More precisely, they compute normalized Fisher Vectors and a Fisher Distance measure from the discriminator's derivatives to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). The Fisher Vector is a powerful representation that can be computed from this EBM precisely because the discriminator is an explicit density model of the data. It can be applied to the challenging problem of representing sets; as we will see, the Fisher kernel makes it possible to calculate the distance between two sets of images, which is not a trivial task. The authors find several applications and attractive characteristics of AFVs as pre-trained features, such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stabilization (MCMC updates), both of which are difficult in general.<br />
* Using AFV to estimate the distance between sets, which allows monitoring the training process. More precisely, the Fisher Distance between a set of validation examples and a set of generated examples can effectively capture overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
Where <math> p_{data(\mathbf{x})} </math>, <math> D(x) </math>, and <math> G(x) </math> are the data distribution, the discriminator, and the generator, respectively. To optimize the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math> is updated one step given <math> D </math>.<br />
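For intuition, the discriminator's inner-loop objective is the familiar binary cross-entropy over batches of real and generated examples; a minimal numerical sketch (function and variable names are ours):<br />

```python
import math

def gan_discriminator_loss(d_real, d_fake):
    """E[-log D(x)] + E[-log(1 - D(G(z)))] estimated over finite batches of D outputs."""
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator has low loss; the uninformative D(x) = 0.5
# everywhere (the Nash equilibrium mentioned above) gives 2 * log(2).
```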
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a form of generative model that assigns a scalar energy <math> E(\mathbf{x}) </math> to each input, implicitly defining a density in which low-energy configurations are more probable. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log-likelihood (NLL) of <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(x) </math> is an auxiliary distribution, called the variational distribution, and <math>H(q) </math> denotes its entropy. Jensen's inequality is used to obtain the variational lower bound on the NLL. This bound is tight if <math> q(x) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x} </math>, which means <math> q(x) = p_{E}(\mathbf{x}) </math>. In this case, if we set <math> D(\mathbf{x})= -E(\mathbf{x}) </math> and <math> q(x)= p_{G}(\mathbf{x}) </math>, Eq.\ref{2} becomes the following problem:<br />
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[D(G(\mathbf{z}))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where the variational lower bound is maximized w.r.t. <math> p_{G}</math>; the energy model is then updated one step to decrease the NLL under the optimal <math> p_{G}</math> (see Figure 1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both take the form of a minimax game between <math> D </math> and <math> G </math>. However, there are 3 major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As mentioned, one of the most important advantages of an EBM GAN over a traditional one is that the discriminator is a dual form of the generator. This means that the discriminator can define a distribution that matches the training data. Generally, there is a straightforward way to evaluate the quality of the generator: inspect the quality of produced samples. However, it is not clear how to evaluate or use a discriminator trained in a minimax scheme. To evaluate and also employ the discriminator of the GAN, the authors propose to use the theory of Fisher Information. This theory was originally proposed with the motivation of making connections between the two types of models used in machine learning, i.e., generative and discriminative models. Given a density model <math> p_{\theta}(\mathbf{x})</math>, where <math> \mathbf{x} \in R^d </math> is the input and <math> \theta </math> are the model parameters, the Fisher Score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> into a feature vector that is a point in the gradient space of the manifold. Intuitively, the gradient <math> U_\mathbf{x} </math> defines the direction of steepest ascent in <math> \log p_{\theta}(\mathbf{x}) </math> for the example <math> \mathbf{x} </math> along the manifold. In other words, the Fisher Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The Fisher Information is defined as <math> I=E_{\mathbf{x} \sim p_{\theta}(\mathbf{x})} [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having the Fisher Information and the Fisher Score, one can then map an example <math> \mathbf{x} </math> from the feature space to the model space, and measure the proximity between two examples <math> \mathbf{x} </math>, <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity is defined as <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>. This metric, called the Fisher Distance, can easily be generalized to measure the distance between two sets. Finally, the Adversarial Fisher Vector (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, the Fisher Distance is equivalent to the Euclidean distance between AFVs. Fisher Vector theory has traditionally been applied with simple generative models such as Gaussian mixture models (GMMs).<br />
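The equivalence between the Fisher Distance and the Euclidean distance between AFVs can be checked numerically. The sketch below uses synthetic Fisher Scores and a synthetic positive-definite Fisher Information matrix, standing in for quantities that would come from a trained model:<br />

```python
import numpy as np

# Check that the Fisher distance (U_x - U_y)^T I^{-1} (U_x - U_y) equals the
# squared Euclidean distance between the whitened vectors V = I^{-1/2} U.
rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d))
I = A @ A.T + d * np.eye(d)            # symmetric positive-definite "Fisher Information"
U_x, U_y = rng.normal(size=d), rng.normal(size=d)

# Matrix inverse square root via eigendecomposition of the symmetric matrix I
w, Q = np.linalg.eigh(I)
I_inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T

V_x, V_y = I_inv_sqrt @ U_x, I_inv_sqrt @ U_y
fisher_dist = (U_x - U_y) @ np.linalg.inv(I) @ (U_x - U_y)
euclidean = np.sum((V_x - V_y) ** 2)
assert np.isclose(fisher_dist, euclidean)
```

The identity holds because <math> I^{-1} = I^{-1/2} I^{-1/2} </math> for a symmetric positive-definite <math> I </math>.<br />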
In the domain of EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{D(\mathbf{x},\theta)}}{\int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}} </math> (recall that <math> D(\mathbf{x}) = -E(\mathbf{x}) </math>) and <math> \theta </math> are the parameters of <math> D</math>, the Fisher Score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN the generator is updated during training to match the distribution <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq. \ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Score and Fisher Information in an EBM GAN as follows:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that by using the Fisher Score we transform data from the feature space to the parameter space, which means that the dimensionality of the vectors can easily reach millions. As a result, replacing <math> I </math> with <math>\mbox{diag}(I) </math> reduces the computational load of calculating the final AFV.<br />
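A minimal sketch of Eqs. \ref{5} and \ref{6} follows. The random gradient matrices are hypothetical stand-ins for the discriminator gradients <math>\nabla_{\theta} D</math> that backpropagation would produce in practice:<br />

```python
import numpy as np

# Sketch of Eq. (5)-(6): compute an AFV from stacks of per-example gradients.
rng = np.random.default_rng(1)
n_samples, n_params = 64, 10**4        # real parameter counts reach millions
grad_fake = rng.normal(size=(n_samples, n_params))   # stand-in for grad_theta D(G(z); theta)
grad_x = rng.normal(size=n_params)                   # stand-in for grad_theta D(x; theta)

# Eq. (5): Fisher Score with the partition-function term replaced by a
# Monte Carlo average over generator samples.
baseline = grad_fake.mean(axis=0)
U_x = grad_x - baseline
U_fake = grad_fake - baseline
I_diag = (U_fake ** 2).mean(axis=0)    # diag(I) = E_z[U_{G(z)}^2], element-wise

# Eq. (6): diagonal whitening yields the Adversarial Fisher Vector.
eps = 1e-8                             # guards against near-zero diagonal entries
V_x = U_x / np.sqrt(I_diag + eps)
```

Using only the diagonal of <math> I </math> keeps the cost linear in the number of parameters, which is what makes the method practical at this scale.<br />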
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they are used to sample from an unnormalized density and approximate the partition function. Stochastic gradient MCMC is of particular interest, as it utilizes the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients); this technique has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.<br />
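The stochastic gradient MCMC update referred to above can be illustrated on a toy energy. The sketch below runs an unadjusted Langevin update on <math> E(x) = x^2/2 </math>, whose stationary distribution is the standard normal; it illustrates the update rule only, not the paper's training procedure:<br />

```python
import numpy as np

# Langevin update: x <- x + (eta/2) * grad log p(x) + sqrt(eta) * noise.
# For E(x) = x^2 / 2, grad log p(x) = -x, so samples should converge to N(0, 1).
rng = np.random.default_rng(0)
eta = 0.1
x = rng.normal(size=5000)              # a population of samples, updated in parallel
for _ in range(2000):
    grad_log_p = -x                    # gradient of log p for the toy energy
    x = x + 0.5 * eta * grad_log_p + np.sqrt(eta) * rng.normal(size=x.shape)

# The empirical moments approach those of the standard normal.
assert abs(x.mean()) < 0.1 and abs(x.std() - 1.0) < 0.1
```

Training G to mimic this rule would mean each generator update moves samples a small step along the model's score, rather than allowing the abrupt changes described above.<br />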
<br />
== Related Work ==<br />
There are many variants of the GAN framework that use a discriminator as a critic to differentiate given distributions; examples include Wasserstein GAN, f-GAN, and MMD-GAN. There is a resemblance between the training procedure of GANs and deep EBMs (with variational inference), but the work presented in this paper is different in that its discriminator directly learns the target distribution, and its EBM implementation directly learns the parametrized sampler. In some works, regularization (by noise addition, penalizing gradients, or spectral normalization) has been introduced to make GAN training more stable, but these additions lack formal justification. This paper connects the MCMC-based G update rule with the gradient penalty line of work. The following equation shows how this method does not always sample from the generator: a small proportion (with probability p) of the samples come from real examples.<br />
<br />
<div align="center">[[File:related_work_equations.png]]</div><br />
<br />
Early works showed how Fisher Information can be used to measure similarity, and this was extended to Fisher Vector representations in the case of images. Recently, Fisher Information has also been used for meta-learning. This paper explores the possibility of using Fisher Information with deep generative models: by utilizing the generator as a sampler, Fisher Information can be computed even from an unnormalized density model.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As pointed out earlier, the main advantage of EBM GANs is their powerful discriminator, which can learn a density function that characterizes the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors proposed to use Fisher Information theory: they trained GAN models under different settings, employed the discriminator to extract AFVs, and then used these vectors for classification tasks with unsupervised pretraining.<br />
Results in Table 1 suggest that AFVs achieve state-of-the-art performance among unsupervised pretraining methods and are comparable with supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
<br />
AFVs can also be used to measure the distance between sets of data points. The authors took advantage of this and calculated the semantic distance between classes (all data points of every class) in CIFAR-10. As shown in Figure 2, although the training was unsupervised, the semantic relations between classes are well estimated; for example, cars are similar to trucks, and dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
Since AFVs transform data from the feature space to the parameter space of the generative model, and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, the authors ran experiments examining the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They used the AFV representation to calculate distances between image patches and compared with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset, using the 2AFC and Just Noticeable Difference (JND) metrics: they trained a GAN on ImageNet and then calculated AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV along with a variety of existing benchmarks. Clearly, AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting property of AFVs is their robustness to overfitting. The dimensionality of AFVs is three orders of magnitude higher than that of existing representations, which would typically bring a higher propensity to overfit. However, AFVs still show great generalization ability, demonstrating that they indeed encode a meaningful low-dimensional subspace of the original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs is a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and the Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet, and thus have limited applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between the sets of real and generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over the set {1000, 5000, 25000, 50000}. Figure 3 shows batch-wise estimates of the Inception Score and the "Fisher Similarity". It is clear that for larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
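One plausible instantiation of this kind of set-level monitoring (the paper's exact "Fisher Similarity" normalization may differ) is to compare the mean AFVs of a validation set and a generated set, e.g. with cosine similarity:<br />

```python
import numpy as np

# Hedged sketch: set-level similarity as the cosine between mean AFVs.
# The AFV matrices below are synthetic stand-ins (rows = examples).
def set_similarity(afv_a, afv_b):
    """Cosine similarity between the mean AFVs of two sets."""
    m_a, m_b = afv_a.mean(axis=0), afv_b.mean(axis=0)
    return m_a @ m_b / (np.linalg.norm(m_a) * np.linalg.norm(m_b))

rng = np.random.default_rng(0)
base = rng.normal(size=128)
real = base + 0.1 * rng.normal(size=(32, 128))     # validation examples
close = base + 0.1 * rng.normal(size=(32, 128))    # "generated" set near the data
far = rng.normal(size=(32, 128))                   # unrelated set

# A well-behaved generator should score higher than an unrelated one.
assert set_similarity(real, close) > set_similarity(real, far)
```

Tracking such a score over training iterations gives a curve analogous to the Fisher Similarity plots in Figure 3.<br />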
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFV can only be applied if the generator approximates the EBM during the training process. To verify this, a model is trained on 64×64 ImageNet, with the default architecture modified by adding residual blocks to the discriminator and generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted in order to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. They also showed that in an EBM GAN, the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with a low error rate. This is different from regular GANs, where the discriminator is reduced to a constant function once Nash Equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that this monitoring can be used to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest-neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the defined formulas and of why this representation performs well in classification tasks. An "Analysis" section would make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the Fisher vector: Theory and practice." International Journal of Computer Vision 105.3 (2013): 222-245.</div>
training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest neighbour classification, achieving state-of-the-art among unsupervised feature representations and competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of the EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation, using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv link Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper has an excellent contribution in feature representation exploiting information theory and GANs. Although it lacked intuitive explanation of the defined formula and how this representation is performing well in classification tasks. Therefore, an "Analysis" section would help the paper to be more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning&diff=48920Adversarial Fisher Vectors for Unsupervised Representation Learning2020-12-02T21:45:37Z<p>G45sharm: </p>
<hr />
<div>== Presented by ==<br />
Sobhan Hemati<br />
<br />
== Introduction ==<br />
<br />
Generative adversarial networks (GANs) are among the most important generative models: a discriminator and a generator compete with each other to solve a minimax game. Based on the original GAN paper, when training is finished and a Nash equilibrium is reached, the discriminator is nothing but a constant function that assigns a score of 0.5 everywhere. In this setting, the discriminator is therefore nothing more than a tool for training the generator. Furthermore, the generator in a traditional GAN models the data density only implicitly, while some applications require an explicit generative model of the data. Recently, it has been shown that training an energy-based model (EBM) with a parameterized variational distribution is also a minimax game similar to the one in GANs. An advantage of this EBM view is that, unlike in the original GAN formulation, the discriminator itself is an explicit density model of the data.<br />
<br />
Building on these remarks, the authors show that an energy-based model can be trained using a minimax formulation similar to that of GANs. After training the energy-based model, they use the Fisher Score and Fisher Information (both computed from derivatives of the generative model w.r.t. its parameters) to evaluate how well the discriminator models the data distribution. More precisely, they calculate normalized Fisher Vectors and a Fisher Distance measure using the discriminator's derivatives to estimate similarities both between individual data samples and between sets of samples. They name these derived representations Adversarial Fisher Vectors (AFVs). The Fisher vector is a powerful representation that can be calculated from an EBM precisely because, in this EBM view, the discriminator itself is an explicit density model of the data. Fisher vectors also apply to set-representation problems, which are challenging in general: the Fisher kernel can be used to calculate the distance between two sets of images, which is not a trivial task. The authors find several applications and attractive characteristics of AFVs as pre-trained features, such as:<br />
<br />
* State-of-the-art performance for unsupervised feature extraction and linear classification tasks.<br />
* Using the similarity function induced by the learned density model as a perceptual metric that correlates well with human judgments.<br />
* Improved training of GANs through monitoring (AFV metrics) and stability (MCMC updates) which is a difficult task in general.<br />
* Using AFV to estimate the distance between sets which allows monitoring the training process. More precisely, the Fisher Distance between the set of validation examples and generated examples can effectively capture the existence of overfitting.<br />
<br />
== Background == <br />
===Generative Adversarial Networks===<br />
GANs are a clever way of training a generative model by framing the problem as a supervised learning problem with two sub-models: the generator model that we train to generate new examples, and the discriminator model that tries to classify examples as either real (from the domain) or fake (generated). The weights of generator and discriminator are updated by solving the following optimization problem:<br />
\begin{equation}<br />
\underset{G}{\text{max}} \ \underset{D}{\text{min}} \ E_{\mathbf{x} \sim p_{data}(\mathbf{x})}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[-\log (1-D(G(\mathbf{z})))]<br />
\tag{1}<br />
\label{1}<br />
\end{equation}<br />
<br />
where <math> p_{data}(\mathbf{x}) </math>, <math> D(\mathbf{x}) </math>, and <math> G(\mathbf{z}) </math> are the data distribution, the discriminator, and the generator, respectively. To solve the above problem, in the inner loop <math> D </math> is trained until convergence given <math> G </math>, and in the outer loop <math> G </math> is updated one step given <math> D </math>.<br />
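For intuition, the two expectations in Eq. \ref{1} can be estimated batch-wise. The sketch below is illustrative only, with hypothetical arrays standing in for discriminator outputs on real and generated batches:<br />

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Batch estimate of the minimax objective in Eq. (1).

    d_real: discriminator outputs D(x) on real examples, in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated examples, in (0, 1).
    The discriminator minimizes this value; the generator maximizes it.
    """
    return np.mean(-np.log(d_real)) + np.mean(-np.log(1.0 - d_fake))

# A discriminator that is completely fooled (all outputs 0.5) attains 2*log(2).
confused = gan_value(np.full(4, 0.5), np.full(4, 0.5))
# A near-perfect discriminator drives the value toward 0.
sharp = gan_value(np.full(4, 0.99), np.full(4, 0.01))
```

At a batch where both real and fake outputs are 0.5, the value is <math>2\log 2</math>, the equilibrium value from the original GAN analysis.<br />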
<br />
===GANs as variational training of deep EBMs===<br />
An energy-based model (EBM) is a generative model that assigns a scalar energy <math> E(\mathbf{x}) </math> to each input and defines a density by exponentiating and normalizing the negative energy, so that low-energy inputs receive high probability. Let an energy-based model define a density function <math> p_{E}(\mathbf{x}) </math> as <math> \frac{e^{-E(\mathbf{x})}}{ \int_{\mathbf{x}} e^{-E(\mathbf{x})} \,d\mathbf{x} } </math>. Then, the negative log-likelihood (NLL) of <math> p_{E}(\mathbf{x}) </math> can be written as<br />
<br />
\begin{equation}<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log \int_{\mathbf{x}} q(\mathbf{x}) \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}\,d\mathbf{x} =<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ \log E_{\mathbf{x} \sim q(\mathbf{x})}[\frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}] \geq \\<br />
E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]+ E_{\mathbf{x} \sim q(\mathbf{x})}[\log \frac{e^{-E(\mathbf{x})}} {q(\mathbf{x})}]= E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[E(\mathbf{x})]- E_{\mathbf{x} \sim q(\mathbf{x})}[E(\mathbf{x})] + H(q)<br />
\tag{2}<br />
\label{2}<br />
\end{equation}<br />
<br />
where <math> q(\mathbf{x}) </math> is an auxiliary distribution called the variational distribution and <math>H(q) </math> denotes its entropy. Here Jensen's inequality was used to obtain a variational lower bound on the NLL. This bound is tight if <math> q(\mathbf{x}) \propto e^{-E(\mathbf{x})} \ \forall \mathbf{x}, </math> which means <math> q(\mathbf{x}) = p_{E}(\mathbf{x}) </math>. In this case, if we let <math> D(\mathbf{x})= e^{-E(\mathbf{x})} </math> (equivalently, <math> E(\mathbf{x})= -\log D(\mathbf{x}) </math>) and also <math> q(\mathbf{x})= p_{G}(\mathbf{x}) </math>, Eq. \ref{2} turns into the following problem:<br />
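The Jensen step in Eq. \ref{2} bounds the log-partition term, and the bound becomes tight for the optimal <math> q </math>. This can be checked numerically on a toy discrete energy model (a sketch with assumed random energies, not the paper's setup):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=5)                 # energies E(x) on a 5-point discrete space
q = rng.random(5)
q /= q.sum()                           # an arbitrary variational distribution q(x)

log_Z = np.log(np.sum(np.exp(-E)))     # exact log partition function

# The variational lower bound from Eq. (2): -E_q[E(x)] + H(q).
bound = -np.sum(q * E) - np.sum(q * np.log(q))

# Choosing q(x) proportional to e^{-E(x)} makes the bound tight.
q_opt = np.exp(-E) / np.sum(np.exp(-E))
tight = -np.sum(q_opt * E) - np.sum(q_opt * np.log(q_opt))
```

An arbitrary <math> q </math> stays strictly below <math> \log Z </math>, while the optimal <math> q </math> recovers it exactly.<br />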
<br />
<br />
<br />
\begin{equation}<br />
\underset{D}{\text{min}} \ \underset{G}{\text{max}} \ E_{\mathbf{x} \sim p_{data(\mathbf{x})}}[-\log D(\mathbf{x})]+ E_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}[\log (D(G(\mathbf{z})))] +H(p_{G})<br />
\tag{3}<br />
\label{3}<br />
\end{equation}<br />
<br />
<br />
where the variational lower bound is first maximized w.r.t. <math> p_{G}</math>; the energy model is then updated one step to decrease the NLL under the optimal <math> p_{G}</math> (see Figure 1). [[File:Fig1.png]]<br />
<br />
Equations \ref{3} and \ref{1} are similar in the sense that both take the form of a minimax game between <math> D </math> and <math> G </math>. However, there are three major differences:<br />
<br />
*The entropy regularization term <math> H(p_{G})</math> in Eq. \ref{3} prevents the generator from collapsing (although in practice, it is difficult to come up with a differentiable approximation to the entropy term <math> H(p_{G})</math> and instead heuristic regularization methods such as Batch Normalization are used).<br />
* The order of optimizing <math> D </math> and <math> G </math> is different.<br />
* More importantly, <math> D </math> is a density model for the data distribution and <math> G </math> learns to sample from <math> D </math>.<br />
<br />
== Methodology==<br />
===Adversarial Fisher Vectors===<br />
As mentioned, one of the most important advantages of an EBM GAN compared with a traditional one is that the discriminator is a dual form of the generator: the discriminator defines a distribution that matches the training data. There is generally a straightforward way to evaluate the quality of the generator by inspecting the quality of the produced samples. For the discriminator, however, it is not clear how to evaluate or use a model trained in a minimax scheme. To evaluate and also employ the discriminator of the GAN, the authors propose to use the theory of Fisher Information, which was originally motivated by making connections between the two types of models used in machine learning, i.e., generative and discriminative models. Given a density model <math> p_{\theta}(\mathbf{x})</math>, where <math> \mathbf{x} \in R^d </math> is the input and <math> \theta </math> are the model parameters, the Fisher Score of an example <math> \mathbf{x} </math> is defined as <math> U_\mathbf{x}=\nabla_{\theta} \log p_{\theta}(\mathbf{x}) </math>. This gradient maps an example <math> \mathbf{x} </math> to a point in the gradient space of the model manifold. Intuitively, <math> U_\mathbf{x} </math> gives the direction of steepest ascent in <math> \log p_{\theta}(\mathbf{x}) </math> for the example <math> \mathbf{x} </math> along the manifold; in other words, the Fisher Score encodes the desired change of model parameters to better fit the example <math> \mathbf{x} </math>. The Fisher Information is defined as <math> I=E_{\mathbf{x} \sim p_{\theta}(\mathbf{x})} [U_\mathbf{x} U_\mathbf{x}^T]</math>. Having the Fisher Information and Fisher Score, one can map an example <math> \mathbf{x} </math> from feature space to the model parameter space, and measure the proximity between two examples <math> \mathbf{x} </math> and <math> \mathbf{y} </math> by <math> U_\mathbf{x}^T I^{-1} U_\mathbf{y}</math>. The metric distance based on this proximity, <math> (U_\mathbf{x}-U_\mathbf{y})^T I^{-1} (U_\mathbf{x}-U_\mathbf{y})</math>, is called the Fisher Distance and can easily be generalized to measure the distance between two sets. Finally, the Adversarial Fisher Vector (AFV) is defined as<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=I^{-\frac{1}{2}}U_\mathbf{x}<br />
\end{equation}<br />
<br />
As a result, the Fisher Distance is equivalent to the Euclidean distance between AFVs. Fisher vector theory has traditionally been applied to simple generative models such as Gaussian mixture models (GMMs).<br />
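This equivalence — the Fisher Distance between two examples equals the squared Euclidean distance between their AFVs — can be verified numerically with toy vectors and a synthetic positive-definite <math> I </math> (a sketch; real Fisher Scores would come from a trained model):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
U_x, U_y = rng.normal(size=d), rng.normal(size=d)  # toy Fisher Scores

A = rng.normal(size=(d, d))
I_mat = A @ A.T + d * np.eye(d)   # synthetic positive-definite Fisher Information

# Fisher Distance: (U_x - U_y)^T I^{-1} (U_x - U_y)
diff = U_x - U_y
fisher_dist = diff @ np.linalg.solve(I_mat, diff)

# AFVs V = I^{-1/2} U via the symmetric matrix square root of I^{-1}.
w, Q = np.linalg.eigh(I_mat)
I_inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T
V_x, V_y = I_inv_sqrt @ U_x, I_inv_sqrt @ U_y
euclidean_sq = np.sum((V_x - V_y) ** 2)
```

The two quantities agree up to floating-point error, since <math> I^{-1/2} </math> is symmetric and <math> I^{-1/2} I^{-1/2} = I^{-1} </math>.<br />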
In the domain of EBMs, where the density model is parameterized as <math> p_\theta(\mathbf{x})= \frac{e^{D(\mathbf{x};\theta)}}{\int_{\mathbf{x}} e^{D(\mathbf{x};\theta)} \,d\mathbf{x}} </math> (with <math> D </math> playing the role of a negative energy) and <math> \theta </math> are the parameters of <math> D</math>, the Fisher Score is derived as<br />
<br />
<br />
<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - \nabla_{\theta} \log \int_{\mathbf{x}} e^{D(\mathbf{x},\theta)} \,d\mathbf{x}= \nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{x} \sim p_\theta(\mathbf{x})} \nabla_{\theta} D(\mathbf{x};\theta).<br />
\tag{4}<br />
\label{4}<br />
\end{equation}<br />
As we know, in an EBM GAN the generator is updated during training to match the distribution <math> p_G(\mathbf{x}) </math> to <math> p_\theta(\mathbf{x})</math>. This allows us to approximate the second term in Eq. \ref{4} by sampling from the generator's distribution, which lets us compute the Fisher Score and Fisher Information of an EBM GAN as follows:<br />
<br />
\begin{equation}<br />
U_\mathbf{x}=\nabla_{\theta} D(\mathbf{x};\theta) - E_{\mathbf{z} \sim p(\mathbf{z})} \nabla_{\theta} D(G(\mathbf{z});\theta), \quad I= E_{\mathbf{z} \sim p(\mathbf{z})}[U_{G(\mathbf{z})} U^T_{G(\mathbf{z})}]<br />
\tag{5}<br />
\label{5}<br />
\end{equation}<br />
<br />
Finally, having Fisher Score and Fisher Information, we use the following approximation to calculate AFV:<br />
<br />
<br />
\begin{equation}<br />
V_\mathbf{x}=\mbox{diag}(I)^{-\frac{1}{2}}U_\mathbf{x}<br />
\tag{6}<br />
\label{6}<br />
\end{equation}<br />
<br />
Remember that the Fisher Score transforms data from feature space to parameter space, which means that the dimensionality of the vectors can easily reach millions. Replacing <math> I </math> with <math>\mbox{diag}(I) </math> is therefore an approximation that reduces the computational load of calculating the final AFV.<br />
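Putting Eqs. \ref{5} and \ref{6} together, AFV extraction can be sketched as follows, assuming per-parameter gradients of <math> D </math> are available as flat arrays (hypothetical inputs; a real implementation would obtain them by backpropagation through the discriminator):<br />

```python
import numpy as np

def afv(grad_real, grads_gen, eps=1e-8):
    """AFV of one example with the diagonal Fisher approximation of Eq. (6).

    grad_real: gradient of D(x; theta) w.r.t. theta for the example, shape (p,).
    grads_gen: gradients of D(G(z); theta) for a batch of generator samples,
               shape (n, p), giving Monte-Carlo estimates of the expectations
               in Eq. (5).
    """
    mean_gen = grads_gen.mean(axis=0)
    U = grad_real - mean_gen                 # Fisher Score, Eq. (5)
    U_gen = grads_gen - mean_gen             # Fisher Scores of generated samples
    diag_I = np.mean(U_gen ** 2, axis=0)     # diagonal of the Fisher Information
    return U / np.sqrt(diag_I + eps)         # V = diag(I)^{-1/2} U, Eq. (6)

rng = np.random.default_rng(2)
v = afv(rng.normal(size=100), rng.normal(size=(64, 100)))
```

Note that the normalization makes the AFV invariant to a global rescaling of the gradients, since both <math> U_\mathbf{x} </math> and <math> \sqrt{\mbox{diag}(I)} </math> scale by the same factor.<br />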
<br />
===Generator update as stochastic gradient MCMC===<br />
The use of a generator provides an efficient way of drawing samples from the EBM. In practice, however, great care must be taken to ensure that G is well conditioned and produces examples that cover enough modes of D. A related issue is that the parameters of G will occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.<br />
<br />
In light of these issues, the authors provide a different treatment of G, borrowing inspiration from the Markov chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they are used to sample from an unnormalized density and to approximate the partition function. Stochastic gradient MCMC is of particular interest, as it utilizes the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients); this technique has recently been applied to deep EBMs. The authors speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G approximate the true model distribution.<br />
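The stochastic gradient MCMC update that G is meant to mimic can be sketched as a Langevin step on a toy target (an illustration of the update rule, not the paper's training procedure):<br />

```python
import numpy as np

def langevin_step(x, grad_log_p, step, rng):
    """One stochastic-gradient MCMC (Langevin) update: take a small ascent
    step along the gradient of log p and inject Gaussian noise."""
    noise = rng.normal(size=x.shape)
    return x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * noise

# Toy check: sample a standard Gaussian, where grad log p(x) = -x.
rng = np.random.default_rng(0)
x = rng.normal(size=2000) * 5.0        # badly scaled initial samples
for _ in range(500):
    x = langevin_step(x, lambda v: -v, step=0.1, rng=rng)
# After many steps the chain approaches zero mean and unit variance.
```

With a finite step size the stationary distribution is only approximately the target, which is why the step is kept small in practice.<br />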
<br />
== Related Work ==<br />
Many variants of the GAN framework use a discriminator as a critic to differentiate between distributions; examples include Wasserstein GAN, f-GAN, and MMD-GAN. There is a resemblance between the training procedure of a GAN and that of a deep EBM (with variational inference), but the work presented in this paper is different in that its discriminator directly learns the target distribution, and its EBM implementation directly learns a parameterized sampler. In some works, regularization (noise addition, gradient penalties, spectral normalization) has been introduced to make GAN training more stable, but these additions lack formal justification. This paper connects the MCMC-based G update rule with the gradient-penalty line of work.<br />
<br />
== Experiments ==<br />
===Evaluating AFV representations===<br />
As pointed out earlier, the main advantage of EBM GANs is their powerful discriminator, which can learn a density function that characterizes the data manifold of the training data. To evaluate how well the discriminator learns the data distribution, the authors used Fisher Information theory: they trained models under different settings, employed the discriminator to extract AFVs, and then used these vectors for unsupervised-pretraining classification tasks.<br />
Results in Table 1 suggest that AFVs achieve state-of-the-art performance among unsupervised pretraining methods on classification tasks and are comparable with supervised learning.<br />
<br />
[[File:Table1.png||center]]<br />
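The evaluation protocol — freeze the features and fit a simple classifier on top — can be sketched with synthetic features standing in for real AFVs (a nearest-centroid classifier is used here for brevity; the paper evaluates stronger linear classifiers):<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins for AFVs: two classes separated along one feature dimension.
n, p = 100, 20
feats = rng.normal(size=(2 * n, p))
feats[:n, 0] += 3.0
labels = np.array([0] * n + [1] * n)

train, test = slice(0, None, 2), slice(1, None, 2)  # even/odd index split

# Freeze the features and fit a nearest-centroid classifier on top.
centroids = np.stack([feats[train][labels[train] == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(feats[test][:, None, :] - centroids[None, :, :], axis=2)
preds = dists.argmin(axis=1)
accuracy = float(np.mean(preds == labels[test]))
```

The point of the protocol is that all class information must already be present in the frozen features; the classifier on top is deliberately weak.<br />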
<br />
AFVs can also be used to measure the distance between sets of data points. The authors take advantage of this to calculate the semantic distance between classes (all data points of each class) in CIFAR-10. As shown in Figure 2, although the training was unsupervised, the semantic relations between classes are well estimated: for example, cars are similar to trucks, and dogs are similar to cats.<br />
<br />
[[File:Sobhan_Fig2.jpg||center]]<br />
<br />
<br />
As AFVs transform data from feature space to the parameter space of the generative model, and as a result carry information about the data manifold, they are also expected to carry additional fine-grained perceptual information. To evaluate this, the authors ran experiments examining the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. They used the AFV representation to calculate distances between image patches and compared with current methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset using the 2AFC and Just Noticeable Difference (JND) metrics, training a GAN on ImageNet and then calculating AFVs on the BAPPS evaluation set.<br />
Table 2 shows the performance of AFV alongside a variety of existing benchmarks. AFV clearly exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet.<br />
<br />
[[File:Sobhan_Table2.png||center]]<br />
<br />
An interesting point about AFVs is their robustness to overfitting. The dimensionality of AFVs is three orders of magnitude higher than that of existing methods, which would typically bring a higher propensity to overfit. Nevertheless, AFVs still show strong generalization, demonstrating that they indeed encode a meaningful low-dimensional subspace of the original data. Figure 6 visualizes the nearest neighbours.<br />
<br />
[[File:Sobhan_Fig_6.png||center]]<br />
<br />
===Using the Fisher Distance to monitor training===<br />
Training GANs has been a challenging task, partly because of the lack of reliable metrics. Although some domain-specific metrics such as the Inception Score and the Fréchet Inception Distance have recently been proposed, they mainly rely on a discriminative model trained on ImageNet and thus have limited applicability to datasets that are drastically different. In this paper, the authors use the Fisher Distance between sets of real and generated examples to monitor and diagnose the training process. To do this, they conducted a set of experiments on CIFAR-10, varying the number of training examples over {1000, 5000, 25000, 50000}. Figure 3 shows batch-wise estimates of the Inception Score and the "Fisher Similarity". For larger numbers of training examples, the validation Fisher Similarity steadily increases, following a trend similar to the Inception Score; when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point, indicating overfitting.<br />
<br />
[[File:Sobhan_Fig_3.png||center]]<br />
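The monitoring signal can be sketched as follows, with synthetic arrays standing in for the AFVs of validation and generated sets; the set-level distance used here is one simple choice (distance between mean AFVs), not necessarily the authors' exact definition:<br />

```python
import numpy as np

def set_distance(V_a, V_b):
    """Distance between two sets of examples: Euclidean distance between the
    mean AFVs of each set (a simple set-level extension of the Fisher Distance)."""
    return float(np.linalg.norm(V_a.mean(axis=0) - V_b.mean(axis=0)))

rng = np.random.default_rng(4)
afv_val = rng.normal(size=(200, 50))              # AFVs of validation examples
afv_gen_good = rng.normal(size=(200, 50))         # generator matching the data
afv_gen_drift = rng.normal(size=(200, 50)) + 0.5  # generator drifting off-manifold

d_good = set_distance(afv_val, afv_gen_good)
d_drift = set_distance(afv_val, afv_gen_drift)    # larger distance flags a problem
```

Tracking this quantity against a held-out validation set during training is what allows overfitting to be detected: the distance stops shrinking and starts to grow.<br />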
<br />
<br />
===Interpreting G update as parameterized MCMC===<br />
AFVs can only be applied if the generator approximates the EBM during the training process. The model is trained on ImageNet at 64×64 resolution, with the default architecture modified by adding residual blocks to the discriminator and generator. The following figure shows training statistics over 80,000 iterations.<br />
<br />
[[File:training 80K.png|600px|center]]<br />
<div align="center">Left: default generator objective. Right: corresponding Inception scores.</div><br />
<br />
== Conclusion ==<br />
In this paper, the authors demonstrated that GANs can be reinterpreted to learn representations across a diverse set of tasks without requiring domain knowledge or annotated data. They also showed that in an EBM GAN the discriminator can explicitly learn the data distribution and capture the intrinsic manifold of the data with a low error rate. This is markedly different from regular GANs, where the discriminator reduces to a constant function once the Nash equilibrium is reached. To evaluate how well the discriminator estimates the data distribution, the authors took advantage of Fisher Information theory. First, they showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that this monitoring can be used to select good model checkpoints. Second, they illustrated that AFVs are a useful feature representation for linear and nearest-neighbour classification, achieving state-of-the-art results among unsupervised feature representations and remaining competitive with supervised results on CIFAR-10. <br />
Finally, they showed that a well-trained GAN discriminator contains useful information for fine-grained perceptual similarity, suggesting that AFVs are good candidates for image search. All in all, the conducted experiments show the effectiveness of EBM GANs coupled with the Fisher Information framework for extracting useful representational features from GANs. <br />
As future work, the authors propose to improve the scalability of the AFV method by compressing the Fisher Vector representation using methods like product quantization.<br />
<br />
== Source Code == <br />
The code for this paper is freely available at [https://github.com/apple/ml-afv Adversarial Fisher Vectors].<br />
<br />
== Critique == <br />
<br />
This paper makes an excellent contribution to feature representation by exploiting information theory and GANs. However, it lacks an intuitive explanation of the defined formulas and of why this representation performs so well in classification tasks. An "Analysis" section would therefore make the paper more readable and understandable.<br />
<br />
== References==<br />
<br />
Jaakkola, Tommi, and David Haussler. "Exploiting generative models in discriminative classifiers." Advances in neural information processing systems. 1999.<br />
<br />
Zhai, Shuangfei, et al. "Adversarial Fisher Vectors for Unsupervised Representation Learning." Advances in Neural Information Processing Systems. 2019.<br />
<br />
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." 2007 IEEE conference on computer vision and pattern recognition. IEEE, 2007.<br />
<br />
Sánchez, Jorge, et al. "Image classification with the fisher vector: Theory and practice." International journal of computer vision 105.3 (2013): 222-245.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=48818stat940F212020-12-02T04:14:36Z<p>G45sharm: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Genralization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks Summary] || [https://youtu.be/SENjFF4N45s video] or Learn<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]|| [https://youtu.be/HkkacHrvloE YouTube]<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] || [https://www.youtube.com/watch?v=IlIPHclzV5E&ab_channel=sinaebrahimifarsangi YouTube] or Learn<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=48812stat940F212020-12-02T03:58:46Z<p>G45sharm: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB Paper] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| BREAKING CERTIFIED DEFENSES: SEMANTIC ADVERSARIAL EXAMPLES WITH SPOOFED ROBUSTNESS CERTIFICATES || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [[https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation ]]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [[https://youtu.be/5h-365TPQqE Presentation ]]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23|| Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks Summary] || [https://youtu.be/4Ud_0NlRzOs video] or Learn<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]|| [https://youtu.be/HkkacHrvloE YouTube]<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] || [https://www.youtube.com/watch?v=IlIPHclzV5E&ab_channel=sinaebrahimifarsangi YouTube] or Learn<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=48810Time-series Generative Adversarial Networks2020-12-02T03:52:45Z<p>G45sharm: </p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A good time-series model should not only learn the overall distribution of temporal features at each time point, but should also capture the dynamic relationships between those variables across time.<br />
<br />
The popular autoregressive approach to time-series and sequence analysis generally focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data <sup>[1]</sup>. In this approach, the distribution of sequences is factored into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting, but it is not very promising in a generative setting. A GAN applied directly to time-series simply tries to learn <math>p(X|t)</math> with a generator/discriminator setup, but this fails to leverage the prior probabilities the way autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive models), allowing a generative model to preserve temporal dynamics while learning the overall distribution. This mechanism is termed the '''Time-series Generative Adversarial Network''', or '''TimeGAN'''. To incorporate supervised learning into the GAN architecture, the approach uses an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of the paper is that the embedding network is trained jointly with the generator/discriminator networks.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model, resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to the difference between closed-loop training (conditioned on ground truth) and open-loop inference (conditioned on previous guesses), multi-step sampling in autoregressive recurrent networks can suffer from significant prediction error <sup>[2]</sup>. Several remedies have been proposed, including Scheduled Sampling <sup>[1]</sup>, where models are trained on a combination of ground truth and their own previous outputs; training an auxiliary discriminator that helps separate free-running from teacher-forced hidden states, accelerating convergence <sup>[3][4]</sup>; and actor-critic methods <sup>[5]</sup>, which condition on target outputs, estimating a next-token value that nudges the actor's free-running predictions. While all of these methods improve multi-step sampling, they remain inherently deterministic.<br />
<br />
Direct applications of the GAN architecture to time-series data, such as C-RNN-GAN and RCGAN <sup>[6]</sup>, generate the data recurrently, sometimes taking the generated output from the previous step as input (as in RCGAN) along with the noise vector. Recently, conditioning on time-stamp information has also been proposed in these setups to handle inconsistent sampling. But these approaches remain GAN-centric and rely only on the traditional adversarial (real/fake) feedback to learn, which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long, or entire, stretches of time) and temporal features (variables that change with time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, an input to the model is a tuple <math>(S, X_{1:T})</math> with some joint distribution <math>p</math>. The objective of a generative model is to learn, from training data, an approximation <math>\hat{p}(S, X)</math> of the original distribution <math>p(S, X)</math>. Along with this joint distribution, a simultaneous objective is to learn the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the standard GAN components of a sequence generator and a sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all four components are trained concurrently. Figure 1 shows how these components are arranged and how information flows between them during training.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representations. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of the features <math>S</math> and <math>X</math> from the original space. The embedding function then has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions are implemented using a recurrent network for '''e''' and a feedforward network for '''r'''. These implementation choices are, of course, open to parametrization by any suitable architecture. <br />
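As a concrete illustration, here is a minimal pure-Python sketch of such an embedding/recovery pair: a single-layer recurrent cell for '''e''' and a linear feedforward map for '''r'''. The dimensions, random untrained weights, and single-layer cells are illustrative assumptions, not the paper's actual implementation (which trains these networks jointly with the GAN components).<br />

```python
import math
import random

random.seed(0)

# Illustrative sizes only: FEAT-dimensional features, LATENT-dimensional codes.
FEAT, LATENT = 5, 3

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Embedding function e: a simple recurrent net, h_t = tanh(W_in x_t + W_rec h_{t-1}).
W_in, W_rec = rand_matrix(LATENT, FEAT), rand_matrix(LATENT, LATENT)

def embed(x_seq):
    h, h_seq = [0.0] * LATENT, []
    for x_t in x_seq:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, x_t), matvec(W_rec, h))]
        h_seq.append(h)
    return h_seq

# Recovery function r: a feedforward (here linear) map back to feature space.
W_out = rand_matrix(FEAT, LATENT)

def recover(h_seq):
    return [matvec(W_out, h_t) for h_t in h_seq]

x = [[random.gauss(0, 1) for _ in range(FEAT)] for _ in range(24)]  # T = 24 steps
h = embed(x)            # latent sequence h_{1:T}
x_tilde = recover(h)    # reconstruction of x_{1:T}
```

Training would adjust these weights so that <math>\tilde{x}</math> matches <math>x</math>, making the mapping approximately reversible.<br />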
<br />
=== Sequence Generator and Discriminator ===<br />
Turning to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. However, these do not operate in the original feature space; rather, the generator uses random input noise to generate sequences in the latent space. Thus, the generator takes the noise vectors <math>Z_s</math> and <math>Z_x</math> as input and turns them into latent representations <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes latent representations from the embedding space as input and produces a binary classification (synthetic/real). It is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate, reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(s, x_{1:t})</math> into the embedded representation <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data from this latent representation as accurately as possible. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function, the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component of TimeGAN not only receives the noise vector <math>Z</math> as input but, in autoregressive fashion, also receives its previous outputs <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce synthetic embeddings. The unsupervised gradients are computed so as to decrease the classification likelihood at the generator and increase it at the discriminator, pushing the discriminator toward correctly classifying the synthetic output. This is the second objective function, the unsupervised loss.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary (real/fake) feedback from the GAN's adversarial component, the discriminator. It also incorporates a supervised loss through the embedding and recovery functions. To ensure that the two halves of TimeGAN interact, the generator is alternately fed embeddings of actual data instead of its own previously generated synthetic embeddings. Maximizing the likelihood of these produces the third objective, the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
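Schematically, the three objectives combine as sketched below. The numbers are made-up toy values standing in for network outputs, purely to show how each term is computed; the weighting and the exact form of the supervised term are simplified relative to the paper.<br />

```python
import math

# Toy values (hypothetical): h = embeddings of real data, h_hat = generator
# output, x / x_tilde = original and reconstructed features.
h       = [0.2, 0.5, -0.1, 0.4]
h_hat   = [0.3, 0.4,  0.0, 0.5]
x       = [1.0, 2.0, 1.5]
x_tilde = [1.1, 1.9, 1.6]

# 1. Reconstruction loss: how well recovery(embedding(x)) matches x.
loss_r = sum((a - b) ** 2 for a, b in zip(x, x_tilde)) / len(x)

# 2. Unsupervised (adversarial) loss: binary cross-entropy on the
#    discriminator's outputs for real vs. synthetic embeddings.
d_real, d_fake = 0.8, 0.3          # hypothetical discriminator probabilities
loss_u = -(math.log(d_real) + math.log(1.0 - d_fake))

# 3. Supervised loss: one-step-ahead error in latent space, with the
#    generator conditioned on embeddings of actual data.
loss_s = sum((a - b) ** 2 for a, b in zip(h[1:], h_hat[1:])) / (len(h) - 1)

total = loss_r + loss_u + loss_s
```

In TimeGAN these three losses are minimized jointly, which is what couples the embedding network to the adversarial training.<br />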
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most closely related applications of traditional GANs to time-series, RCGAN and C-RNN-GAN. To compare against autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are made with WaveNet <sup>[7]</sup> and its GAN counterpart WaveGAN <sup>[8]</sup>. Qualitatively, the generated data is examined in terms of diversity (samples should cover the distribution of the real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should serve the same predictive purposes as real data). <br />
<br />
The following methods are used for benchmarking and evaluation.<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
<br />
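For instance, once the post-hoc classifier has been trained, the discriminative score reduces to a simple computation: the classifier's held-out accuracy, re-centered so that 0 means real and synthetic are indistinguishable. The labels and predictions below are hypothetical stand-ins, since the classifier itself is omitted.<br />

```python
# Discriminative score sketch: 1 = real sequence, 0 = synthetic sequence.
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical held-out ground truth
preds  = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical classifier predictions

accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
disc_score = abs(accuracy - 0.5)    # 0 is best: classifier is at chance level
```
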
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian model defined as <math>x_t = \phi x_{t-1} + n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math>, so that <math>\phi</math> controls the temporal correlation and <math>\sigma</math> the correlation across dimensions. Table 1 shows the results of this experiment for the different models. The results clearly show that TimeGAN outperforms the other methods in terms of both discriminative and predictive scores. <br />
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
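This synthetic benchmark is straightforward to reproduce. The sketch below draws the correlated noise via a shared-factor construction (variance <math>\sigma</math> from a common factor, <math>1-\sigma</math> independent), which yields exactly the covariance <math>\sigma \mathbf{1} + (1-\sigma)I</math>; the particular values of <math>\phi</math>, <math>\sigma</math>, and the dimensions are arbitrary choices for illustration.<br />

```python
import math
import random

random.seed(1)

phi, sigma, DIM, T = 0.8, 0.5, 3, 100   # illustrative parameter choices

def correlated_noise():
    # x_i = sqrt(sigma)*z + sqrt(1-sigma)*e_i gives Var(x_i) = 1 and
    # Cov(x_i, x_j) = sigma, i.e. covariance sigma*1 + (1-sigma)*I.
    shared = random.gauss(0, 1)
    return [math.sqrt(sigma) * shared + math.sqrt(1 - sigma) * random.gauss(0, 1)
            for _ in range(DIM)]

x = [[0.0] * DIM]                        # start from the zero state
for _ in range(T):
    n = correlated_noise()
    x.append([phi * xi + ni for xi, ni in zip(x[-1], n)])
x = x[1:]                                # drop the initial zero state
```
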
<br />
Next, the paper experiments with different types of time-series data. Using sequences with varying properties, the paper evaluates the performance of TimeGAN to test its ability to generalize across time-series data. The paper runs the different methods on datasets such as Sines, Stocks, Energy, and Events and compares their performance. Figure 2 shows the t-SNE/PCA visualization comparison for Sines and Stocks, and it is clear from the figure that, among the different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for the different methods across the datasets. TimeGAN outperforms the other methods on both scores, indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Conclusion ==<br />
Combining the flexibility of GANs with the autoregressive models' control over conditional temporal dynamics, TimeGAN shows significant quantitative and qualitative gains on generated time-series data across a variety of datasets. <br />
<br />
The authors indicate the potential future incorporation of differential-privacy frameworks into TimeGAN in order to produce realistic time sequences with differential privacy guarantees.<br />
<br />
== References ==<br />
<br />
[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.<br />
<br />
[2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.<br />
<br />
[3] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609, 2016.<br />
<br />
[4] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.<br />
<br />
[5] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000.<br />
<br />
[6] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.<br />
<br />
[7] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, 125, 2016.<br />
<br />
[8] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.</div>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling improving the temporal dynamics of data. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but it is not very promising in a generative setup. The GAN approach when applied on time-series directly simply tries to learn <math>p(X|t)</math> using generator and discriminator setup but this fails to leverage the prior probabilities like in the case of the autoregressive case.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (ground truth conditioned) and open-loop inference (the previous guess conditioned), there can be significant prediction error in multi-step sampling in autoregressive recurrent networks. Different methods have been proposed to remedy this including Scheduled Sampling where models are trained to output based on a combination of ground truth and previous outputs, training and an auxiliary discriminator that helps separate free-running and teacher-forced hidden states accelerating convergence, and Actor-critic methods that condition on target outputs estimating the next-token value that nudges the actor’s free-running predictions. While all these proposed methods try to improve step-sampling, they are still inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change frequently with time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. Using this setting, input to the model can be thought of as a tuple of <math>(S, X_{1:t})</math> that has a joint distribution say <math>p</math>. The objective of a generative model is of course to learn from training data, an approximation of the original distribution <math>p(S, X)</math> i.e. <math>\hat{p}(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition of <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math> as well. This gives the following two objective functions.<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>min_\hat{p}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the normal GAN components of sequence generator and sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how does information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representation. This mapping reduces the dimensionality of the original feature space. Let Hs and Hx denote the latent representations of S and X features in the original space. Therefore, the embedding function has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, the embedding function <math>e</math> is implemented using a recurrent network and the recovery function <math>r</math> using a feedforward network. These are implementation choices; in principle, either function can be parameterized by any suitable architecture. <br />
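As a concrete (untrained) illustration of these two components, the sketch below implements an embedding function as a simple recurrent network and a recovery function as a linear feedforward map. All dimensions, weight initializations, and names are hypothetical; the paper prescribes only the recurrent/feedforward split, not these exact details.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hid_dim, T = 5, 8, 10  # hypothetical sizes

# Embedding function e: a minimal RNN mapping X_{1:T} -> H_{1:T}.
Wx = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
Wh = rng.normal(scale=0.1, size=(hid_dim, hid_dim))

def embed(X):
    """X: (T, feat_dim) temporal features -> (T, hid_dim) latent codes."""
    h = np.zeros(hid_dim)
    H = []
    for x_t in X:
        h = np.tanh(Wx @ x_t + Wh @ h)  # recurrent state update
        H.append(h)
    return np.stack(H)

# Recovery function r: a linear feedforward map H_{1:T} -> reconstructed X.
Wr = rng.normal(scale=0.1, size=(feat_dim, hid_dim))

def recover(H):
    return H @ Wr.T

X = rng.normal(size=(T, feat_dim))
H = embed(X)
X_tilde = recover(H)
print(H.shape, X_tilde.shape)  # (10, 8) (10, 5)
```

In TimeGAN these two maps would be trained jointly with the GAN components so that `recover(embed(X))` approximates `X`.<br />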
<br />
=== Sequence Generator and Discriminator ===<br />
As for the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator, but these do not operate in the original space. Rather, the sequence generator uses random input noise to generate sequences in the latent space: it takes as input the noise vectors <math>Z_s</math> and <math>Z_x</math> and turns them into latent representations <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(S, X_{1:t})</math> into its latent representation <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data from this representation as accurately as possible. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective, the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
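A minimal numeric sketch of a reconstruction loss of this general form, computed for a single sample (the exact norms and weighting in the paper's loss may differ; the names here are illustrative):<br />

```python
import numpy as np

def reconstruction_loss(s, s_tilde, X, X_tilde):
    """||s - s~||_2 + sum_t ||x_t - x~_t||_2 for one sample; the full loss
    takes an expectation of a term of this general shape over the data."""
    static_term = np.linalg.norm(s - s_tilde)
    temporal_term = sum(np.linalg.norm(x_t - xr_t) for x_t, xr_t in zip(X, X_tilde))
    return float(static_term + temporal_term)

s = np.array([1.0, 2.0])
X = np.array([[0.5, 0.5], [1.0, 0.0]])
print(reconstruction_loss(s, s, X, X))  # 0.0 for a perfect reconstruction
```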
<br />
The generator component in TimeGAN receives not only the noise vector <math>Z</math> as input but also, in autoregressive fashion, its own previous outputs <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce the synthetic embeddings. The gradients of the adversarial loss are used to update the generator so that its synthetic output is more likely to fool the discriminator, and to update the discriminator so that it is more likely to classify the synthetic output correctly. This is the second objective function, the unsupervised loss:<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
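This unsupervised objective has the shape of a standard GAN loss over the discriminator's per-step outputs. A hedged sketch using the usual binary cross-entropy formulation (the names and this exact form are assumptions, not the paper's notation):<br />

```python
import numpy as np

def adversarial_losses(y_real, y_fake, eps=1e-12):
    """Standard GAN losses. y_real / y_fake are the discriminator's
    probabilities of 'real' on real and synthetic latent sequences."""
    y_real, y_fake = np.asarray(y_real), np.asarray(y_fake)
    # Discriminator wants y_real -> 1 and y_fake -> 0.
    d_loss = -np.mean(np.log(y_real + eps)) - np.mean(np.log(1 - y_fake + eps))
    # Generator wants the discriminator fooled: y_fake -> 1.
    g_loss = -np.mean(np.log(y_fake + eps))
    return d_loss, g_loss

d, g = adversarial_losses([0.9, 0.8], [0.1, 0.2])
print(d > 0, g > 0)  # True True
```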
<br />
As mentioned before, TimeGAN does not rely only on the binary feedback from the GAN's adversarial component, i.e. the discriminator. It also incorporates a supervised loss computed via the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previously produced synthetic embeddings. Maximizing the likelihood of the actual next-step embeddings under the generator produces the third objective, the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div><br />
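The supervised loss can be sketched as the discrepancy between each actual next-step embedding and the generator's one-step-ahead prediction from the true previous embedding. The helper <code>g_step</code> below is a hypothetical one-step generator, not the paper's network:<br />

```python
import numpy as np

def supervised_loss(H, g_step):
    """Sum over t of ||h_t - g_step(h_{t-1})||_2: how far the generator's
    one-step-ahead prediction is from the actual next embedding."""
    return float(sum(np.linalg.norm(H[t] - g_step(H[t - 1]))
                     for t in range(1, len(H))))

H = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # toy embedding trajectory
perfect_g = lambda h_prev: h_prev + 1.0  # hypothetical generator matching the dynamics
print(supervised_loss(H, perfect_g))  # 0.0: the generator reproduces the transitions
```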
<br />
== Experiments ==<br />
In the paper, the authors compare TimeGAN with the two most familiar and closely related GAN variants for time-series, RCGAN and C-RNN-GAN. For comparison with autoregressive approaches, the authors use RNNs trained with T-Forcing and P-Forcing. Additionally, performance comparisons are made with WaveNet and its GAN alternative WaveGAN. Qualitatively, the generated data is examined in terms of diversity (samples should cover the distribution of the real data), fidelity (samples should be indistinguishable from real data), and usefulness (samples should serve the same predictive purposes as real data). <br />
<br />
The following methods are used for benchmarking and evaluation.<br />
<br />
# '''Visualization''': This involves the application of t-SNE and PCA analysis on data (real and synthetic). This is done to compare the distribution of generated data with the real data in 2-D space.<br />
# '''Discriminative Score''': This involves training a post-hoc time-series classification model (an off-the-shelf RNN) to differentiate sequences from generated and original sets. <br />
# '''Predictive Score''': This involves training a post-hoc sequence prediction model to forecast using the generated data and this is evaluated against the real data.<br />
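Once the post-hoc models are trained, both quantitative scores reduce to simple computations. A sketch (the post-hoc classifier and forecaster themselves are omitted; the exact formulations below, |accuracy − 0.5| and mean absolute error, follow the paper's setup, but the helper names are illustrative):<br />

```python
import numpy as np

def discriminative_score(y_true, y_pred):
    """|accuracy - 0.5| of a post-hoc real-vs-synthetic classifier.
    0 is ideal: the classifier cannot beat chance."""
    acc = np.mean(np.asarray(y_true) == np.asarray(y_pred))
    return float(abs(acc - 0.5))

def predictive_score(targets, forecasts):
    """Mean absolute error of a post-hoc forecaster trained on synthetic
    data and evaluated on real data (lower is better)."""
    return float(np.mean(np.abs(np.asarray(targets) - np.asarray(forecasts))))

print(discriminative_score([1, 0, 1, 0], [0, 1, 1, 0]))  # 0.0: chance-level classifier
print(predictive_score([1.0, 2.0], [1.5, 2.5]))          # 0.5
```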
<br />
In the first experiment, the authors used time-series sequences from an autoregressive multivariate Gaussian model defined as <math>x_t=\phi x_{t-1}+n</math>, where <math>n \sim N(0, \sigma \mathbf{1} + (1-\sigma)I)</math> and <math>\mathbf{1}</math> denotes the all-ones matrix, so <math>\phi</math> controls the temporal correlation and <math>\sigma</math> the correlation across features. Table 1 reports the results of this experiment for the different models. The results clearly show that TimeGAN outperforms the other methods in terms of both discriminative and predictive scores. <br />
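This autoregressive Gaussian data is straightforward to simulate. A sketch (the parameter values here are illustrative, not the paper's grid):<br />

```python
import numpy as np

def generate_ar_gaussian(T, dim, phi, sigma, rng):
    """x_t = phi * x_{t-1} + n, with n ~ N(0, sigma*ones + (1-sigma)*I):
    phi sets temporal correlation, sigma correlation across dimensions."""
    cov = sigma * np.ones((dim, dim)) + (1 - sigma) * np.eye(dim)
    X = np.zeros((T, dim))
    for t in range(1, T):
        X[t] = phi * X[t - 1] + rng.multivariate_normal(np.zeros(dim), cov)
    return X

X = generate_ar_gaussian(T=100, dim=5, phi=0.8, sigma=0.8,
                         rng=np.random.default_rng(0))
print(X.shape)  # (100, 5)
```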
<br />
<div align="center"> [[File:gtable1.PNG]] </div><br />
<div align="center">'''Table 1'''</div><br />
<br />
Next, the paper experiments on different types of time-series data. Using time-series sequences of varying properties, the paper evaluates the performance of TimeGAN to test its ability to generalize over time-series data. The datasets used are Sines, Stocks, Energy and Events, compared across the different methods. Figure 2 shows the t-SNE/PCA visualization comparison for Sines and Stocks, and it is clear from the figure that, among the different models, TimeGAN shows the best overlap between generated and original data.<br />
<br />
<div align="center"> [[File:pca.PNG]] </div><br />
<div align="center">'''Figure 2'''</div><br />
<br />
Table 2 shows a comparison of predictive and discriminative scores for the different methods across the datasets. TimeGAN outperforms the other methods on both scores, indicating a better quality of generated synthetic data across different types of datasets. <br />
<br />
<div align="center"> [[File:gtable2.PNG]] </div><br />
<div align="center">'''Table 2'''</div><br />
<br />
== Conclusion ==<br />
By combining the flexibility of GANs with the control over conditional temporal dynamics offered by autoregressive models, TimeGAN shows significant quantitative and qualitative gains for generated time-series data across a variety of datasets. <br />
<br />
The authors indicated potential incorporation of differential privacy frameworks into TimeGAN in the future in order to produce realistic time sequences with differential privacy guarantees.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT&diff=48790BERTScore: Evaluating Text Generation with BERT2020-12-02T03:00:36Z<p>G45sharm: /* Robustness Analysis */</p>
<hr />
<div>== Presented by == <br />
Gursimran Singh<br />
<br />
== Introduction == <br />
In recent times, various machine learning approaches for text generation have gained popularity. This paper aims to develop an automatic metric that judges the quality of generated text. Commonly used state-of-the-art metrics rely on either n-gram matching or word embeddings to calculate the similarity between the reference and the candidate sentence. BERTScore, in contrast, calculates the similarity using contextual embeddings. BERTScore addresses two common pitfalls of n-gram-based metrics. First, n-gram metrics fail to robustly match paraphrases, which leads to performance underestimation when semantically-correct phrases are penalized for differing from the surface form of the reference; in BERTScore, similarity is computed using contextualized token embeddings, which have been shown to be effective for paraphrase detection. Second, n-gram metrics fail to capture distant dependencies and penalize semantically-critical ordering changes; contextualized embeddings capture distant dependencies and ordering effectively. The authors carry out various experiments in machine translation and image captioning to show why BERTScore is more reliable and robust than previous approaches.<br />
<br />
''' Word versus Context Embeddings '''<br />
<br />
Both models aim to reduce the sparseness of a bag-of-words (BoW) representation of text, which arises from high-dimensional vocabularies. Both create embeddings of much lower dimensionality than sparse BoW vectors and aim to capture semantics and context. The difference is that word embeddings are deterministic: a word embedding model always produces the same vector for a given word, regardless of the surrounding words. Contextual embeddings, in contrast, produce different vectors for the same word depending on the surrounding text.<br />
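As a toy illustration of this difference (hypothetical vectors and a crude stand-in encoder, not a real BERT model):<br />

```python
import numpy as np

# Toy static "word embeddings": one fixed vector per word (hypothetical values).
rng = np.random.default_rng(0)
vocab = ["the", "bank", "river", "money", "of"]
static = {w: rng.normal(size=4) for w in vocab}

def static_embed(sentence):
    # A word embedding model ignores context: same word -> same vector.
    return [static[w] for w in sentence]

def contextual_embed(sentence):
    # Crude stand-in for a contextual encoder: each token's vector is
    # blended with the mean of its sentence, so identical words in
    # different sentences get different vectors.
    vecs = np.stack([static[w] for w in sentence])
    return [0.5 * v + 0.5 * vecs.mean(axis=0) for v in vecs]

s1 = ["bank", "of", "the", "river"]
s2 = ["bank", "of", "the", "money"]

same_static = np.allclose(static_embed(s1)[0], static_embed(s2)[0])
same_context = np.allclose(contextual_embed(s1)[0], contextual_embed(s2)[0])
print(same_static, same_context)  # True False
```

A real contextual encoder such as BERT conditions on the whole sentence through self-attention; the blending above only mimics that dependence.<br />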
<br />
== Previous Work ==<br />
Previous approaches for evaluating text generation can be broadly divided into several categories. The most common techniques are based on n-gram matching: the objective is to compare the n-grams of the reference and candidate sentences, thereby analyzing the ordering of words in the sentences. <br />
The most popular n-Gram Matching metric is BLEU. It follows the underlying principle of n-Gram matching and its uniqueness comes from three main factors. <br><br />
• Each n-Gram is matched at most once. <br><br />
• The total of exact-matches is accumulated for all reference candidate pairs and divided by the total number of <math>n</math>-grams in all candidate sentences. <br><br />
• Very short candidates are restricted. <br><br />
<br />
Further, BLEU is generally computed for multiple <math>n</math>-gram orders and the results are averaged geometrically.<br />
n-Gram approaches also include METEOR, NIST, ΔBLEU, etc.<br />
<br />
Most of these methods utilize or slightly modify the exact match precision (Exact-<math>P_n</math>) and recall (Exact-<math>R_n</math>) scores. These scores can be formalized as follows:<br />
<br />
Exact-<math> P_n = \frac{\sum_{w \in S^{n}_{\hat{x}}} \mathbb{I}[w \in S^{n}_{x}]}{|S^{n}_{\hat{x}}|} </math> <br />
<br />
Exact-<math> R_n = \frac{\sum_{w \in S^{n}_{x}} \mathbb{I}[w \in S^{n}_{\hat{x}}]}{|S^{n}_{x}|} </math> <br />
<br />
Here <math>S^{n}_{x}</math> and <math>S^{n}_{\hat{x}}</math> are lists of token <math>n</math>-grams in the reference <math>x</math> and candidate <math>\hat{x}</math> sentences respectively.<br />
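These exact-match scores can be sketched directly from the formulas (the example sentences are illustrative):<br />

```python
def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_pr(reference, candidate, n=1):
    """Exact n-gram precision and recall, following the formulas above:
    an n-gram in one sentence counts as matched if it appears in the other."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    ref_set, cand_set = set(ref), set(cand)
    precision = sum(1 for g in cand if g in ref_set) / len(cand)
    recall = sum(1 for g in ref if g in cand_set) / len(ref)
    return precision, recall

p, r = exact_pr("the cat sat on the mat".split(),
                "the cat is on the mat".split(), n=2)
print(p, r)  # 0.6 0.6
```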
<br />
Other categories include Edit-distance-based Metrics, Embedding-based metrics, and Learned Metrics. Most of these techniques do not capture the context of a word in the sentence. Moreover, Learned Metric approaches also require costly human judgments as supervision for each dataset.<br />
<br />
== Motivation ==<br />
The <math>n</math>-gram approaches like BLEU do not capture the positioning and the context of the word and simply rely on exact matching for evaluation. Consider the following example that shows how BLEU can result in incorrect judgment. <br><br />
Reference: people like foreign cars <br><br />
Candidate 1: people like visiting places abroad <br><br />
Candidate 2: consumers prefer imported cars<br />
<br />
BLEU gives a higher score to Candidate 1 as compared to Candidate 2. This undermines the performance of text generation models since contextually correct sentences are penalized. In contrast, some semantically different phrases are scored higher just because they are closer to the surface form of the reference sentence. <br />
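A minimal unigram-overlap computation (a rough proxy for the matching behind BLEU, not the full metric) reproduces this ranking:<br />

```python
def unigram_precision(reference, candidate):
    # Fraction of candidate tokens that appear in the reference.
    ref = set(reference.split())
    cand = candidate.split()
    return sum(1 for w in cand if w in ref) / len(cand)

ref = "people like foreign cars"
c1 = "people like visiting places abroad"   # semantically wrong
c2 = "consumers prefer imported cars"       # a valid paraphrase
print(unigram_precision(ref, c1), unigram_precision(ref, c2))  # 0.4 0.25
```

The semantically wrong candidate scores higher purely because of its surface overlap with the reference.<br />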
<br />
On the other hand, BERTScore computes similarity using contextual token embeddings. This helps it detect semantically correct paraphrased sentences. It also captures cause-and-effect relationships (A gives B versus B gives A) that the BLEU score fails to detect.<br />
<br />
== BERTScore Architecture ==<br />
Fig 1 summarizes the steps for calculating the BERTScore. Next, we will see details about each step. Here, the reference sentence is given by <math> x = \langle x_1, \ldots, x_k \rangle </math> and the candidate sentence by <math> \hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle </math>. <br><br />
<br />
<div align="center"> [[File:Architecture_BERTScore.PNG|Illustration of the computation of BERTScore.]] </div><br />
<div align="center">'''Fig 1'''</div><br />
<br />
=== Token Representation ===<br />
Reference and candidate sentences are represented using contextual embeddings. This is inspired by word embedding techniques, but in contrast to word embeddings, the contextual embedding of a word depends on the surrounding words in the sentence. These contextual embeddings are computed with BERT and similar models, which use self-attention and nonlinear transformations.<br />
<br />
<div align="center"> [[File:Pearsson_corr_contextual_emb.PNG|Pearson Correlation for Contextual Embedding]] </div><br />
<div align="center">'''Fig 2'''</div><br />
<br />
=== Cosine Similarity ===<br />
Pairwise cosine similarity is calculated between each token <math> x_{i} </math> in the reference sentence and each token <math> \hat{x}_{j} </math> in the candidate sentence. Since prenormalized vectors are used, the pairwise similarity reduces to the inner product <math> x_{i}^\top \hat{x}_{j} </math>.<br />
<br />
=== BERTScore ===<br />
<br />
Each token in x is matched to the most similar token in <math> \hat{x} </math> and vice-versa for calculating Recall and Precision respectively. The matching is greedy and isolated. Precision and Recall are combined for calculating the F1 score. The equations for calculating Precision, Recall, and F1 Score are as follows<br />
<br />
<div align="center"> [[File:Equations.PNG|Equations for the calculation of BERTScore.]] </div><br />
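The greedy-matching computation above can be sketched with numpy, assuming token embeddings are already computed (the optional idf weighting is omitted):<br />

```python
import numpy as np

def bertscore(ref_emb, cand_emb, eps=1e-8):
    """Greedy-matching BERTScore from pre-computed token embeddings.
    ref_emb: (k, d) reference tokens; cand_emb: (l, d) candidate tokens.
    Rows are L2-normalized so dot products equal cosine similarities."""
    rn = ref_emb / (np.linalg.norm(ref_emb, axis=1, keepdims=True) + eps)
    cn = cand_emb / (np.linalg.norm(cand_emb, axis=1, keepdims=True) + eps)
    sim = rn @ cn.T                      # (k, l) pairwise cosine similarities
    recall = sim.max(axis=1).mean()      # best match for each reference token
    precision = sim.max(axis=0).mean()   # best match for each candidate token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Sanity check: identical embeddings give a (near-)perfect score.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p, r, f1 = bertscore(emb, emb)
print(round(f1, 4))  # 1.0
```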
<br />
<br />
=== Importance Weighting (optional) ===<br />
In some cases, rare words can be highly indicative of sentence similarity. Therefore, inverse document frequency (idf) weights can be incorporated into the above BERTScore equations. This step is optional: depending on the domain of the text and the available data, it may or may not improve the final results. Understanding more about importance weighting thus remains an open area of research.<br />
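A simple idf computation over a reference corpus might look as follows (one standard formulation; the paper's exact smoothing may differ):<br />

```python
import math

def idf_weights(references):
    # references: list of token lists forming a reference corpus.
    # Rare tokens get a large weight; ubiquitous tokens get a weight near 0.
    m = len(references)
    vocab = {w for ref in references for w in ref}
    return {w: -math.log(sum(w in ref for ref in references) / m) for w in vocab}

refs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
w = idf_weights(refs)
print(round(w["dog"], 3))  # 1.099
```

These weights would multiply the per-token similarities in the precision/recall sums above, emphasizing content words over function words.<br />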
<br />
=== Baseline Rescaling ===<br />
Rescaling is done only to improve the human readability of the score. In theory, cosine similarity values lie between -1 and 1, but in practice they are confined to a much smaller range. A baseline value <math> b </math>, computed using Common Crawl monolingual datasets, is used to linearly rescale the BERTScore. The rescaled recall <math> \hat{R}_{BERT} </math> is given by<br />
<div align="center"> [[File:Equation2.PNG|Equation for the rescaled BERTScore.]] </div><br />
Similarly, <math> P_{BERT} </math> and <math> F_{BERT} </math> are rescaled as well.<br />
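The linear rescaling can be sketched as follows, with a hypothetical baseline value <math> b </math>:<br />

```python
def rescale(score, baseline):
    """Linear rescaling described above: maps the empirical range
    [baseline, 1] onto roughly [0, 1]. `baseline` is an assumed
    pre-computed constant (from Common Crawl in the paper)."""
    return (score - baseline) / (1 - baseline)

b = 0.85  # hypothetical baseline value
print(rescale(0.925, b))  # halfway between b and 1, i.e. about 0.5
```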
<br />
== Experiment & Results ==<br />
The authors experimented with different pre-trained contextual embedding models, such as BERT and RoBERTa, and report the results of the best-performing model. Evaluation is done on machine translation and image captioning tasks. <br />
<br />
=== Machine Translation ===<br />
The metric evaluation dataset consists of 149 translation systems, gold references, and two types of human judgments: segment-level and system-level. The former assigns a score to each reference-candidate pair, while the latter associates a single score with a whole system. Segment-level BERTScore outputs are calculated as explained in the previous section on architecture, and system-level outputs are obtained by averaging BERTScore over all reference-candidate pairs. Absolute Pearson correlation <math> \lvert \rho \rvert </math> and Kendall rank correlation <math> \tau </math> are used to measure metric quality, with the Williams test <sup> [1] </sup> for the significance of <math> \lvert \rho \rvert </math> and the Graham & Baldwin <sup> [2] </sup> method for bootstrap resampling of <math> \tau </math>. The authors also created hybrid systems by randomly sampling, for each reference sentence, one candidate sentence from one of the systems; this increases the number of systems available for system-level experiments. Further, they repeatedly selected 100 systems at random out of 10k hybrid systems and ranked them with the automatic metrics, reporting Hits@1: the percentage of runs in which the metric's top-ranked system agrees with the human top-ranked system. <br />
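The two correlation measures can be computed directly (toy scores for illustration; no tie correction for <math> \tau </math>):<br />

```python
import numpy as np
from itertools import combinations

def pearson(x, y):
    # Pearson correlation: covariance normalized by the standard deviations.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def kendall_tau(x, y):
    # Kendall tau: concordant-minus-discordant pairs over all pairs.
    pairs = list(combinations(range(len(x)), 2))
    s = sum(np.sign((x[i] - x[j]) * (y[i] - y[j])) for i, j in pairs)
    return s / len(pairs)

human = [1.0, 2.0, 3.0, 4.0]   # hypothetical human judgments
metric = [1.1, 1.9, 3.2, 3.8]  # hypothetical metric scores
print(round(abs(pearson(human, metric)), 3), kendall_tau(human, metric))
```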
<br />
<div align="center"> '''The following 4 tables show the result of the experiments mentioned above.''' </div> <br><br />
<br />
<div align="center"> [[File:Table1_BERTScore.PNG|700px| Table1 Machine Translation]] [[File:Table2_BERTScore.PNG|700px| Table2 Machine Translation]] </div><br />
<div align="center"> [[File:Table3_BERTScore.PNG|700px| Table3 Machine Translation]] [[File:Table4_BERTScore.PNG|700px| Table4 Machine Translation]] </div><br />
<br />
In all 4 tables, BERTScore is consistently a top performer, and it yields a large improvement over the widely used BLEU score. In to-English translation, RUSE shows competitive results, but it is a learned metric and therefore requires costly human judgments as supervision.<br />
<br />
=== Image Captioning ===<br />
For image captioning, human judgments for 12 submission entries from the COCO 2015 Captioning Challenge are used. Following Cui et al. (2018) <sup> [3] </sup>, Pearson correlation is computed with two system-level metrics: the percentage of captions evaluated as better than or equal to human captions (M1) and the percentage of captions that are indistinguishable from human captions (M2). Each image has approximately 5 reference captions, and the BERTScore is taken to be the maximum of the BERTScores computed individually against each reference caption. BERTScore is compared with 8 task-agnostic metrics and 2 task-specific metrics. <br />
<br />
<div align="center"> [[File:Table5_BERTScore.PNG|450px| Table5 Image Captioning]] </div><br />
<br />
<div align="center"> '''Table 5: Pearson correlation on the 2015 COCO Captioning Challenge.''' </div><br />
<br />
BERTScore is again a top performer and n-gram metrics like BLEU show a weak correlation with human judgments. For this task, importance weighting shows significant improvement depicting the importance of content words. <br />
<br />
'''Speed:''' Calculating BERTScore is not significantly slower than BLEU. For example, on the same hardware, the machine translation test takes 15.6 seconds with BERTScore compared to 5.4 seconds with BLEU. Both times are small in absolute terms, so the difference is marginal in practice.<br />
<br />
== Robustness Analysis ==<br />
The authors tested BERTScore's robustness using two adversarial paraphrase classification datasets, QQP and PAWS. The table below summarizes the results. Most metrics perform well on QQP, but their performance drops significantly on PAWS. Conversely, BERTScore remains competitive on PAWS, which suggests it is better at distinguishing harder adversarial examples.<br />
<br />
<div align="center"> [[File: bertscore.png | 500px]] </div><br />
<br />
== Source Code == <br />
The code for this paper is available at [https://github.com/Tiiiger/bert_score BERTScore].<br />
<br />
== Critique & Future Prospects==<br />
A text evaluation metric, BERTScore, is proposed that outperforms previous approaches thanks to its use of contextual embeddings for evaluation. It is simple and easy to use. BERTScore is also more robust than previous approaches, as shown by experiments on datasets of paraphrased sentences. Several variants of BERTScore exist, depending on the contextual embedding model, the use of importance weighting, and the evaluation metric (precision, recall, or F1 score). <br />
<br />
The main reason behind the success of BERTScore is the use of contextual embeddings; the remaining architecture is straightforward. Some word-embedding-based metrics use more sophisticated similarity measures, and combining such measures with contextual embeddings instead of word embeddings might yield even more reliable performance than BERTScore.<br />
<br />
<br />
The paper is quite interesting, though the proposed approach offers limited technical novelty: the method is a natural application of BERT combined with traditional cosine similarity measures, precision/recall/F1-based computations, and simple idf-based importance weighting.<br />
<br />
== References ==<br />
<br />
[1] Evan James Williams. Regression analysis. wiley, 1959.<br />
<br />
[2] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In EMNLP, 2014.<br />
<br />
[3] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge J. Belongie. Learning to evaluate image captioning. In CVPR, 2018.<br />
<br />
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.<br />
<br />
[5] Qingsong Ma, Ondrej Bojar, and Yvette Graham. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In WMT, 2018.<br />
<br />
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.<br />
<br />
[7] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. arXiv, abs/1907.11692, 2019b.</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:supervised_loss.PNG&diff=48788File:supervised loss.PNG2020-12-02T02:58:42Z<p>G45sharm: </p>
<hr />
<div></div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=48787Time-series Generative Adversarial Networks2020-12-02T02:58:29Z<p>G45sharm: </p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis generally focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but is not very promising in a generative setup. The GAN approach, when applied to time-series directly, simply tries to learn <math>p(X|t)</math> using a generator and discriminator setup, but this fails to leverage the prior probabilities the way autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (conditioned on ground truth) and open-loop inference (conditioned on previous guesses), autoregressive recurrent networks can suffer significant prediction error in multi-step sampling. Several remedies have been proposed: Scheduled Sampling, where models are trained to produce outputs from a mix of ground-truth and previously generated outputs; training an auxiliary discriminator that separates free-running from teacher-forced hidden states, accelerating convergence; and actor-critic methods that condition on target outputs, estimating next-token values that nudge the actor's free-running predictions. While all these methods try to improve multi-step sampling, they remain inherently deterministic.<br />
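The Scheduled Sampling idea mentioned above can be sketched as follows (hypothetical setup):<br />

```python
import numpy as np

def scheduled_inputs(ground_truth, model_outputs, p_truth, rng):
    """Scheduled-sampling style input mixing: at each step, feed the
    ground-truth value with probability p_truth, otherwise feed the
    model's own previous prediction. p_truth is typically annealed
    from 1 toward 0 over training."""
    use_truth = rng.random(len(ground_truth)) < p_truth
    return np.where(use_truth, ground_truth, model_outputs)

rng = np.random.default_rng(0)
gt = np.arange(10.0)                         # toy ground-truth sequence
pred = gt + rng.normal(scale=0.1, size=10)   # toy model predictions
mixed = scheduled_inputs(gt, pred, p_truth=1.0, rng=rng)
print(np.allclose(mixed, gt))  # True: p_truth=1 is pure teacher forcing
```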
<br />
Direct applications of the GAN architecture to time-series data, such as C-RNN-GAN and RCGAN, try to generate the time-series recurrently, sometimes taking the generated output from the previous step as input (as in RCGAN) along with the noise vector. Recently, conditioning on timestamp information has also been proposed in these setups to handle inconsistent sampling. However, these approaches remain very GAN-centric and depend only on the traditional adversarial (real/fake) feedback to learn, which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change frequently across time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, the input to the model can be thought of as a tuple <math>(S, X_{1:T})</math> with a joint distribution <math>p</math>. The objective of a generative model is to learn, from training data, an approximation <math>\hat{p}(S, X)</math> of the original distribution <math>p(S, X)</math>. Alongside this joint distribution, a second objective is to simultaneously learn the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T})\,||\,\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(X_t | S, X_{1:t-1})\,||\,\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the standard GAN components of a sequence generator and a sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representations, reducing the dimensionality of the original feature space. Let <math>H_S</math> and <math>H_X</math> denote the latent representations of the static features <math>S</math> and the temporal features <math>X</math>. The embedding function then has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions are implemented with a recurrent network for the embedding function <math>e</math> and a feedforward network for the recovery function <math>r</math>. These are implementation choices; either component can, in principle, be parametrized by any suitable architecture. <br />
<br />
=== Sequence Generator and Discriminator ===<br />
Turning to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. However, these do not operate on the original feature space: the sequence generator uses random input noise to generate sequences in the latent space. That is, the generator takes as input the noise vectors <math>Z_S</math>, <math>Z_X</math> and turns them into latent representations <math>H_S</math> and <math>H_X</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(S, X_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
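A minimal numpy sketch of this reconstruction objective, with hypothetical linear stand-ins for the embedding and recovery networks (the paper uses recurrent/feedforward networks instead):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d_feat, d_latent, T = 6, 3, 20

# Hypothetical linear stand-ins for the embedding (e) and recovery (r) maps.
W_e = rng.normal(size=(d_feat, d_latent))
W_r = np.linalg.pinv(W_e)  # an idealized approximate inverse

x = rng.normal(size=(T, d_feat))   # temporal features x_{1:T}
h = x @ W_e                        # latent codes h_{1:T} = e(x)
x_tilde = h @ W_r                  # reconstruction r(h)

# Reconstruction loss: mean squared error between x and its reconstruction.
loss_r = np.mean(np.sum((x - x_tilde) ** 2, axis=1))
print(loss_r >= 0.0)  # True; small only when e and r are near-inverse maps
```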
<br />
The generator component in TimeGAN receives not only the noise vector <math>Z</math> as input but also, in autoregressive fashion, its own previous outputs <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients are used to increase the likelihood of correct classification at the discriminator while the generator is trained to decrease it, i.e., to fool the discriminator. This yields the second objective function, the unsupervised loss.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary (real/fake) feedback from the GAN's adversarial component, the discriminator. It also incorporates a supervised loss computed through the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previously synthesized embeddings. Maximizing the likelihood of this step produces the third objective, the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div></div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:unsupervised_loss.PNG&diff=48786File:unsupervised loss.PNG2020-12-02T02:58:06Z<p>G45sharm: </p>
<hr />
<div></div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:recovery_loss.PNG&diff=48784File:recovery loss.PNG2020-12-02T02:57:48Z<p>G45sharm: </p>
<hr />
<div></div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=48783Time-series Generative Adversarial Networks2020-12-02T02:56:27Z<p>G45sharm: </p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis generally focuses on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting but is not very promising in a generative setup. The GAN approach, when applied to time-series directly, simply tries to learn <math>p(X|t)</math> using a generator and discriminator setup, but this fails to leverage the prior probabilities the way autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (conditioned on ground truth) and open-loop inference (conditioned on previous guesses), autoregressive recurrent networks can suffer significant prediction error in multi-step sampling. Several remedies have been proposed: Scheduled Sampling, where models are trained to produce outputs from a mix of ground-truth and previously generated outputs; training an auxiliary discriminator that separates free-running from teacher-forced hidden states, accelerating convergence; and actor-critic methods that condition on target outputs, estimating next-token values that nudge the actor's free-running predictions. While all these methods try to improve multi-step sampling, they remain inherently deterministic.<br />
<br />
Direct application of GAN architecture on time-series data like C-RNN-GAN or RCGAN try to generate the time-series data recurrently sometimes taking the generated output from the previous step as input (like in case of RCGAN) along with the noise vector. Recently, adding time stamp information for conditioning has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change frequently across time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, the input to the model can be thought of as a tuple <math>(S, X_{1:T})</math> with a joint distribution <math>p</math>. The objective of a generative model is to learn, from training data, an approximation <math>\hat{p}(S, X)</math> of the original distribution <math>p(S, X)</math>. Alongside this joint distribution, a second objective is to simultaneously learn the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_t p(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(S, X_{1:T})\,||\,\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}} D\left(p(X_t | S, X_{1:t-1})\,||\,\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
<br />
== Proposed Architecture ==<br />
Apart from the standard GAN components of a sequence generator and a sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all these components are trained concurrently. Figure 1 shows how these four components are arranged and how information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representations, reducing the dimensionality of the original feature space. Let <math>H_S</math> and <math>H_X</math> denote the latent representations of the static features <math>S</math> and the temporal features <math>X</math>. The embedding function then has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions are implemented with a recurrent network for the embedding function <math>e</math> and a feedforward network for the recovery function <math>r</math>. These are implementation choices; either component can, in principle, be parametrized by any suitable architecture. <br />
<br />
=== Sequence Generator and Discriminator ===<br />
Turning to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. However, these do not operate on the original feature space: the sequence generator uses random input noise to generate sequences in the latent space. That is, the generator takes as input the noise vectors <math>Z_S</math>, <math>Z_X</math> and turns them into latent representations <math>H_S</math> and <math>H_X</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
=== Architecture Workflow ===<br />
The embedding and recovery functions ought to guarantee an accurate reversible mapping between the feature space and the latent space. After the embedding function turns the original data <math>(S, X_{1:t})</math> into the embedding space i.e. <math>h_s</math>, <math>h_x</math>, the recovery function should be able to reconstruct the original data as accurately as possible from this latent representation. Denoting the reconstructed data by <math>\tilde{s}</math> and <math>\tilde{x}_{1:t}</math>, we get the first objective function of the reconstruction loss:<br />
<br />
<div align="center"> [[File:recovery_loss.PNG]] </div><br />
<br />
The generator component in TimeGAN receives not only the noise vector <math>Z</math> as input but also, in autoregressive fashion, its own previous outputs <math>h_s</math> and <math>h_{1:t}</math>. The generator uses these inputs to produce the synthetic embeddings. The unsupervised gradients are used to increase the likelihood of correct classification at the discriminator while the generator is trained to decrease it, i.e., to fool the discriminator. This yields the second objective function, the unsupervised loss.<br />
<br />
<div align="center"> [[File:unsupervised_loss.PNG]] </div><br />
<br />
As mentioned before, TimeGAN does not rely only on the binary (real/fake) feedback from the GAN's adversarial component, the discriminator. It also incorporates a supervised loss computed through the embedding and recovery functions. To ensure that the two segments of TimeGAN interact with each other, the generator is alternately fed embeddings of actual data instead of its own previously synthesized embeddings. Maximizing the likelihood of this step produces the third objective, the supervised loss:<br />
<br />
<div align="center"> [[File:supervised_loss.PNG]] </div></div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks&diff=48779Time-series Generative Adversarial Networks2020-12-02T02:35:25Z<p>G45sharm: </p>
<hr />
<div>== Presented By == <br />
Govind Sharma (20817244)<br />
<br />
== Introduction ==<br />
A time-series model should not only be good at learning the overall distribution of temporal features within different time points, but it should also be good at capturing the dynamic relationship between the temporal variables across time.<br />
<br />
The popular autoregressive approach in time-series or sequence analysis is generally focused on minimizing the error involved in multi-step sampling, thereby improving the temporal dynamics of the data. In this approach, the distribution of sequences is broken down into a product of conditional probabilities. The deterministic nature of this approach works well for forecasting, but it is not very promising in a generative setup. A GAN applied directly to time-series simply tries to learn <math>p(X|t)</math> using the generator and discriminator setup, but this fails to leverage the prior probabilities the way autoregressive models do.<br />
<br />
This paper proposes a novel GAN architecture that combines the two approaches (unsupervised GANs and supervised autoregressive) that allow a generative model to have the ability to preserve temporal dynamics along with learning the overall distribution. This mechanism has been termed as '''Time-series Generative Adversarial Network''' or '''TimeGAN'''. To incorporate supervised learning of data into the GAN architecture, this approach makes use of an embedding network that provides a reversible mapping between the temporal features and their latent representations. The key insight of this paper is that the embedding network is trained in parallel with the generator/discriminator network.<br />
<br />
This approach leverages the flexibility of GANs together with the control of the autoregressive model resulting in significant improvements in the generation of realistic time-series.<br />
<br />
== Related Work ==<br />
The TimeGAN mechanism combines ideas from different research threads in time-series analysis.<br />
<br />
Due to differences between closed-loop training (conditioned on ground truth) and open-loop inference (conditioned on previous guesses), multi-step sampling in autoregressive recurrent networks can incur significant prediction error. Several remedies have been proposed: Scheduled Sampling, where models are trained to produce outputs based on a combination of ground truth and their own previous outputs; training an auxiliary discriminator that helps separate free-running and teacher-forced hidden states, accelerating convergence; and actor-critic methods that condition on target outputs, estimating a next-token value that nudges the actor's free-running predictions. While all of these methods try to improve multi-step sampling, they are still inherently deterministic.<br />
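Scheduled Sampling, for instance, can be sketched with a toy 1-D autoregressive model (the transition weight `W` and the sampling probability `p` are illustrative; `p` is typically annealed from 1 toward 0 during training):

```python
import numpy as np

rng = np.random.default_rng(3)

x_true = rng.normal(size=12)   # ground-truth 1-D sequence
W = 0.8                        # hypothetical autoregressive weight

def model_step(prev):
    # One-step prediction of a toy linear autoregressive model.
    return W * prev

def rollout(p):
    # Scheduled sampling: at each step, condition on the ground truth
    # with probability p (teacher forcing) or on the model's own
    # previous output (free running).
    prev = x_true[0]
    preds = []
    for t in range(1, len(x_true)):
        preds.append(model_step(prev))
        prev = x_true[t] if rng.random() < p else preds[-1]
    return np.array(preds)

teacher_forced = rollout(1.0)  # always conditioned on ground truth
free_running = rollout(0.0)    # always conditioned on its own guesses
```

The two extremes make the closed-loop/open-loop gap concrete: the free-running rollout compounds its own one-step errors, which is exactly the discrepancy these remedies target.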
<br />
Direct applications of the GAN architecture to time-series data, such as C-RNN-GAN or RCGAN, try to generate the time-series data recurrently, sometimes taking the generated output from the previous step as input along with the noise vector (as in RCGAN). Recently, conditioning on timestamp information has also been proposed in these setups to handle inconsistent sampling. But these approaches remain very GAN-centric and depend only on the traditional adversarial feedback (fake/real) to learn, which is not sufficient to capture the temporal dynamics. <br />
<br />
== Problem Formulation ==<br />
Generally, time-series data can be decomposed into two components: static features (variables that remain the same over long or entire stretches of time) and temporal features (variables that change frequently across time steps). The paper uses <math>S</math> to denote the static component and <math>X</math> to denote the temporal features. In this setting, an input to the model can be thought of as a tuple <math>(S, X_{1:t})</math> with some joint distribution <math>p</math>. The objective of a generative model is, of course, to learn from the training data an approximation <math>\hat{p}(S, X)</math> of the original distribution <math>p(S, X)</math>. Along with this joint distribution, another objective is to simultaneously learn the autoregressive decomposition <math>p(S, X_{1:T}) = p(S)\prod_tp(X_t|S, X_{1:t-1})</math>. This gives the following two objective functions.<br />
<br />
<div align="center"><math>\min_{\hat{p}}D\left(p(S, X_{1:T})||\hat{p}(S, X_{1:T})\right)</math>, and </div><br />
<br />
<br />
<div align="center"><math>\min_{\hat{p}}D\left(p(X_t | S, X_{1:t-1})||\hat{p}(X_t | S, X_{1:t-1})\right)</math></div><br />
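The autoregressive decomposition can be checked numerically on a toy example: for a Gaussian AR(1) process, the sum of conditional log-densities equals the joint log-density exactly. This illustrates the factorization itself, not the paper's model; all parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

# Gaussian AR(1): x_1 ~ N(0, 1), x_t = phi * x_{t-1} + eps_t, eps_t ~ N(0, 1).
# Its joint density factorizes as p(x_1) * prod_t p(x_t | x_{t-1}),
# mirroring p(S, X_{1:T}) = p(S) * prod_t p(X_t | S, X_{1:t-1}).
phi, T = 0.5, 6
x = np.empty(T)
x[0] = rng.normal()
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal()

def log_normal(v, mean=0.0, var=1.0):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

# Autoregressive side: sum of conditional log-densities.
log_factorized = log_normal(x[0]) + sum(
    log_normal(x[t], mean=phi * x[t - 1]) for t in range(1, T)
)

# Joint side: x = A @ eps with A lower triangular (A[i, j] = phi^(i-j)),
# so x ~ N(0, A A^T); evaluate the multivariate normal log-density.
A = np.tril(phi ** np.subtract.outer(np.arange(T), np.arange(T)))
cov = A @ A.T
log_joint = -0.5 * (
    T * np.log(2 * np.pi)
    + np.log(np.linalg.det(cov))
    + x @ np.linalg.solve(cov, x)
)
```

The two quantities agree to numerical precision, which is why matching every conditional (the second objective) is consistent with matching the joint (the first objective).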
<br />
== Proposed Architecture ==<br />
Apart from the standard GAN components of a sequence generator and a sequence discriminator, TimeGAN has two additional elements: an embedding function and a recovery function. As mentioned before, all of these components are trained concurrently. Figure 1 shows how the four components are arranged and how information flows between them during training in TimeGAN.<br />
<br />
<div align="center"> [[File:Architecture_TimeGAN.PNG|Architecture of TimeGAN.]] </div><br />
<div align="center">'''Figure 1'''</div><br />
<br />
=== Embedding and Recovery Functions ===<br />
These functions map between the temporal features and their latent representations. This mapping reduces the dimensionality of the original feature space. Let <math>H_s</math> and <math>H_x</math> denote the latent representations of the static features <math>S</math> and the temporal features <math>X</math>. The embedding function then has the following form.<br />
<br />
<div align="center"> [[File:embedding_formula.PNG]] </div><br />
<br />
And similarly, the recovery function has the following form.<br />
<br />
<div align="center"> [[File:recovery_formula.PNG]] </div><br />
<br />
In the paper, these functions are implemented with a recurrent network for <math>e</math> and a feedforward network for <math>r</math>, but both are implementation choices and can be parametrized by any architecture. <br />
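A minimal sketch of one such choice — a single-layer vanilla RNN playing the role of <math>e</math> and a single feedforward layer playing the role of <math>r</math>. All names, weights, and dimensions here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)

F, H = 5, 3   # illustrative feature and latent dimensions

# Hypothetical fixed weights; in TimeGAN these would be trained jointly
# with the generator and discriminator.
W_in = rng.normal(scale=0.3, size=(F, H))
W_hh = rng.normal(scale=0.3, size=(H, H))
W_out = rng.normal(scale=0.3, size=(H, F))

def embed(x_seq):
    # e: map a feature sequence of shape (T, F) to latent codes (T, H).
    h_t = np.zeros(H)
    out = []
    for x_t in x_seq:
        h_t = np.tanh(x_t @ W_in + h_t @ W_hh)
        out.append(h_t)
    return np.array(out)

def recover(h_seq):
    # r: map latent codes back to the feature space, step by step.
    return h_seq @ W_out

x = rng.normal(size=(7, F))
h = embed(x)
x_tilde = recover(h)
```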
<br />
=== Sequence Generator and Discriminator ===<br />
Turning to the conventional GAN components of TimeGAN, there is a sequence generator and a sequence discriminator. These, however, do not operate in the original feature space; rather, the sequence generator uses the random input noise to generate sequences in the latent space. Thus, the generator takes as input the noise vectors <math>Z_s</math>, <math>Z_x</math> and turns them into the latent representations <math>H_s</math> and <math>H_x</math>. This function is implemented using a recurrent network. <br />
<br />
The discriminator takes as input the latent representation from the embedding space and produces its binary classification (synthetic/real). This is implemented using a bidirectional recurrent network with a feedforward output layer.<br />
<br />
</div>G45sharmhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat940F21&diff=48762stat940F212020-12-02T01:10:22Z<p>G45sharm: /* Paper presentation */</p>
<hr />
<div>== [[F20-STAT 946-Proposal| Project Proposal ]] ==<br />
<br />
<br />
= Record your contributions here [https://docs.google.com/spreadsheets/d/1Me_O000pNxeTwNGEac57XakecG1wahvwGE5n36DGIlM/edit?usp=sharing]=<br />
<br />
Use the following notations:<br />
<br />
P: You have written a summary/critique on the paper.<br />
<br />
T: You had a technical contribution on a paper (excluding the paper that you present).<br />
<br />
E: You had an editorial contribution on a paper (excluding the paper that you present).<br />
<br />
=Paper presentation=<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="60pt"|Date<br />
|width="100pt"|Name <br />
|width="30pt"|Paper number <br />
|width="700pt"|Title<br />
|width="30pt"|Link to the paper<br />
|width="30pt"|Link to the summary<br />
|width="30pt"|Link to the video<br />
|-<br />
|-<br />
|Sep 15 (example)||Ri Wang || ||Sequence to sequence learning with neural networks.||[http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Going_Deeper_with_Convolutions Summary] || [https://youtu.be/JWozRg_X-Vg?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG&t=539]<br />
|-<br />
|Week of Nov 2 || Jose Avilez || 1|| Gradientless Descent: High-Dimensional Zeroth-Order Optimisation || [https://openreview.net/pdf?id=Skep6TVYDB] || [[GradientLess Descent]] || [https://uofwaterloo-my.sharepoint.com/:v:/g/personal/jlavilez_uwaterloo_ca/ETNogDRpwJlPjSo5o0EY53UBLC7f0zmR9--a0uz6GYN8zw?e=J8V0f3 GLD Presentation] [[File:GradientLessDescent.pdf|Slides]] <br />
|-<br />
|Week of Nov 2 || Abhinav Chanana || 2||AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty || [https://openreview.net/pdf?id=S1gmrxHFvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Augmix:_New_Data_Augmentation_method_to_increase_the_robustness_of_the_algorithm#Conclusion Summary] || [[https://youtu.be/epBzlXHFNlY Presentation ]]<br />
|-<br />
|Week of Nov 2 || Maziar Dadbin || 3|| ALBERT: A Lite BERT for Self-supervised Learning of Language Representations || [https://openreview.net/pdf?id=H1eA7AEtvS paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=ALBERT:_A_Lite_BERT_for_Self-supervised_Learning_of_Language_Representations Summary]||<br />
|-<br />
|Week of Nov 2 ||John Landon Edwards || 4||From Variational to Deterministic Autoencoders ||[http://www.openreview.net/pdf?id=S1g7tpEYDS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=From_Variational_to_Deterministic_Autoencoders#Redesigned_Training_Loss_Function Summary] || [https://youtu.be/yW4eu3FWqIc Presentation]<br />
|-<br />
|Week of Nov 2 ||Wenyu Shen || 5|| Pre-training of Deep Bidirectional Transformers for Language Understanding || [https://arxiv.org/pdf/1810.04805.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F20/BERT:_Pre-training_of_Deep_Bidirectional_Transformers_for_Language_Understanding Summary] || [https://www.youtube.com/watch?v=vF5EoIFd2D8 Presentation video] <br />
|-<br />
|Week of Nov 2 || Syed Saad Naseem || 6|| Learning The Difference That Makes A Difference With Counterfactually-Augmented Data|| [https://openreview.net/pdf?id=Sklgs0NFvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Learning_The_Difference_That_Makes_A_Difference_With_Counterfactually-Augmented_Data Summary] || [https://youtu.be/bKC2BiTuSTQ Presentation video]<br />
|-<br />
|Week of Nov 9 || Donya Hamzeian || 7|| The Curious Case of Neural Text Degeneration || [https://iclr.cc/virtual_2020/poster_rygGQyrFvH.html Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=The_Curious_Case_of_Degeneration Summary] ||<br />
|-<br />
|Week of Nov 9 || Parsa Torabian || 8|| Orthogonal Gradient Descent for Continual Learning || [http://proceedings.mlr.press/v108/farajtabar20a/farajtabar20a.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=orthogonal_gradient_descent_for_continual_learning Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Arash Moayyedi || 9|| When Does Self-supervision Improve Few-shot Learning? || [https://openreview.net/forum?id=HkenPn4KPH Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=When_Does_Self-Supervision_Improve_Few-Shot_Learning%3F Summary] || Learn<br />
|-<br />
|Week of Nov 9 || Parsa Ashrafi Fashi || 10|| Learning to Generalize: Meta-Learning for Domain Generalization || [https://arxiv.org/pdf/1710.03463 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Meta-Learning_For_Domain_Generalization Summary]|| [https://youtu.be/b9MU5cc3-m0 Presentation Video]<br />
|-<br />
|Week of Nov 9 || Jaskirat Singh Bhatia || 11|| A Fair Comparison of Graph Neural Networks for Graph Classification || [https://openreview.net/pdf?id=HygDF6NFPB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=a_fair_comparison_of_graph_neural_networks_for_graph_classification Summary] || [https://drive.google.com/file/d/1Dx6mFL_zBAJcfPQdOWAuPn0_HkvTL_0z/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 9 || Gaurav Sikri || 12|| Breaking Certified Defenses: Semantic Adversarial Examples with Spoofed Robustness Certificates || [https://openreview.net/pdf?id=HJxdTxHYvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Breaking_Certified_Defenses:_Semantic_Adversarial_Examples_With_Spoofed_Robustness_Certificates Summary] || [https://drive.google.com/file/d/1amkWrR8ZQKnnInjedRZ7jbXTqCA8Hy1r/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Abhinav Jain || 13|| The Logical Expressiveness of Graph Neural Networks || [http://www.openreview.net/pdf?id=r1lZ7AEKvB Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=THE_LOGICAL_EXPRESSIVENESS_OF_GRAPH_NEURAL_NETWORKS Summary] || [https://drive.google.com/file/d/1mZVlF2UvJ2lGjuVcN5SYqBuO4jZjuCcU/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Gautam Bathla || 14|| One-Shot Object Detection with Co-Attention and Co-Excitation || [https://papers.nips.cc/paper/8540-one-shot-object-detection-with-co-attention-and-co-excitation.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=One-Shot_Object_Detection_with_Co-Attention_and_Co-Excitation Summary] || [https://drive.google.com/file/d/1OUx64_pTZzCQAdo_fmy_9h9NbuccTnn6/view?usp=sharing Presentation]<br />
|-<br />
|Week of Nov 16 || Shikhar Sakhuja || 15|| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems || [https://papers.nips.cc/paper/8589-superglue-a-stickier-benchmark-for-general-purpose-language-understanding-systems.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=SuperGLUE Summary] || [https://youtu.be/5h-365TPQqE Presentation]<br />
|-<br />
|Week of Nov 16 || Cameron Meaney || 16|| Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations || [https://www.sciencedirect.com/science/article/pii/S0021999118307125 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Physics-informed_neural_networks:_A_deep_learning_framework_for_solving_forward_and_inverse_problems_involving_nonlinear_partial_differential_equations Summary] || Learn<br />
|-<br />
|Week of Nov 16 ||Sobhan Hemati|| 17||Adversarial Fisher Vectors for Unsupervised Representation Learning||[https://papers.nips.cc/paper/9295-adversarial-fisher-vectors-for-unsupervised-representation-learning.pdf Paper]||[https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adversarial_Fisher_Vectors_for_Unsupervised_Representation_Learning Summary] || [https://www.youtube.com/watch?v=WKUj30tgHfs&feature=youtu.be video]<br />
|-<br />
|Week of Nov 16 ||Milad Sikaroudi|| 18||Domain Generalization via Model Agnostic Learning of Semantic Features||[https://papers.nips.cc/paper/8873-domain-generalization-via-model-agnostic-learning-of-semantic-features.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Model_Agnostic_Learning_of_Semantic_Features Summary]|| [https://youtu.be/djrJG6pJaL0 video] also available on Learn<br />
|-<br />
|Week of Nov 23 ||Bowen You|| 19||Dream to Control: Learning Behaviors by Latent Imagination||[https://openreview.net/pdf?id=S1lOTC4tDS Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=DREAM_TO_CONTROL:_LEARNING_BEHAVIORS_BY_LATENT_IMAGINATION Summary] || Learn<br />
|-<br />
|Week of Nov 23 ||Nouha Chatti|| 20|| This Looks Like That: Deep Learning for Interpretable Image Recognition||[https://papers.nips.cc/paper/9095-this-looks-like-that-deep-learning-for-interpretable-image-recognition.pdf Paper]|| ||<br />
|-<br />
|Week of Nov 23 || Mohan Wu || 21|| Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Cluster for Extreme Multi-label Text Classification || [https://proceedings.icml.cc/static/paper_files/icml/2020/807-Paper.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Extreme_Multi-label_Text_Classification Summary] || [https://www.youtube.com/watch?v=jG57QgY71yU video]<br />
|-<br />
|Week of Nov 23 || Xinyi Yan || 22|| Dense Passage Retrieval for Open-Domain Question Answering || [https://arxiv.org/abs/2004.04906 Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Dense_Passage_Retrieval_for_Open-Domain_Question_Answering Summary] || Learn<br />
|-<br />
|Week of Nov 23 || Meixi Chen || 23|| Functional Regularisation for Continual Learning with Gaussian Processes || [https://arxiv.org/pdf/1901.11356.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Functional_regularisation_for_continual_learning_with_gaussian_processes Summary]|| Learn<br />
|-<br />
|Week of Nov 23 || Ahmed Salamah || 24|| AdaCompress: Adaptive Compression for Online Computer Vision Services || [https://arxiv.org/pdf/1909.08148.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Adacompress:_Adaptive_compression_for_online_computer_vision_services Summary] || [https://youtu.be/D54qsSkqryk video] or Learn<br />
|-<br />
|Week of Nov 23 || Mohammad Mahmoud || 32||Mathematical Reasoning in Latent Space|| [https://iclr.cc/virtual_2020/poster_Ske31kBtPr.html?fbclid=IwAR2TQkabQkOzGcMl6bEJYggq8X8HIUoTudPIACX2v_ZT2LteARl_sPD-XdQ Paper] || ||<br />
|-<br />
|Week of Nov 30 ||Danial Maleki || 25||RoBERTa: A Robustly Optimized BERT Pretraining Approach ||[https://openreview.net/forum?id=SyxS0T4tvS Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Roberta Summary] || [https://youtu.be/JdfvvYbH-2s Presentation Video]<br />
|-<br />
|Week of Nov 30 ||Gursimran Singh || 26||BERTScore: Evaluating Text Generation with BERT||[https://openreview.net/pdf?id=SkeHuCVFDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=BERTScore:_Evaluating_Text_Generation_with_BERT Summary] || Learn<br />
|-<br />
|Week of Nov 30 || Govind Sharma || 27|| Time-series Generative Adversarial Networks || [https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Time-series_Generative_Adversarial_Networks Summary] || Learn<br />
|-<br />
|Week of Nov 30 ||Maral Rasoolijaberi|| 28||A critical analysis of self-supervision, or what we can learn from a single image|| [https://openreview.net/pdf?id=B1esx6EYvr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=CRITICAL_ANALYSIS_OF_SELF-SUPERVISION Summary]|| [https://youtu.be/HkkacHrvloE YouTube]<br />
|-<br />
|Week of Nov 30 || Sina Farsangi || 29|| Self-Supervised Learning of Pretext-Invariant Representations || [https://openaccess.thecvf.com/content_CVPR_2020/papers/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.pdf Paper]|| [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Self-Supervised_Learning_of_Pretext-Invariant_Representations Summary] || [https://www.youtube.com/watch?v=IlIPHclzV5E&ab_channel=sinaebrahimifarsangi YouTube] or Learn<br />
|-<br />
|Week of Nov 30 || Pierre McWhannel || 30|| Pre-training Tasks for Embedding-based Large-scale Retrieval || [https://openreview.net/pdf?id=rkg-mA4FDr Paper] || [https://wiki.math.uwaterloo.ca/statwiki/index.php?title=Pre-Training_Tasks_For_Embedding-Based_Large-Scale_Retrieval Summary]|| Learn<br />
|-<br />
|Week of Nov 30 || Wenjuan Qi || 31|| Network Deconvolution || [https://openreview.net/pdf?id=rkeu30EtvS Paper] || placeholder||</div>G45sharm