STAT946F17/ Improved Variational Inference with Inverse Autoregressive Flow

From statwiki
Revision as of 07:19, 2 October 2017 by SHKhan (talk | contribs)

Introduction

One of the most common ways to formalize machine learning models is through the use of latent variable models, wherein we have a probabilistic model for the joint distribution of observed datapoints $x$ and some "hidden" variables. The intuition is that the hidden variables share some sort of (perhaps complicated) causal relationship with the variables that are actually observed. The mixture of Gaussians provides a particularly nice example of a latent variable model. One way to think about a mixture with $K$ Gaussians is as follows. First, roll a $K$-sided die whose face $k$ comes up with probability $\pi_{k}$. Then randomly generate a point from the Gaussian distribution with parameters $\mu_{k}$ and $\Sigma_{k}$. The reason this is a hidden variable model is that, when we have a dataset coming from a mixture of Gaussians, we only get to see the datapoints that are generated at the end. For a given observed datapoint we neither get to see the outcome of the die roll that generated that point nor do we know what the probabilities $\pi_{k}$ are. The die outcome $k$ is therefore a hidden variable, while the $\pi_{k}$ are unknown parameters; estimating the $\pi_{k}$, together with the parameters $\mu_{k}$, $\Sigma_{k}$ determining observations, constitutes inference within the mixture of Gaussians model. Note that all the parameters to be estimated can be wrapped into a long vector $\theta = (\pi_{1}, \ldots, \pi_{K}, \mu_{1}, \Sigma_{1}, \ldots, \mu_{K}, \Sigma_{K})$.
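The die-then-Gaussian recipe above can be sketched in a few lines of NumPy. This is only an illustration: the function name `sample_gmm` and the particular values chosen for $\pi_{k}$, $\mu_{k}$ and $\Sigma_{k}$ are assumptions made for the example, not part of the original discussion.

```python
import numpy as np

def sample_gmm(n, pi, mus, sigmas, rng):
    """Draw n points from a mixture of Gaussians.

    Returns the observed points xs and the hidden die rolls ks.
    """
    # Step 1: roll the K-sided die once per datapoint (the hidden variable).
    ks = rng.choice(len(pi), size=n, p=pi)
    # Step 2: sample from N(mu_k, Sigma_k) via a Cholesky factor of Sigma_k.
    chol = np.linalg.cholesky(sigmas)                 # shape (K, d, d)
    z = rng.standard_normal((n, mus.shape[1]))        # standard normal noise
    xs = mus[ks] + np.einsum("nij,nj->ni", chol[ks], z)
    return xs, ks

# Hypothetical parameter values, chosen only for illustration.
rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])
mus = np.array([[-4.0, 0.0], [0.0, 0.0], [4.0, 4.0]])
sigmas = np.stack([np.eye(2), 2.0 * np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])])

xs, ks = sample_gmm(5000, pi, mus, sigmas, rng)
freq = np.bincount(ks, minlength=len(pi)) / len(ks)
print("empirical die frequencies:", freq)
```

In a real dataset only `xs` would be available; `ks` is returned here solely to make the hidden variable visible.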

More generally, latent variable models provide a powerful framework to mathematically encode a variety of phenomena which are naturally subject to stochasticity. Thus, they form an important part of the theory underlying many machine learning models. Indeed, it can even be said that most machine learning models, when viewed appropriately, are latent variable models. It behoves us therefore to obtain general methods which allow tractable inference within latent variable models. One such method is known as variational inference and it was introduced in its modern form around two decades ago in the seminal paper [1]. More recently, and more apropos of deep learning, stochastic versions of variational inference are being combined with neural networks to provide robust estimation of parameters in probabilistic models. The original impetus for this fusion apparently stems from the publication of [2] and [3]. In the interim, a cottage industry for the application of stochastic variational inference and related methods has seemingly sprung up, as witnessed by the variety of autoencoders currently being sold at the bazaar. The paper [4] represents another interesting contribution to parameter estimation by way of deep learning. Note that, at the time of writing, variational methods are being applied to a wide range of problems in machine learning and we will only develop the small part of the theory necessary for our purposes.

Black-box Variational Inference

The basic premise we start from is that we have a latent variable model $p_{\theta}(x, h)$, often called the generative model in the literature, with $x$ the observed variables and $h$ the hidden variables, and we wish to learn the parameters $\theta$. We also assume we are in a situation where the usual strategy of inference by maximum likelihood estimation is infeasible because marginalizing out the hidden variables is intractable. This assumption often holds in real-world applications since generative models for real phenomena are extremely difficult or impossible to integrate. Additionally, we would like to be able to compute the posterior $p(h\mid x)$ over hidden variables and, by Bayes' rule, this requires computation of the marginal distribution $p_{\theta}(x)$.

The variational inference approach entails positing a parametric family $q_{\phi}(h\mid x)$ of distributions and introducing new learning parameters $\phi$ which are obtained as solutions to an optimization problem. More precisely, we minimize the KL divergence between the true posterior and the approximate posterior. However, there is a slightly indirect approach we can take: we can find a generic lower bound for the log-likelihood $\log p_{\theta}(x)$ and optimize this lower bound. Observe that, for any parametrized distribution $q_{\phi}(h\mid x)$, we have \begin{align*} \log p_{\theta}(x) &= \log\int_{h}p_{\theta}(x,h)\,dh \\ &= \log\int_{h}p_{\theta}(x,h)\frac{q_{\phi}(h\mid x)}{q_{\phi}(h\mid x)}\,dh \\ &= \log\int_{h}\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}q_{\phi}(h\mid x)\,dh \\ &= \log\mathbb{E}_{q}\left[\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\ &\geq \mathbb{E}_{q}\left[\log\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\ &= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\ &\colon = \mathcal{L}(x, \theta, \phi), \end{align*} where the inequality is an application of Jensen's inequality for the (concave) logarithm function and $\mathcal{L}(x,\theta,\phi)$ is known as the evidence lower bound (ELBO). The gap in this bound is exactly the KL divergence from the approximate posterior to the true posterior, $\log p_{\theta}(x) - \mathcal{L}(x,\theta,\phi) = \mathrm{KL}\left(q_{\phi}(h\mid x)\,\|\,p_{\theta}(h\mid x)\right) \geq 0$, so maximizing the ELBO over $\phi$ is the same as the KL minimization mentioned above. If we iteratively choose values for $\theta$ and $\phi$ such that $\mathcal{L}(x,\theta,\phi)$ increases, then we raise a guaranteed lower bound on the log-likelihood $\log p_{\theta}(x)$ (that is, there is no guarantee that a step which increases $\mathcal{L}(x,\theta,\phi)$ will also increase $\log p_{\theta}(x)$, but $\log p_{\theta}(x)$ can never be smaller than the current value of $\mathcal{L}(x,\theta,\phi)$). The natural search strategy now is to use stochastic gradient ascent on $\mathcal{L}(x,\theta,\phi)$. This requires the derivatives $\nabla_{\theta}\mathcal{L}(x,\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$.
Note that everything that we have done so far is completely general and independent of any specific modelling assumptions we may have had to make. Indeed, it is the model independence of this approach which led the authors of [5] to christen it black-box variational inference.
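The lower-bound property can be checked numerically on a model small enough to marginalize exactly. The sketch below assumes a toy model that does not appear in the original text: a binary hidden variable $h\in\{0,1\}$ with a fixed prior and a unit-variance Gaussian likelihood. All names and numbers are illustrative.

```python
import numpy as np

def log_normal(x, mean, var=1.0):
    """Log density of a univariate Gaussian N(mean, var) at x."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def elbo(q, log_joint):
    """ELBO for a discrete hidden variable: sum_h q(h) [log p(x, h) - log q(h)]."""
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * (log_joint - np.log(q))))

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

# Toy generative model (assumed for illustration): h in {0, 1}, x | h ~ N(mean_h, 1).
x = 0.7
prior = np.array([0.3, 0.7])
means = np.array([-1.0, 2.0])

log_joint = np.log(prior) + log_normal(x, means)   # log p(x, h) for h = 0, 1
log_px = float(np.logaddexp.reduce(log_joint))     # exact log p(x)
posterior = np.exp(log_joint - log_px)             # exact p(h | x)

q_guess = np.array([0.5, 0.5])                     # an arbitrary approximate posterior
gap = log_px - elbo(q_guess, log_joint)
print("log p(x)      :", log_px)
print("ELBO(q_guess) :", elbo(q_guess, log_joint))
print("gap vs KL     :", gap, kl(q_guess, posterior))
```

The gap between $\log p_{\theta}(x)$ and the ELBO matches the KL divergence between the chosen $q$ and the exact posterior, and it closes when $q$ is set to the exact posterior.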

The immediate question we should ask is whether we can obtain the gradients $\nabla_{\theta}\mathcal{L}(x,\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$ in closed form. In general, the answer is no since we are taking expectations with respect to $q$ and, for any approximate posterior rich enough to be realistic, such expectations are intractable integrals. So, we do the next best thing and calculate Monte Carlo estimators for these gradients. Before moving on, it will be helpful to massage the ELBO like so.
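One standard way to carry this out is sketched here under the assumption that the log-derivative (score function) identity $\nabla_{\phi}q_{\phi} = q_{\phi}\nabla_{\phi}\log q_{\phi}$ is used, as is common in black-box variational inference; the sample count $S$ and the samples $h^{(s)}$ are notation introduced for this sketch. Writing the ELBO as an integral against $q_{\phi}$ and differentiating under the integral sign gives \begin{align*} \nabla_{\theta}\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q}\left[\nabla_{\theta}\log p_{\theta}(x,h)\right] \approx \frac{1}{S}\sum\limits^{S}_{s=1}\nabla_{\theta}\log p_{\theta}(x,h^{(s)}), \\ \nabla_{\phi}\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q}\left[\left(\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right)\nabla_{\phi}\log q_{\phi}(h\mid x)\right] \\ &\approx \frac{1}{S}\sum\limits^{S}_{s=1}\left(\log p_{\theta}(x,h^{(s)}) - \log q_{\phi}(h^{(s)}\mid x)\right)\nabla_{\phi}\log q_{\phi}(h^{(s)}\mid x), \end{align*} where $h^{(1)},\ldots,h^{(S)}$ are independent samples from $q_{\phi}(h\mid x)$. The second identity also uses $\mathbb{E}_{q}\left[\nabla_{\phi}\log q_{\phi}(h\mid x)\right] = 0$, which removes the term coming from differentiating $\log q_{\phi}$ inside the expectation.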

But this raises another question: are these Monte Carlo estimates reliable enough to be useful in practice? Again, the immediate answer is negative. The main hurdle is that the second Monte Carlo estimate, i.e., the estimator for the gradient with respect to the variational parameters $\phi$, is very noisy. Hence, making this approach work requires techniques for variance reduction in the gradient estimator. But we cannot get something out of nothing and modelling assumptions are the toll we have to pay in order to cross the bridge leading towards a practical and usable Monte Carlo estimator for $\nabla_{\phi}\mathcal{L}(x, \theta, \phi)$. The paper [6] exploits a particular set of techniques and assumptions to provide such estimators (for certain contexts).
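The noise problem is easy to see on a toy expectation. The sketch below compares a score-function estimator with a reparameterized (pathwise) estimator for $\frac{d}{d\mu}\mathbb{E}_{h\sim\mathcal{N}(\mu,1)}[h^{2}] = 2\mu$. The reparameterized estimator is used here purely as an illustrative contrast and relies on the extra assumption that $h$ can be written as $\mu + \epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$; it is not claimed to be the specific technique of [6]. The test function $h^{2}$ and the value of $\mu$ are arbitrary choices.

```python
import numpy as np

def score_function_grad(mu, eps):
    """Per-sample score-function estimates of d/dmu E_{h ~ N(mu, 1)}[h^2]."""
    h = mu + eps
    # f(h) * d/dmu log N(h; mu, 1) = h^2 * (h - mu)
    return h ** 2 * (h - mu)

def reparam_grad(mu, eps):
    """Per-sample pathwise estimates: d/dmu (mu + eps)^2 = 2 * (mu + eps)."""
    return 2.0 * (mu + eps)

rng = np.random.default_rng(0)
mu = 1.5
eps = rng.standard_normal(200_000)   # shared noise for both estimators

sf = score_function_grad(mu, eps)
rp = reparam_grad(mu, eps)

print("true gradient        :", 2.0 * mu)
print("score-function mean  :", sf.mean(), " variance:", sf.var())
print("reparameterized mean :", rp.mean(), " variance:", rp.var())
```

Both estimators are unbiased, but the per-sample variance of the score-function estimator is more than ten times larger in this example, which is why variance reduction or additional modelling assumptions are needed in practice.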

Variational Inference using Normalizing Flows

While our main goal is to describe [7], the paper [8] provides a nice conceptual warm-up and we will now take a detour through some of the points presented in the latter. The point of departure for [8] is exactly the same as for the auto-encoding variational Bayes paper of Kingma and Welling, including the use of a recognition model for the approximate posterior $q_{\phi}(h\mid x)$. The main contribution of [8], however, lies in a novel technique for creating a rich class of approximate posteriors starting from relatively simple ones. This is important since one of the main drawbacks to the variational approach is that it requires assumptions on the form of the approximate posterior $q_{\phi}(h\mid x)$ and practicality often forces us to stick to simple distributions which fail to capture rich, multimodal properties of the true posterior $p(h\mid x)$. The primary technical tool used in [8] to achieve complexity in the approximate posterior is what is known as a normalizing flow, which entails using a series of invertible functions to transform simple probability densities into more complex densities.

Suppose we have a random variable $h$ with probability density $q(h)$ and that $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is an invertible function with inverse $g:\mathbb{R}^{d}\to\mathbb{R}^{d}$. A basic result in probability states that the random variable $h'\colon = f(h)$ has density $$q'(h') = q(h)\Bigg\lvert\det\frac{\partial f}{\partial h}\Bigg\rvert^{-1}.$$ Chaining together a (finite) sequence of invertible maps $f_{1},\ldots,f_{K}$ and applying it to the distribution $q_{0}$ of a random variable $h_{0}$ leads to the formula $$q_{K}(h_{K}) = q_{0}(h_{0})\prod\limits^{K}_{k=1}\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert^{-1},$$ where $h_{k}\colon = f_{k}(h_{k-1})$ and $q_{k}$ is the density associated to $h_{k}$.
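The change-of-variables formula can be checked numerically in a case where the transformed density is known in closed form. The sketch below assumes a single affine map $f(h) = Ah + b$ applied to a standard normal $h$, so that $f(h)\sim\mathcal{N}(b, AA^{T})$; the matrix `A`, the vector `b` and the helper names are illustrative choices, not taken from [8].

```python
import numpy as np

def log_std_normal(h):
    """Log density of N(0, I) evaluated at each row of h."""
    d = h.shape[-1]
    return -0.5 * (d * np.log(2.0 * np.pi) + np.sum(h ** 2, axis=-1))

def log_mvn(y, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of y."""
    d = y.shape[-1]
    diff = y - mean
    sol = np.linalg.solve(cov, diff.T).T          # cov^{-1} (y - mean), row-wise
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + np.sum(diff * sol, axis=-1))

# An invertible affine map f(h) = A h + b (values chosen for illustration).
A = np.array([[2.0, 0.5], [0.0, 0.5]])
b = np.array([1.0, -1.0])

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 2))                   # samples from q = N(0, I)
h_new = h @ A.T + b                               # h' = f(h)

# Change of variables: log q'(h') = log q(h) - log |det df/dh|, and df/dh = A.
_, log_abs_det = np.linalg.slogdet(A)
log_q_new = log_std_normal(h) - log_abs_det

# Closed form for comparison: h' ~ N(b, A A^T).
reference = log_mvn(h_new, b, A @ A.T)
print(np.max(np.abs(log_q_new - reference)))
```

For a chain of maps the same bookkeeping applies step by step: each $f_{k}$ contributes one more $-\log\lvert\det\partial f_{k}/\partial h_{k-1}\rvert$ term to the running log density.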
Taking logarithms, we can equivalently rewrite the chained density formula as $$\log q_{K}(h_{K}) = \log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert.$$ So, if we were to start from a simple distribution $q_{0}(h_{0})$, choose a sequence of functions $f_{1},\ldots,f_{K}$ and then define $$q_{\phi}(h\mid x)\colon = q_{K}(h_{K}),$$ we can manipulate the ELBO as follows: \begin{align*} \mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K}) - \log q_{K}(h_{K})\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right]. \end{align*} The reason this is a useful thing to do is the law of the unconscious statistician (LOTUS) as applied to $h_{K} = f_{K}\circ\cdots\circ f_{1}(h_{0})$ with $h_{0}\sim q_{0}$: $$\mathbb{E}_{q_{K}}\left[s(h_{K})\right] = \mathbb{E}_{q_{0}}\left[s(f_{K}\circ\cdots\circ f_{1}(h_{0}))\right]$$ for any fixed function $s$. Hence, \begin{align*} \mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{K}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\ &= \mathbb{E}_{q_{0}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{0}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{0}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right], \end{align*} where in the last line $h_{K} = f_{K}\circ\cdots\circ f_{1}(h_{0})$, so we are only computing expectations with respect to the simple distribution $q_{0}$.
The latter expression for the ELBO is called the flow-based free energy bound in [8]. Note that $\phi$ has apparently disappeared in the final expression even though it is still present in $\mathcal{L}(x,\theta,\phi)$. This is an illusion: the parameters $\phi$ are associated to $q_{\phi}(h\mid x) = q_{K}(h_{K})$ and $q_{K}(h_{K})$ depends on the quantities $q_{0}(h_{0})$ and $\frac{\partial f_{k}}{\partial h_{k-1}}$. Thus, $\phi$ now encapsulates the defining parameters of $q_{0}$ and the $f_{k}$. An example below will help clarify this point.

As we have established by now, it is not enough to have a lower bound on the log-likelihood. We must be able to estimate gradients for the lower bound in practice.
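To make the role of $\phi$ concrete, here is a minimal sketch of a Monte Carlo estimate of the flow-based bound, written as $\mathbb{E}_{q_{0}}\left[\log p_{\theta}(x,h_{K}) - \log q_{0}(h_{0}) + \sum_{k}\log\lvert\det\partial f_{k}/\partial h_{k-1}\rvert\right]$, which is what one obtains by substituting the expression for $\log q_{K}(h_{K})$ into the ELBO. Everything specific here is an assumption made for illustration: a one-dimensional model with prior $h\sim\mathcal{N}(0,1)$ and likelihood $x\mid h\sim\mathcal{N}(h,1)$, a standard normal $q_{0}$, and affine maps $f_{k}(h) = s_{k}h + t_{k}$ whose scales and shifts play the role of $\phi$. Affine maps are far simpler than the flows used in [8], but they keep the exact answer available for comparison.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def flow_elbo(x, flows, n_samples, rng):
    """Monte Carlo estimate of the flow-based ELBO using samples from q_0 only.

    flows is a list of (scale, shift) pairs defining f_k(h) = scale * h + shift.
    """
    h = rng.standard_normal(n_samples)            # h_0 ~ q_0 = N(0, 1)
    log_q0 = log_normal(h, 0.0, 1.0)
    sum_log_det = np.zeros(n_samples)
    for scale, shift in flows:
        h = scale * h + shift                     # h_k = f_k(h_{k-1})
        sum_log_det += np.log(abs(scale))         # log |det df_k/dh_{k-1}|
    # Toy generative model: h ~ N(0, 1), x | h ~ N(h, 1).
    log_joint = log_normal(h, 0.0, 1.0) + log_normal(x, h, 1.0)
    return float(np.mean(log_joint - log_q0 + sum_log_det))

rng = np.random.default_rng(0)
x = 1.2
log_px = float(log_normal(x, 0.0, 2.0))           # exact marginal: x ~ N(0, 2)

# This flow maps q_0 onto the exact posterior N(x / 2, 1 / 2).
exact_flow = [(np.sqrt(0.5), 0.0), (1.0, x / 2.0)]
# The identity flow leaves q_K equal to the prior N(0, 1).
identity_flow = [(1.0, 0.0)]

elbo_exact = flow_elbo(x, exact_flow, 100_000, rng)
elbo_identity = flow_elbo(x, identity_flow, 100_000, rng)
print("log p(x)            :", log_px)
print("ELBO, exact flow    :", elbo_exact)
print("ELBO, identity flow :", elbo_identity)
```

When the flow reproduces the exact posterior the bound is tight, and every sample contributes the same value $\log p_{\theta}(x)$; with the identity flow the estimate sits visibly below $\log p_{\theta}(x)$. Changing the `(scale, shift)` pairs changes the bound, which is the sense in which $\phi$ is still present.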

References

<references/>