STAT946F17/ Improved Variational Inference with Inverse Autoregressive Flow
\section{Introduction}
One of the most common ways to formalize machine learning models is through the use of \textbf{latent variable models}, wherein we have a probabilistic model for the joint distribution between observed datapoints $x$ and some \textit{hidden} variables. The intuition is that the hidden variables share some sort of (perhaps complicated) causal relationship with the variables that are actually observed. The \textbf{mixture of Gaussians} provides a particularly nice example of a latent variable model. One way to think about a mixture with $K$ Gaussians is as follows. First, roll a $K$-sided die and suppose that the result is $k$ with probability $\pi_{k}$. Then randomly generate a point from the Gaussian distribution with parameters $\mu_{k}$ and $\Sigma_{k}$. The reason this is a hidden variable model is that, when we have a dataset coming from a mixture of Gaussians, we only get to see the datapoints that are generated at the end. For a given observed datapoint we neither get to see the outcome $k$ of the die roll that generated that point nor do we know what the probabilities $\pi_{k}$ are. The die outcomes are therefore the hidden variables, while the $\pi_{k}$, together with the $\mu_{k}$ and $\Sigma_{k}$, are unknown parameters; inference within the mixture of Gaussians model consists of estimating these parameters and inferring which component generated each observation. Note that all the parameters to be estimated can be wrapped into a long vector $\theta = (\pi_{1}, \ldots, \pi_{K}, \mu_{1}, \Sigma_{1}, \ldots, \mu_{K}, \Sigma_{K})$.
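The generative process described above can be sketched in a few lines of Python. This is a minimal illustration that assumes one-dimensional Gaussians (so each $\Sigma_{k}$ reduces to a standard deviation); the function name \texttt{sample\_mixture} and the parameter values are hypothetical choices made only for this example.

```python
import random

def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n points from a 1-D mixture of Gaussians.

    Returns (points, components). In a real dataset only `points` is observed;
    `components` plays the role of the hidden die rolls.
    """
    rng = random.Random(seed)
    points, components = [], []
    for _ in range(n):
        # Roll the K-sided die: component k is chosen with probability weights[k].
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Generate the observation from the chosen Gaussian.
        points.append(rng.gauss(means[k], stds[k]))
        components.append(k)
    return points, components

# Hypothetical parameters for K = 3 components (illustrative values only).
weights = [0.5, 0.3, 0.2]
means = [-4.0, 0.0, 5.0]
stds = [1.0, 0.5, 2.0]

points, components = sample_mixture(10_000, weights, means, stds)
# Empirical frequency of each component; these should be close to `weights`.
freqs = [components.count(k) / len(components) for k in range(len(weights))]
```

Discarding \texttt{components} and keeping only \texttt{points} reproduces the situation faced in inference: the die rolls are gone and the parameters have to be recovered from the observations alone.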
More generally, latent variable models provide a powerful framework to mathematically encode a variety of phenomena which are naturally subject to stochasticity. Thus, they form an important part of the theory underlying many machine learning models. Indeed, it can even be said that most machine learning models, when viewed appropriately, are latent variable models. It behoves us therefore to obtain general methods which allow tractable inference within latent variable models. One such method is known as \textbf{variational inference} and, in its modern form, it was introduced to machine learning around two decades ago in the seminal paper \cite{jordanVI}. More recently, and more apropos of deep learning, stochastic versions of variational inference are being combined with neural networks to provide robust estimation of parameters in probabilistic models. The original impetus for this fusion apparently stems from the publication of \cite{autoencoderKingma} and \cite{autoencoderRezende}. In the interim, a cottage industry for the application of stochastic variational inference, or methods related to it, has seemingly sprung up, especially as witnessed by the variety of autoencoders currently being sold at the bazaar. The paper \cite{946paper} represents another interesting contribution to parameter estimation by way of deep learning. Note that, at the time of writing, variational methods are being applied to a wide range of problems in machine learning and we will only develop the small part of the theory necessary for our purposes. See \cite{VISurvey} for a survey.
\section{Black-box Variational Inference}
The basic premise we start from is that we have a latent variable model $p_{\theta}(x, h)$, often called the \textbf{generative model} in the literature, with $x$ the observed variables and $h$ the hidden variables, and we wish to learn the parameters $\theta$. We also assume we are in a situation where the usual strategy of inference by maximum likelihood estimation is infeasible because marginalizing out the hidden variables is intractable. This assumption often holds in real-world applications since generative models for real phenomena are extremely difficult or impossible to integrate. Additionally, we would like to be able to compute the posterior $p_{\theta}(h\mid x)$ over hidden variables and, by Bayes' rule, this requires computation of the marginal distribution $p_{\theta}(x)$.
The variational inference approach entails positing a parametric family $q_{\phi}(h\mid x)$ of distributions, also called the \textbf{inference model}, and introducing new learning parameters $\phi$ which are obtained as solutions to an optimization problem. More precisely, we minimize the KL divergence between the approximate posterior and the true posterior. However, there is a slightly indirect approach we can take: we can find a generic lower bound for the log-likelihood $\log p_{\theta}(x)$ and optimize this lower bound. Observe that, for any parametrized distribution $q_{\phi}(h\mid x)$, we have \begin{align*} \log p_{\theta}(x) &= \log\int_{h}p_{\theta}(x,h)\,dh \\ &= \log\int_{h}p_{\theta}(x,h)\frac{q_{\phi}(h\mid x)}{q_{\phi}(h\mid x)}\,dh \\ &= \log\int_{h}\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}q_{\phi}(h\mid x)\,dh \\ &= \log\mathbb{E}_{q}\left[\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\ &\geq \mathbb{E}_{q}\left[\log\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\ &= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\ &\coloneqq \mathcal{L}(x, \theta, \phi), \end{align*} where the inequality is an application of Jensen's inequality for the logarithm function and $\mathcal{L}(x,\theta,\phi)$ is known as the \textbf{evidence lower bound (ELBO)}. If we iteratively choose values for $\theta$ and $\phi$ such that $\mathcal{L}(x,\theta,\phi)$ increases, then we raise a lower bound on the log-likelihood $\log p_{\theta}(x)$. This does not guarantee that $\log p_{\theta}(x)$ itself increases at every step, since the gap between the log-likelihood and the bound can change as well; the guarantee is only that $\log p_{\theta}(x)$ never falls below the current value of $\mathcal{L}(x,\theta,\phi)$. The natural search strategy now is to use stochastic gradient ascent on $\mathcal{L}(x,\theta,\phi)$. This requires the derivatives $\nabla_{\theta}\mathcal{L}(x,\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$.
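As a sanity check on the bound, the following sketch estimates the ELBO by Monte Carlo in a toy model where $\log p(x)$ is known exactly. The model is an assumption made purely for illustration and is not taken from the cited papers: $h \sim \mathcal{N}(0, 1)$ and $x \mid h \sim \mathcal{N}(h, 1)$, so that $p(x) = \mathcal{N}(0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The approximate posterior is taken to be $\mathcal{N}(m, s^{2})$; all function names are hypothetical.

```python
import math
import random

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def elbo_estimate(x, m, s, n_samples=50_000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, h) - log q(h | x)] with q = N(m, s^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        h = rng.gauss(m, s)  # h ~ q(h | x)
        log_joint = log_normal(h, 0.0, 1.0) + log_normal(x, h, 1.0)
        total += log_joint - log_normal(h, m, s)
    return total / n_samples

x = 1.5
log_evidence = log_normal(x, 0.0, math.sqrt(2.0))         # exact log p(x)
elbo_poor = elbo_estimate(x, m=0.0, s=1.0)                # q set to the prior
elbo_exact = elbo_estimate(x, m=x / 2, s=math.sqrt(0.5))  # q set to the true posterior
```

With the crude choice of $q$ the estimate sits strictly below $\log p(x)$, while choosing $q$ equal to the true posterior makes the bound tight: in that case the integrand is constant in $h$ and equals $\log p(x)$.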
Before moving on, we note that there are alternative ways of expressing the ELBO which can either provide insight or aid in further calculation. For one alternative form, note that we can massage the ELBO like so: \begin{align*} \mathcal{L}(x, \theta, \phi) = & \mathbb{E}_{q} \left[ \log p_{\theta}(x, h) - \log q_{\phi}(h \mid x) \right] \\ = & \mathbb{E}_{q} \left[ \log \left( p_{\theta}(h \mid x) p_{\theta}(x) \right) - \log q_{\phi}(h \mid x) \right] \\ = & \mathbb{E}_{q} \left[ \log p_{\theta}(h \mid x) + \log p_{\theta}(x) - \log q_{\phi}(h \mid x) \right] \\ = & \mathbb{E}_{q} \left[ \log p_{\theta}(x) - \log q_{\phi}(h \mid x) + \log p_{\theta}(h \mid x) \right] \\ = & \mathbb{E}_{q} \left[ \log p_{\theta}(x) \right] - \mathbb{E}_{q} \left[ \log q_{\phi}(h \mid x) - \log p_{\theta}(h \mid x) \right] \\ = & \log p_{\theta}(x) - \mathrm{KL}\left[ q_{\phi}(h \mid x) \,\|\, p_{\theta}(h \mid x) \right]. \end{align*} The last expression has a very simple interpretation: for fixed $\theta$, maximizing $\mathcal{L}(x, \theta, \phi)$ over $\phi$ is equivalent to minimizing the KL divergence between the approximate posterior $q_{\phi}(h \mid x)$ and the actual posterior $p_{\theta}(h \mid x)$. In fact, we can rewrite the above equation as a ``conservation law'' \[ \mathcal{L}(x, \theta, \phi) + \mathrm{KL} \left[ q_{\phi}(h \mid x) \,\|\, p_{\theta}(h \mid x) \right] = \log p_{\theta}(x). \] On the other hand, we can also write \begin{align*} \mathcal{L}(x, \theta, \phi) = & \mathbb{E}_{q} \left[ \log p_{\theta}(x, h) - \log q_{\phi}(h \mid x) \right] \\ = & \mathbb{E}_{q} \left[ \log \left( p_{\theta}(x \mid h) p_{\theta}(h) \right) - \log q_{\phi}(h \mid x) \right] \\ = & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) + \log p_{\theta}(h) - \log q_{\phi}(h \mid x) \right] \\ = & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) - \log q_{\phi}(h \mid x) + \log p_{\theta}(h) \right] \\ = & \mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) \right] - \mathrm{KL}\left[ q_{\phi}(h \mid x) \,\|\, p_{\theta}(h) \right], \end{align*} and the interpretation here is a bit more interesting.
Recall that $q_{\phi}(h \mid x)$ is a distribution we get to choose, and choosing a ``good'' distribution means choosing something which we believe is faithful to the way observations get ``encoded'' or ``compressed'' into ``hidden representations''. Conversely, $p_{\theta}(x \mid h)$ may be thought of as a ``decoder'' which unpacks latent ``codes'' into observations. Thus, we can think of $\mathbb{E}_{q} \left[ \log p_{\theta}(x \mid h) \right]$ as the negative of the expected reconstruction error when we use $q_{\phi}$ as an encoder. The KL term is then interpreted as a regularizer which restricts divergence of the encoder from the prior distribution over latent codes.
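The two forms of the ELBO can be compared numerically in a model where every term is available in closed form. The sketch below assumes, for illustration only, a prior $h \sim \mathcal{N}(0,1)$, a likelihood $x \mid h \sim \mathcal{N}(h, 1)$ and an encoder $q(h \mid x) = \mathcal{N}(m, s^{2})$; in this model $\log p(x) = -\tfrac{1}{2}\log(4\pi) - x^{2}/4$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. The function names are hypothetical.

```python
import math

def kl_normal(m1, s1, m2, s2):
    """KL[N(m1, s1^2) || N(m2, s2^2)] in closed form."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def elbo_posterior_form(x, m, s):
    """log p(x) - KL[q(h | x) || p(h | x)]."""
    log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0
    return log_evidence - kl_normal(m, s, x / 2, math.sqrt(0.5))

def elbo_reconstruction_form(x, m, s):
    """E_q[log p(x | h)] - KL[q(h | x) || p(h)]."""
    # E_q[(x - h)^2] = (x - m)^2 + s^2 when h ~ N(m, s^2).
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s**2)
    return expected_log_lik - kl_normal(m, s, 0.0, 1.0)

x = 1.5
pairs = [(0.0, 1.0), (0.75, math.sqrt(0.5)), (-1.0, 2.0)]
values = [(elbo_posterior_form(x, m, s), elbo_reconstruction_form(x, m, s)) for m, s in pairs]
```

Each pair in \texttt{values} holds the same number computed two ways, which is what the algebra predicts: the decompositions differ only in how the joint $p_{\theta}(x, h)$ is factored.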
Note that everything that we have done so far is completely general and independent of any specific modelling assumptions we may have had to make. Indeed, it is the model independence of this approach which led the authors of \cite{bbvi} to christen it \textbf{black-box variational inference}.
Regardless of which form of the ELBO we use, inference requires the gradients of $\mathcal{L}(x, \theta, \phi)$. Notice that, no matter what, there are expectations with respect to $q_{\phi}$ involved and the presence of these expectations persists into the gradients. Generally speaking, such expectations are intractable integrals for any approximate posterior rich enough to resemble the true posterior. However, the notion of \textbf{normalizing flow} represents a technical innovation which allows the use of flexible posteriors while maintaining tractability, because expectations only ever need to be calculated against simple distributions (e.g., Gaussians).
\section{Variational Inference using Normalizing Flows}
While our main goal is to describe \cite{946paper}, the paper \cite{normalizing_flow} provides the necessary conceptual backdrop for \cite{946paper} and we will now take a detour through some of the points presented in the former. The main contribution of \cite{normalizing_flow} lies in a novel technique for creating a rich class of approximate posteriors starting from relatively simple ones. This is important since one of the main drawbacks of the variational approach is that it requires assumptions on the form of the approximate posterior $q_{\phi}(h\mid x)$, and practicality often forces us to stick to simple distributions which fail to capture rich, multimodal properties of the true posterior $p_{\theta}(h\mid x)$. The primary technical tool used in \cite{normalizing_flow} to achieve complexity in the approximate posterior is what we earlier referred to as normalizing flow, which entails using a series of invertible functions to transform simple probability densities into more complex densities.
Suppose we have a random variable $h$ with probability density $q(h)$ and that $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is an invertible, differentiable function with inverse $g:\mathbb{R}^{d}\to\mathbb{R}^{d}$. A basic result in probability states that the random variable $h'\coloneqq f(h)$ has density $$q'(h') = q(h)\Bigg\lvert\det\frac{\partial f}{\partial h}\Bigg\rvert^{-1},$$ where $h = g(h')$. Chaining together a (finite) sequence of invertible maps $f_{1},\ldots,f_{K}$ and applying it to the distribution $q_{0}$ of a random variable $h_{0}$ leads to the formula $$q_{K}(h_{K}) = q_{0}(h_{0})\prod\limits^{K}_{k=1}\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert^{-1},$$ where $h_{k}\coloneqq f_{k}(h_{k-1})$ and $q_{k}$ is the density associated to $h_{k}$. We can equivalently rewrite the above equation as $$\log q_{K}(h_{K}) = \log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert.$$ So, if we were to start from a simple distribution $q_{0}(h_{0})$, choose a sequence of functions $f_{1},\ldots,f_{K}$ and then \textit{define} $$q_{\phi}(h\mid x)\coloneqq q_{K}(h_{K}),$$ we can manipulate the ELBO as follows: \begin{align*} \mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K}) - \log q_{K}(h_{K})\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right]. \end{align*} The reason this is a useful thing to do is the \textbf{law of the unconscious statistician (LOTUS)} as applied to $h_{K} = f_{K}\circ\cdots\circ f_{1}(h_{0})$: $$\mathbb{E}_{q_{K}}\left[s(h_{K})\right] = \mathbb{E}_{q_{0}}\left[s(f_{K}\circ\cdots\circ f_{1}(h_{0}))\right]$$ for any integrable function $s$; if $s$ does not depend on $q_{K}$, no Jacobian determinants are needed to evaluate the right-hand side.
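The bookkeeping in the formula for $\log q_{K}(h_{K})$ can be checked on a one-dimensional example where the final density is known in closed form. The two maps below, $f_{1}(h) = e^{h}$ and $f_{2}(h) = 3h + 1$, are hypothetical choices made for illustration; applied to $h_{0} \sim \mathcal{N}(0,1)$ they produce a shifted and scaled log-normal variable. In one dimension the Jacobian determinant is simply the derivative.

```python
import math

def log_q0(h):
    """Log density of the base distribution N(0, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * h * h

# Each flow step is a pair (f_k, log|f_k'|); both take h_{k-1} as input.
flow = [
    (math.exp, lambda h: h),                             # f_1(h) = exp(h), log|f_1'(h)| = h
    (lambda h: 3.0 * h + 1.0, lambda h: math.log(3.0)),  # f_2(h) = 3h + 1, log|f_2'(h)| = log 3
]

def push_forward(h0):
    """Return (h_K, log q_K(h_K)) for a base sample h0."""
    h, log_q = h0, log_q0(h0)
    for f, log_abs_det in flow:
        log_q -= log_abs_det(h)  # subtract log|det df_k/dh_{k-1}|, evaluated at h_{k-1}
        h = f(h)
    return h, log_q

def log_shifted_lognormal(y):
    """Closed-form log density of y = 3 * exp(h0) + 1 with h0 ~ N(0, 1)."""
    z = (y - 1.0) / 3.0
    return (-math.log(z) - 0.5 * math.log(2 * math.pi)
            - 0.5 * math.log(z) ** 2 - math.log(3.0))

checks = [push_forward(h0) for h0 in (-1.2, 0.0, 0.7, 2.0)]
```

For every base sample, the log density accumulated along the flow agrees with the closed-form density evaluated at the transformed point, without ever inverting the maps.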
Hence, \begin{align*} \mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\ &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{K}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\ &= \mathbb{E}_{q_{0}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{0}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{0}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right], \end{align*} where in the last line $h_{K} = f_{K}\circ\cdots\circ f_{1}(h_{0})$, and we are only computing expectations with respect to the simple distribution $q_{0}$. Up to an overall sign (the free energy is the negative of the ELBO), this last expression is called the \textbf{flow-based free energy bound} in \cite{normalizing_flow}. Note that $\phi$ has apparently disappeared in the final expression even though it is still present in $\mathcal{L}(x,\theta,\phi)$. This is an illusion: the parameters $\phi$ are associated to $q_{\phi}(h\mid x) = q_{K}(h_{K})$ and $q_{K}(h_{K})$ depends on the quantities $q_{0}(h_{0})$ and $\frac{\partial f_{k}}{\partial h_{k-1}}$. Thus, $\phi$ now encapsulates the defining parameters of $q_{0}$ and the $f_{k}$.
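To see the change of measure at work, the sketch below estimates the ELBO using samples from $q_{0}$ only, with $\log q_{K}(h_{K})$ obtained from $\log q_{0}(h_{0})$ minus the log-determinant of the flow. Everything specific here is an assumption made for illustration: a toy model with $h \sim \mathcal{N}(0,1)$ and $x \mid h \sim \mathcal{N}(h,1)$, a base density $q_{0} = \mathcal{N}(0,1)$, and a single affine flow step $f_{1}(h_{0}) = m + s\,h_{0}$, for which the ELBO is also available in closed form.

```python
import math
import random

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def flow_elbo(x, m, s, n_samples=50_000, seed=0):
    """Estimate the ELBO using samples from q_0 = N(0, 1) only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        h0 = rng.gauss(0.0, 1.0)  # sample from the simple base distribution
        h1 = m + s * h0           # f_1(h_0); its Jacobian is the scalar s
        # log q_1(h_1) = log q_0(h_0) - log|det df_1/dh_0|
        log_q1 = log_normal(h0, 0.0, 1.0) - math.log(abs(s))
        log_joint = log_normal(h1, 0.0, 1.0) + log_normal(x, h1, 1.0)
        total += log_joint - log_q1
    return total / n_samples

def exact_elbo(x, m, s):
    """Closed-form ELBO for q = N(m, s^2) in this toy model."""
    expected_log_joint = (-math.log(2 * math.pi)
                          - 0.5 * (m**2 + s**2)
                          - 0.5 * ((x - m) ** 2 + s**2))
    entropy = 0.5 * math.log(2 * math.pi * math.e) + math.log(abs(s))
    return expected_log_joint + entropy

estimate = flow_elbo(1.5, m=0.5, s=0.8)
reference = exact_elbo(1.5, m=0.5, s=0.8)
```

The estimate never samples from, or evaluates, $q_{1}$ directly; only $q_{0}$ and the derivative of $f_{1}$ are needed, which is the point of the construction.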
To summarize, we can start with a very simple distribution $q_{0}$ against which expectations are easy to calculate and if we can cleverly choose a series of invertible functions $\{f_{k}\}^{K}_{k=1}$ for which it is easy to compute determinants of the Jacobians, we can get a relatively rich approximate posterior $q_{K}$ such that the ELBO and its gradients are tractable. We should now ask: what is a suitable family of functions which can serve as a normalizing flow?
\section{Inverse Autoregressive Flow}
The answer presented by \cite{946paper} to the last question is \[ f_{k}(h_{k-1}) \coloneqq \mu_{k} + \sigma_{k} \odot h_{k-1}, \] where $\odot$ means element-wise multiplication, $\mu_{k}$ and $\sigma_{k}$ are outputs from an autoregressive neural network with inputs $h_{k-1}$ and an extra constant vector $c$, and we initialize with \[ h_{0} \coloneqq \mu_{0} + \sigma_{0} \odot \epsilon \] such that $\epsilon \sim \mathcal{N}(0, I)$. We will solve the mystery of where this definition comes from later. The important point is that the functional form of $f_{k}$ is parametrized by the outputs of an autoregressive neural network, which means that the $i$-th components of $\mu_{k}$ and $\sigma_{k}$ depend only on the components of $h_{k-1}$ with index less than $i$. This implies that the Jacobians \[ \frac{d \mu_{k}}{d h_{k-1}}, \frac{d \sigma_{k}}{d h_{k-1}} \] are triangular with zeroes on the diagonal. Hence, the derivative \[ \frac{d f_{k}}{d h_{k-1}} \] is a triangular matrix with the entries of $\sigma_{k}$ occupying the diagonal. Assuming the entries of $\sigma_{k}$ are positive, the determinant of this matrix is just \[ \prod\limits^{D}_{i=1} \sigma_{k, i}, \] where $D$ is the dimension of $h$, and it is very cheap to compute. Note also that the approximate posterior that comes out of this normalizing flow is \[ \log q_{K}(h_{K}) = - \sum\limits^{D}_{i=1} \left[ \frac{1}{2} \epsilon_{i}^{2} + \frac{1}{2} \log 2\pi + \sum\limits^{K}_{k=0} \log \sigma_{k,i} \right], \] where the inner sum starts at $k=0$ because the initialization $h_{0} = \mu_{0} + \sigma_{0} \odot \epsilon$ contributes its own factor $\prod_{i} \sigma_{0,i}$. In conclusion, we have a simple expression for the ELBO. For this expression to be useful in practice, we also need an inference model which is computationally cheap to evaluate. Additionally, we typically do not calculate the ELBO gradients analytically but instead perform Monte Carlo estimation by sampling from the inference model. Since expectations are calculated only with respect to the simple initial distribution $q_{0}$, both of these requirements are easily satisfied.
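The triangular structure of the Jacobian can be verified numerically on a small example. The sketch below is not the architecture of \cite{946paper}: it assumes a toy ``autoregressive network'' consisting of strictly lower-triangular linear maps, it obtains $\sigma$ by exponentiating the network output so that it is positive, and it omits the constant vector $c$. The weights \texttt{W\_MU}, \texttt{W\_S}, \texttt{B\_MU}, \texttt{B\_S} are hypothetical values chosen for illustration.

```python
import math

D = 3
# Hypothetical strictly lower-triangular weights: output i only sees inputs j < i.
W_MU = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [-0.3, 0.8, 0.0]]
W_S = [[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.1, -0.4, 0.0]]
B_MU = [0.1, -0.2, 0.3]
B_S = [0.0, 0.5, -0.5]

def autoregressive_net(h):
    """Toy autoregressive map: mu_i and sigma_i depend only on h_j with j < i."""
    mu = [B_MU[i] + sum(W_MU[i][j] * h[j] for j in range(i)) for i in range(D)]
    # Assumption: exponentiate to keep sigma positive.
    sigma = [math.exp(B_S[i] + sum(W_S[i][j] * h[j] for j in range(i))) for i in range(D)]
    return mu, sigma

def iaf_step(h):
    """Return (f(h), log|det df/dh|) for f(h) = mu + sigma * h (element-wise)."""
    mu, sigma = autoregressive_net(h)
    h_new = [mu[i] + sigma[i] * h[i] for i in range(D)]
    return h_new, sum(math.log(s) for s in sigma)

def numerical_jacobian(h, eps=1e-6):
    """Central finite-difference Jacobian J[i][j] = d f_i / d h_j."""
    jac = [[0.0] * D for _ in range(D)]
    for j in range(D):
        up = list(h)
        up[j] += eps
        down = list(h)
        down[j] -= eps
        f_up, _ = iaf_step(up)
        f_down, _ = iaf_step(down)
        for i in range(D):
            jac[i][j] = (f_up[i] - f_down[i]) / (2 * eps)
    return jac

h = [0.3, -1.1, 0.7]
h_new, log_det = iaf_step(h)
mu, sigma = autoregressive_net(h)
jac = numerical_jacobian(h)
```

The finite-difference Jacobian has zeros above the diagonal and the entries of $\sigma$ on the diagonal, so the log-determinant returned by \texttt{iaf\_step} costs only a sum of $D$ logarithms rather than a full determinant computation.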
\section{Inverse Autoregressive Transformations or, Whence Inverse Autoregressive Flow?}
\section{Concluding Remarks}