STAT946F17/ Improved Variational Inference with Inverse Autoregressive Flow

== Introduction ==

One of the most common ways to formalize machine learning models is through the use of '''latent variable models''', wherein we have a probabilistic model for the joint distribution between observed datapoints $x$ and some "hidden" variables $h$. The intuition is that the hidden variables share some sort of (perhaps convoluted) causal relationship with the variables that are actually observed. For instance, if we generate a number using a standard pseudo-random number generator, then we only get to observe the number that the computer outputs; what we do not get to see is the process whereby that number is obtained, even though there is of course a deterministic computation taking place under the covers. Latent variable models provide a very general and powerful framework for mathematically encoding a variety of phenomena which are naturally subject to stochasticity, and thus they form an important part of the theory underlying many machine learning models. Indeed, it can even be said that most machine learning models, when viewed appropriately, are latent variable models. It behoves us, therefore, to obtain general methods which allow tractable inference within latent variable models. One such method is known as '''variational inference''', and it was introduced in its modern form around two decades ago in the seminal paper [https://link.springer.com/article/10.1023%2FA%3A1008932416310]. In the present day, variational inference is a large and active area of research, and we will only develop the small part of it necessary for our purposes.

== Black-box Variational Inference ==

The basic premise we start from is that we have a latent variable model $p_{\theta}(x, h)$, with $x$ the observed variables and $h$ the hidden variables, and we wish to learn the parameters $\theta$. We also assume we are in a situation where the usual strategy of inference by maximum likelihood estimation is infeasible due to the intractability of marginalizing out the hidden variables. Additionally, we would like to be able to compute the posterior $p(h\mid x)$ over the hidden variables. The variational inference approach entails positing a parametric family $q_{\phi}(h\mid x)$ of distributions and introducing new learning parameters $\phi$ which are obtained as solutions to an optimization problem. More precisely, we minimize the KL divergence between the true posterior and the approximate posterior. However, there is a slightly indirect approach we can take: we can find a generic lower bound for the log-likelihood $\log p_{\theta}(x)$ and optimize this lower bound. Observe that, for any parametrized distribution $q_{\phi}(h\mid x)$, we have
\begin{align*}
\log p_{\theta}(x) &= \log\int_{h}p_{\theta}(x,h) \\
&= \log\int_{h}p_{\theta}(x,h)\frac{q_{\phi}(h\mid x)}{q_{\phi}(h\mid x)} \\
&= \log\int_{h}\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}q_{\phi}(h\mid x) \\
&= \log\mathbb{E}_{q}\left[\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\
&\geq \mathbb{E}_{q}\left[\log\frac{p_{\theta}(x,h)}{q_{\phi}(h\mid x)}\right] \\
&= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\
&\coloneqq \mathcal{L}(x, \theta, \phi),
\end{align*}
where the inequality is an application of Jensen's inequality for the logarithm function and $\mathcal{L}(x,\theta,\phi)$ is known as the ''evidence lower bound (ELBO)''. Clearly, if we iteratively choose values for $\theta$ and $\phi$ such that $\mathcal{L}(x,\theta,\phi)$ increases, then we will have found values of $\theta$ for which the log-likelihood $\log p_{\theta}(x)$ is non-decreasing (that is, there is no guarantee that a value of $\theta$ which increases $\mathcal{L}(x,\theta,\phi)$ will also increase $\log p_{\theta}(x)$, but there ''is'' a guarantee that $\log p_{\theta}(x)$ will not decrease). The natural search strategy now is to use stochastic gradient ascent on $\mathcal{L}(x,\theta,\phi)$. This requires the derivatives $\nabla_{\theta}\mathcal{L}(x,\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$. At this point, authors often rewrite the ELBO in whichever form is best suited to their work. However, the succeeding steps are typically the same:

# Derive formulae for $\nabla_{\theta}\mathcal{L}(x,\theta,\phi)$ and $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$ as expectations with respect to $q$.
# Perform Monte Carlo approximations of the expectations obtained in the first step by sampling from the approximate posterior.
# Remark that the resulting gradient estimator for $\nabla_{\phi}\mathcal{L}(x,\theta,\phi)$ is very noisy.
# Outline variance reduction methods for the latter.

Everything that we have done so far, except for step (4) above, is completely general and independent of any specific modelling assumptions we may have had to make. Indeed, it is the model independence of this approach which led the authors of [Ranganath et al., 2014] to christen it ''black-box'' variational inference. However, modelling assumptions are the toll we have to pay in order to cross the bridge leading towards a practical and usable Monte Carlo estimator for $\nabla_{\phi}\mathcal{L}(x, \theta, \phi)$.
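To make steps (1)–(3) concrete, here is a minimal numerical sketch. It is our own illustration rather than code from any of the papers discussed here: the one-dimensional Gaussian toy model, the Gaussian variational family, and the name <code>elbo_and_score_grad</code> are all assumptions made purely for the example.

<pre>
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption for illustration only):
#   p_theta(h)     = N(h; 0, 1)
#   p_theta(x | h) = N(x; theta * h, 1)
# Approximate posterior: q_phi(h | x) = N(h; mu, sigma^2), phi = (mu, log_sigma).

def log_normal(z, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((z - mean) / std) ** 2

def log_joint(x, h, theta):
    return log_normal(h, 0.0, 1.0) + log_normal(x, theta * h, 1.0)

def elbo_and_score_grad(x, theta, mu, log_sigma, n_samples=1000):
    """Monte Carlo ELBO and a score-function estimate of grad_phi ELBO."""
    sigma = np.exp(log_sigma)
    h = mu + sigma * rng.standard_normal(n_samples)   # h ~ q_phi(h | x)
    log_w = log_joint(x, h, theta) - log_normal(h, mu, sigma)
    elbo = log_w.mean()                               # step (2): MC average
    # Score function grad_phi log q_phi(h | x) for this Gaussian family:
    score_mu = (h - mu) / sigma ** 2
    score_log_sigma = ((h - mu) / sigma) ** 2 - 1.0
    # Steps (1)-(2): grad_phi ELBO = E_q[score * log_w], estimated by MC
    # (the remaining term, -E_q[grad_phi log q_phi], vanishes in expectation).
    return elbo, (score_mu * log_w).mean(), (score_log_sigma * log_w).mean()

print(elbo_and_score_grad(x=2.0, theta=1.0, mu=0.0, log_sigma=0.0))
</pre>

Rerunning the last line with different seeds or a small <code>n_samples</code> makes the noisiness remarked upon in step (3) plainly visible, which is exactly what motivates the variance reduction methods of step (4).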

== Variational Inference using Normalizing Flows ==

The point of departure for [Rezende and Mohamed, 2015] is exactly the same as for [Kingma and Welling, 2014], including the use of a recognition model for the approximate posterior $q_{\phi}(h\mid x)$. The main contribution of the former, however, lies in a novel technique for creating a rich class of approximate posteriors starting from relatively simple ones. This is important since one of the main drawbacks of the variational approach is that it requires assumptions on the form of the approximate posterior $q_{\phi}(h\mid x)$, and practicality often forces us to stick to simple distributions which fail to capture rich, multimodal properties of the true posterior $p(h\mid x)$. The primary technical tool used in [Rezende and Mohamed, 2015] to achieve complexity in the approximate posterior is what is known as a ''normalizing flow'', which entails using a series of invertible functions to transform simple probability densities into more complex ones.

Suppose we have a random variable $h$ with probability density $q(h)$, and that $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is an invertible function with inverse $g:\mathbb{R}^{d}\to\mathbb{R}^{d}$. A basic result in probability states that the random variable $h'\coloneqq f(h)$ has density
$$q'(h') = q(h)\Bigg\lvert\det\frac{\partial f}{\partial h}\Bigg\rvert^{-1}.$$
Chaining together a (finite) sequence of invertible maps $f_{1},\ldots,f_{K}$ and applying it to the distribution $q_{0}$ of a random variable $h_{0}$ leads to the formula
$$q_{K}(h_{K}) = q_{0}(h_{0})\prod\limits^{K}_{k=1}\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert^{-1},$$
where $h_{k}\coloneqq f_{k}(h_{k-1})$ and $q_{k}$ is the distribution associated to $h_{k}$. We can equivalently rewrite the above equation as
$$\log q_{K}(h_{K}) = \log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert.$$
So, if we were to start from a simple distribution $q_{0}(h_{0})$, choose a sequence of functions $f_{1},\ldots,f_{K}$ and then ''define''
$$q_{\phi}(h\mid x)\coloneqq q_{K}(h_{K}),$$
we can manipulate the ELBO as follows:
\begin{align*}
\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q}\left[\log p_{\theta}(x,h) - \log q_{\phi}(h\mid x)\right] \\
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K}) - \log q_{K}(h_{K})\right] \\
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right].
\end{align*}
The reason this is a useful thing to do is the ''law of the unconscious statistician (LOTUS)'' as applied to $h_{K} = f_{K}\circ\cdots\circ f_{1}(h_{0})$:
$$\mathbb{E}_{q_{K}}\left[s(h_{K})\right] = \mathbb{E}_{q_{0}}\left[s(f_{K}\circ\cdots\circ f_{1}(h_{0}))\right],$$
assuming that $s$ does not depend on $q_{K}$. Hence,
\begin{align*}
\mathcal{L}(x,\theta,\phi) &= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{K}(h_{K})\right] \\
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0}) - \sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\
&= \mathbb{E}_{q_{K}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{K}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{K}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right] \\
&= \mathbb{E}_{q_{0}}\left[\log p_{\theta}(x,h_{K})\right] - \mathbb{E}_{q_{0}}\left[\log q_{0}(h_{0})\right] + \mathbb{E}_{q_{0}}\left[\sum\limits^{K}_{k=1}\log\Bigg\lvert\det\frac{\partial f_{k}}{\partial h_{k-1}}\Bigg\rvert\right]
\end{align*}
(in the last line, $h_{K}$ and the intermediate $h_{k}$ are understood as deterministic functions of $h_{0}$, so that LOTUS applies), and we are only computing expectations with respect to the simple distribution $q_{0}$. The latter expression for the ELBO is called the ''flow-based free energy bound'' in [Rezende and Mohamed, 2015]. Note that $\phi$ has apparently disappeared from the final expression even though it is still present in $\mathcal{L}(x,\theta,\phi)$. This is an illusion: the parameters $\phi$ are associated to $q_{\phi}(h\mid x) = q_{K}(h_{K})$, and $q_{K}(h_{K})$ depends on the quantities $q_{0}(h_{0})$ and $\frac{\partial f_{k}}{\partial h_{k-1}}$. Thus, $\phi$ now encapsulates the defining parameters of $q_{0}$ and of the $f_{k}$. An example below will help clarify this point.

As we have established by now, it is not enough to have a lower bound on the log-likelihood. We must be able to estimate gradients for the lower bound in practice.
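Before turning to gradient estimation, here is the example promised above: a minimal sketch of a normalizing flow built from the planar maps used as the running example in [Rezende and Mohamed, 2015]. This is our own illustration, not reference code: the latent dimension, flow length, random parameter values, and the name <code>planar_step</code> are arbitrary choices, and we omit the invertibility constraint that a real implementation would enforce.

<pre>
import numpy as np

rng = np.random.default_rng(0)
d, K = 2, 4  # latent dimension and flow length (arbitrary)

# One planar step f(h) = h + u * tanh(w.h + b), whose Jacobian determinant
# is det(df/dh) = 1 + (u.w) * tanh'(w.h + b), a single scalar per sample.
# (Invertibility requires u.w >= -1; that reparametrization is omitted here.)
def planar_step(h, u, w, b):
    lin = h @ w + b                        # shape (n,)
    h_new = h + np.outer(np.tanh(lin), u)  # f_k(h_{k-1}), shape (n, d)
    psi = 1.0 - np.tanh(lin) ** 2          # tanh'(lin)
    log_det = np.log(np.abs(1.0 + psi * (u @ w)))
    return h_new, log_det

# phi = parameters of q_0 (here fixed to N(0, I)) plus all (u, w, b) triples.
params = [(0.1 * rng.normal(size=d), 0.1 * rng.normal(size=d), 0.0)
          for _ in range(K)]

n = 5
h = rng.standard_normal((n, d))            # h_0 ~ q_0 = N(0, I)
log_q = -0.5 * (d * np.log(2 * np.pi) + (h ** 2).sum(axis=1))  # log q_0(h_0)

for (u, w, b) in params:                   # h_K = f_K o ... o f_1(h_0)
    h, log_det = planar_step(h, u, w, b)
    log_q -= log_det                       # log q_k = log q_{k-1} - log|det|

print(h)      # samples from q_K
print(log_q)  # log q_K(h_K) at those samples
</pre>

In a variational autoencoder setting, the per-step parameters $(u, w, b)$, along with the mean and variance of $q_{0}$, would themselves be produced by the recognition model as functions of $x$; this is precisely the sense in which $\phi$ encapsulates the defining parameters of $q_{0}$ and of the $f_{k}$.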