# A Deeper Look into Importance Sampling

### A Deeper Look into Importance Sampling - June 2, 2009

From last class, we determined that an integral $I = \displaystyle\int h(x)f(x)\,dx$ can be rewritten as $I = \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx$ for any density $\displaystyle g$ that is positive wherever $\displaystyle h(x)f(x) \neq 0$. We continue our discussion of Importance Sampling here.

#### Importance Sampling

We can see that the integral $\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x)\, g(x)\,dx$ is just $\displaystyle E_g\!\left[\frac{f(x)}{g(x)}h(x)\right]$, the expectation with respect to $\displaystyle g$ of $\displaystyle h(x)$ multiplied by the weight $\displaystyle\beta(x) = \frac{f(x)}{g(x)}$. Where $\displaystyle f(x) \gt g(x)$, a greater weight $\displaystyle\beta(x)$ is assigned. Points with more weight are deemed more important, hence "importance sampling". This can also be seen as a variance reduction technique.

##### Problem

The method of Importance Sampling is simple, but it can lead to problems: the estimator $\displaystyle \hat I$ produced by Importance Sampling can have infinite variance, and hence infinite standard error.

Write $\displaystyle I= \int w(x)\, g(x)\, dx = E_g[w(x)]$, where $\displaystyle w(x)=\frac{f(x)h(x)}{g(x)}$, and estimate it by $\displaystyle \hat I = \frac{1}{N}\sum_{i=1}^{N} w(x_i)$ with $\displaystyle x_1, \dots, x_N$ drawn from $\displaystyle g$.
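
As a concrete illustration (our own example, not from the lecture), here is a minimal Python sketch of this estimator. We assume $\displaystyle f$ is the standard normal density, $\displaystyle h(x) = x^2$ (so the true value is $\displaystyle I = 1$), and use a heavier-tailed Student-$t$ proposal for $\displaystyle g$; all of these choices and names are illustrative.

```python
import numpy as np
from scipy import stats

# Minimal importance-sampling sketch (illustrative choices throughout):
# estimate I = \int h(x) f(x) dx with f = N(0,1) and h(x) = x^2,
# so the true value is E_f[X^2] = 1.
np.random.seed(0)
N = 100_000

f = stats.norm(0, 1)     # target density f
g = stats.t(df=3)        # proposal g with heavier tails than f
h = lambda x: x**2       # integrand h

x = g.rvs(size=N)                 # x_1, ..., x_N ~ g
w = h(x) * f.pdf(x) / g.pdf(x)    # w(x_i) = h(x_i) f(x_i) / g(x_i)

I_hat = w.mean()                  # \hat I = (1/N) * sum of w(x_i)
se_hat = w.std(ddof=1) / np.sqrt(N)
print(f"I_hat = {I_hat:.4f} +/- {se_hat:.4f}   (true value: 1)")
```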

Obtaining the second moment,

$\displaystyle E[(w(x))^2]$
$\displaystyle = \int \left(\frac{h(x)f(x)}{g(x)}\right)^2 g(x)\, dx$
$\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx$
$\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx$

We can see that if $\displaystyle g(x) \rightarrow 0$ faster than $\displaystyle h^2(x)f^2(x)$, then $\displaystyle E[(w(x))^2] \rightarrow \infty$. In particular, if $\displaystyle g$ has a thinner tail than $\displaystyle f$, the ratio $\displaystyle\frac{h^2(x)f^2(x)}{g(x)}$ can become arbitrarily large in the tails. The general idea is that $\displaystyle\frac{f(x)}{g(x)}$ should not be allowed to become large.
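
To see this numerically, here is a small sketch (again our own toy example, not from the lecture) with the roles reversed relative to the sketch above: $\displaystyle f$ is a Student-$t$ with 3 degrees of freedom, $\displaystyle g$ is standard normal (thinner tails), and $\displaystyle h(x) = x^2$, so the true value is $\displaystyle E_f[X^2] = 3$ but $\displaystyle E_g[(w(x))^2]$ is infinite.

```python
import numpy as np
from scipy import stats

# Roles reversed: f (Student-t, 3 df) now has heavier tails than g (N(0,1)),
# so f(x)/g(x) grows without bound and E_g[(w(x))^2] diverges.
np.random.seed(1)
f = stats.t(df=3)        # target density f (heavy tails)
g = stats.norm(0, 1)     # proposal g (thin tails)
h = lambda x: x**2       # true value: E_f[X^2] = 3

for run in range(5):
    x = g.rvs(size=100_000)
    w = h(x) * f.pdf(x) / g.pdf(x)
    print(f"run {run}: I_hat = {w.mean():.3f}, largest weight = {w.max():.1f}")
# The runs disagree and tend to undershoot 3: the tail region, which carries
# much of the mass of h(x)f(x), is almost never reached under g, and when it
# is, a single huge weight dominates the average.
```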

##### Remark 1

It is evident that $\displaystyle g(x)$ should be chosen to have a thicker tail than $\displaystyle f(x)$. If $\displaystyle f$ is large over a set $\displaystyle A$ on which $\displaystyle g$ is small, then $\displaystyle \frac{f}{g}$ is large on $\displaystyle A$ and the variance becomes large.

##### Remark 2

It is useful to choose $\displaystyle g$ similar to $\displaystyle f$ in terms of shape. Ideally, $\displaystyle g$ should be similar in shape to $\displaystyle \left| h(x) \right|f(x)$, and have a thicker tail. Analytically, we can derive the choice of $\displaystyle g$ that minimizes the variance of $\displaystyle \hat I$ (see the theorem below).

##### Remark 3

Choose $\displaystyle g$ to be similar to $\displaystyle \left| h(x) \right| f(x)$ in terms of shape; that is, we want $\displaystyle g(x) \propto \left| h(x) \right| f(x)$.

#### Theorem (Minimum Variance Choice of $\displaystyle g$)

The choice of $\displaystyle g$ that minimizes variance of $\hat I$ is $\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}$.

##### Proof:

We know that $\displaystyle w(x)=\frac{f(x)h(x)}{g(x)}$

The variance of $\displaystyle w(x)$ is

$\displaystyle Var[w(x)]$
$\displaystyle = E[(w(x))^2] - \left(E[w(x)]\right)^2$
$\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2$
$\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x)\, dx - \left[\int f(x)h(x)\, dx \right]^2$

As we can see, the second term does not depend on $\displaystyle g(x)$. Therefore to minimize $\displaystyle Var[w(x)]$ we only need to minimize the first term. In doing so we will use Jensen's Inequality.

###### Aside: Jensen's Inequality

If $\displaystyle g$ is a convex function (e.g. twice differentiable with $\displaystyle g''(x) \geq 0$; here $\displaystyle g$ denotes a generic convex function, not the proposal density), then $\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)$ for $\displaystyle 0 \leq \alpha \leq 1$.
Essentially, convexity means that the chord joining two points on the curve lies above the curve. This extends from two points to a convex combination of $\displaystyle n$ points:

$\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n)$ where $\displaystyle \alpha_i \geq 0$ and $\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1$.
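
For a quick concrete check (our own example, not from the lecture), take the convex function $\displaystyle g(x) = x^2$ with $\displaystyle \alpha_1 = \alpha_2 = \tfrac{1}{2}$, $\displaystyle x_1 = 0$, $\displaystyle x_2 = 2$: the left side is $\displaystyle g(1) = 1$ and the right side is $\displaystyle \tfrac{1}{2}g(0) + \tfrac{1}{2}g(2) = 2$, so the inequality holds.
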
##### Proof (cont)

Using Jensen's Inequality,

$\displaystyle g(E[X]) \leq E[g(X)]$, since for a discrete random variable taking values $\displaystyle x_1, \dots, x_n$ with probabilities $\displaystyle p_1, \dots, p_n$, $\displaystyle g(E[X]) = g(p_1 x_1 + \dots + p_n x_n) \leq p_1 g(x_1) + \dots + p_n g(x_n) = E[g(X)]$

Therefore, applying this with the convex function $\displaystyle t \mapsto t^2$ to the random variable $\displaystyle \left| w(x) \right|$ (noting that $\displaystyle (w(x))^2 = \left| w(x) \right|^2$),

$\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2$
$\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2$

and

$\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x)\, dx \right)^2$
$\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)}\, g(x)\, dx \right)^2$ since $\displaystyle f$ and $\displaystyle g$ are density functions and hence non-negative
$\displaystyle = \left(\int \left| h(x) \right| f(x)\, dx \right)^2$

Thus, $\displaystyle \left(\int \left| h(x) \right| f(x)\, dx \right)^2$ is a lower bound on $\displaystyle E[(w(x))^2]$ that does not depend on $\displaystyle g$. Substituting $\displaystyle g^*$ for $\displaystyle g$ gives $\displaystyle w(x) = \frac{f(x)h(x)}{g^*(x)} = \mathrm{sign}(h(x)) \int \left| h(s) \right| f(s)\, ds$, so $\displaystyle E[(w(x))^2] = \left(\int \left| h(x) \right| f(x)\, dx \right)^2$ and the lower bound is attained. Hence $\displaystyle g^*$ minimizes the variance of $\displaystyle \hat I$.

However, this is mostly of theoretical interest. In practice it is impossible or very difficult to sample from $\displaystyle g^*$, since its normalizing constant $\displaystyle \int \left| h(s) \right| f(s)\, ds$ is essentially the integral we are trying to compute in the first place.
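
In a toy case where $\displaystyle g^*$ happens to be tractable, though, the theorem can be checked directly. The sketch below (our own example, not from the lecture) takes $\displaystyle h(x) = x^2$ and $\displaystyle f$ the standard normal density $\displaystyle \varphi$, so $\displaystyle g^*(x) = x^2\varphi(x)$; this is the density of $\displaystyle \pm\sqrt{Y}$ with $\displaystyle Y \sim \chi^2_3$, which we can sample, and every weight then equals the true integral $\displaystyle I = 1$.

```python
import numpy as np
from scipy import stats

# Toy case with h(x) = x^2 and f = N(0,1):
#   g*(x) = |h(x)| f(x) / \int |h(s)| f(s) ds = x^2 * phi(x),
# since \int s^2 phi(s) ds = 1.  If Y ~ chi-square(3) and S = +/-1 with equal
# probability, then X = S*sqrt(Y) has exactly this density, so we can sample g*.
np.random.seed(2)
N = 100_000
f = stats.norm(0, 1)
h = lambda x: x**2

y = stats.chi2(df=3).rvs(size=N)
s = np.random.choice([-1.0, 1.0], size=N)
x = s * np.sqrt(y)                       # x_1, ..., x_N ~ g*

g_star = lambda t: t**2 * f.pdf(t)       # already normalized
w = h(x) * f.pdf(x) / g_star(x)          # every weight equals 1 exactly

print("I_hat =", w.mean(), "  variance of weights =", w.var())
# Each w(x_i) equals the true integral I = 1, so the estimator has zero
# variance, as the theorem predicts when h >= 0.
```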

Note: Jensen's inequality is actually unnecessary here. We just use it to get $E[(w(x))^2] \geq (E[|w(x)|])^2$, which could be derived using variance properties: $0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2$.