a Deeper Look into Importance Sampling

From statwiki
Revision as of 02:01, 9 June 2009 by H5choi (talk | contribs) (Added further notes to Remark 2)
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

A Deeper Look into Importance Sampling - June 2, 2009

From last class, we have determined that an integral can be written in the form [math]\displaystyle{ I = \displaystyle\int h(x)f(x)\,dx }[/math] [math]\displaystyle{ = \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx }[/math] We continue our discussion of Importance Sampling here.

Importance Sampling

We can see that the integral [math]\displaystyle{ \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx }[/math] is just [math]\displaystyle{ \displaystyle E_g(h(x)) \rightarrow }[/math]the expectation of h(x) with respect to g(x), where [math]\displaystyle{ \displaystyle \frac{f(x)}{g(x)} }[/math] is a weight [math]\displaystyle{ \displaystyle\beta(x) }[/math]. In the case where [math]\displaystyle{ \displaystyle f \gt g }[/math], a greater weight for [math]\displaystyle{ \displaystyle\beta(x) }[/math] will be assigned. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.

Problem

The method of Importance Sampling is simple but can lead to some problems. The [math]\displaystyle{ \displaystyle \hat I }[/math] estimated by Importance Sampling could have infinite standard error.

Given [math]\displaystyle{ \displaystyle I= \int w(x) g(x) dx }[/math] [math]\displaystyle{ = \displaystyle E_g(w(x)) }[/math] [math]\displaystyle{ = \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) }[/math] where [math]\displaystyle{ \displaystyle w(x)=\frac{f(x)h(x)}{g(x)} }[/math].

Obtaining the second moment,

[math]\displaystyle{ \displaystyle E[(w(x))^2] }[/math]
[math]\displaystyle{ \displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx }[/math]
[math]\displaystyle{ \displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx }[/math]
[math]\displaystyle{ \displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx }[/math]

We can see that if [math]\displaystyle{ \displaystyle g(x) \rightarrow 0 }[/math], then [math]\displaystyle{ \displaystyle E[(w(x))^2] \rightarrow \infty }[/math]. This occurs if [math]\displaystyle{ \displaystyle g }[/math] has a thinner tail than [math]\displaystyle{ \displaystyle f }[/math] then [math]\displaystyle{ \frac{h^2(x)f^2(x)}{g(x)} }[/math] could be infinitely large. The general idea here is that [math]\displaystyle{ \frac{f(x)}{g(x)} }[/math] should not be large.

Remark 1

It is evident that [math]\displaystyle{ \displaystyle g(x) }[/math] should be chosen such that it has a thicker tail than [math]\displaystyle{ \displaystyle f(x) }[/math]. If [math]\displaystyle{ \displaystyle f }[/math] is large over set [math]\displaystyle{ \displaystyle A }[/math] but [math]\displaystyle{ \displaystyle g }[/math] is small, then [math]\displaystyle{ \displaystyle \frac{f}{g} }[/math] would be large and it would result in a large variance.

Remark 2

It is useful if we can choose [math]\displaystyle{ \displaystyle g }[/math] to be similar to [math]\displaystyle{ \displaystyle f }[/math] in terms of shape. Ideally, the optimal [math]\displaystyle{ \displaystyle g }[/math] should be similar to [math]\displaystyle{ \displaystyle \left| h(x) \right|f(x) }[/math], and have a thicker tail. Analytically, we can show that the best [math]\displaystyle{ \displaystyle g }[/math] is the one that would result in a variance that is minimized.

Remark 3

Choose [math]\displaystyle{ \displaystyle g }[/math] such that it is similar to [math]\displaystyle{ \displaystyle \left| h(x) \right| f(x) }[/math] in terms of shape. That is, we want [math]\displaystyle{ \displaystyle g \propto \displaystyle \left| h(x) \right| f(x) }[/math]


Theorem (Minimum Variance Choice of [math]\displaystyle{ \displaystyle g }[/math])

The choice of [math]\displaystyle{ \displaystyle g }[/math] that minimizes variance of [math]\displaystyle{ \hat I }[/math] is [math]\displaystyle{ \displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds} }[/math].

Proof:

We know that [math]\displaystyle{ \displaystyle w(x)=\frac{f(x)h(x)}{g(x)} }[/math]

The variance of [math]\displaystyle{ \displaystyle w(x) }[/math] is

[math]\displaystyle{ \displaystyle Var[w(x)] }[/math]
[math]\displaystyle{ \displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 }[/math]
[math]\displaystyle{ \displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 }[/math]
[math]\displaystyle{ \displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2 }[/math]

As we can see, the second term does not depend on [math]\displaystyle{ \displaystyle g(x) }[/math]. Therefore to minimize [math]\displaystyle{ \displaystyle Var[w(x)] }[/math] we only need to minimize the first term. In doing so we will use Jensen's Inequality.

Aside: Jensen's Inequality

If [math]\displaystyle{ \displaystyle g }[/math] is a convex function ( twice differentiable and [math]\displaystyle{ \displaystyle g''(x) \geq 0 }[/math] ) then [math]\displaystyle{ \displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2) }[/math]
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:

[math]\displaystyle{ \displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) }[/math] where [math]\displaystyle{ \alpha_1 + \alpha_2 + ... + \alpha_n = 1 }[/math]
Proof (cont)

Using Jensen's Inequality,

[math]\displaystyle{ \displaystyle g(E[x]) \leq E[g(x)] }[/math] as [math]\displaystyle{ \displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \lt = p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] }[/math]

Therefore

[math]\displaystyle{ \displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 }[/math]
[math]\displaystyle{ \displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 }[/math]

and

[math]\displaystyle{ \displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 }[/math]
[math]\displaystyle{ \displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 }[/math]
[math]\displaystyle{ \displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 }[/math] since [math]\displaystyle{ \displaystyle f }[/math] and [math]\displaystyle{ \displaystyle g }[/math] are density functions, [math]\displaystyle{ \displaystyle f, g }[/math] cannot be negative.

Thus, this is a lower bound on [math]\displaystyle{ \displaystyle E[(w(x))^2] }[/math]. If we replace [math]\displaystyle{ \displaystyle g^*(x) }[/math] into [math]\displaystyle{ \displaystyle E[g^*(x)] }[/math], we can see that the result is as we require. Details omitted.

However, this is mostly of theoritical interest. In practice, it is impossible or very difficult to compute [math]\displaystyle{ \displaystyle g^* }[/math].

Note: Jensen's inequality is actually unnecessary here. We just use it to get [math]\displaystyle{ E[(w(x))^2] \geq (E[|w(x)|])^2 }[/math], which could be derived using variance properties: [math]\displaystyle{ 0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2 }[/math].

Importance Sampling and Markov Chain Monte Carlo (MCMC) - June 4, 2009