compressive Sensing (Candes)

From statwiki
Revision as of 16:22, 1 December 2010 by Fatemeh (talk | contribs) (Formulation)
Jump to: navigation, search


In order to make quantitative observations about our environment, we must often acquire signals. And to acquire these signals, we must have some minimum number of samples in order to exactly reconstruct the signal. This minimum number of samples, or rather the minimum rate at which the signal must be sampled, is called the Nyquist rate.

So, we have been stuck with either following the Nyquist criterion, or else being left to deal with imprecise results, corrupted with artifacts like aliasing. This is particularly problematic when sampling is very expensive. For example, what if the measurement device is expensive? Or what if the time requirements are particularly expensive, as in the case of MRI?

Recently, work by Emmanuel Candes and others <ref name="CandesWakin">E. Candes and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine,25(2):21-30,2008</ref><ref name="CandesTao">E. Candes and T. Tao. Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies? IEEE Transactions on Information Theory, 52(12):5406-5425, 2006.</ref> demonstrated that it is possible to exactly reconstruct some signals with undersampled data. If a signal is sparse, meaning that it does not have many features, we can take far fewer measurements and utilize knowledge of the structure of the signal to infer the rest. And, it turns out, the results are astonishingly good; we can exactly reconstruct sparse signals with far fewer measurements than were previously thought possible.

The following summarizes the work presented in <ref name="CandesWakin" />.


A sampled signal can be modeled as

[math]y = \Phi x[/math]

where [math]y[/math] is an [math]m[/math] dimensional vector of sampled values, [math]x[/math] is an [math]n[/math] dimensional vector of the exact signal, and [math]\Phi[/math] is a [math]m[/math] by [math]n[/math] matrix which samples the exact signal. An example of this could be [math]y[/math] as the CCD sensor measurements in a digital camera, [math]\Phi[/math] the acquisition process, and [math]x[/math] the real world. If the signal is well-posed, meaning we've done a sufficient amount of sampling, we'll be able to reconstruct [math]x[/math] perfectly. But what if [math]m \lt \lt n[/math], meaning that we've undersampled? This becomes an ill-posed or under-determined problem.

Now, consider that many signals can be summarized succinctly in some other basis - that is, the problem of compression or dimensionality reduction. Returning to the camera example, the camera may apply a discrete cosine transform (DCT) for JPEG compression in order to reduce the amount of space required to store the image. So, we know it is possible to compress sparse signals, but only once we have a sufficiently large number of samples.

Compressive sensing combines these two stages, sensing (sampling) and compression, in order to solve this under-determined problem. It asks, if we can compress our exact signal [math]x[/math] into some compressed basis [math]\Phi[/math] where it can be represented sparsely as a signal [math]s[/math], then can we solve that same under-determined system if [math]m[/math] is approximately equal to [math]s[/math]? If our sparse signal [math]s[/math] is k-sparse, meaning it has [math]k[/math] nonzero elements, we can fix our solution for [math]s[/math] in [math]k[/math] dimensions, and do some kind of optimization for the remaining [math]n-k[/math] elements. Mathematically,

[math]y = \Phi x = \Phi \Psi s = \Theta s[/math]

Where [math]\Psi[/math] is a compressed basis, and [math]\Theta[/math] is the compressive sensing matrix---the product of the compression and sensing bases. So, we can take some small number of samples [math]y[/math], compute the sparse representation [math]s[/math] of our exact signal [math]x[/math], and then apply the inverse compression approximation to recover [math]x[/math].

So the two fundamental premises underlying CS: sparsity and incoherence.

Sparsity:The implication of sparsity is now clear: when a signal has a sparse expansion, one can discard the small coefficients without much perceptual loss. Formally, consider [math]f_{S}(t)[/math] obtained by keeping only the terms corresponding to the [math]S[/math] largest values of [math](x_i)[/math] in the expansion [math]f(t)=\sum_{i=1}^{n}x_i\psi_i(x)[/math].By definition, [math]f_S =\Psi x_S[/math], where [math]x_S[/math] is the vector of coefficients [math](x_i)[/math] with all but the largest [math]S[/math] set to zero.

Incoherence: In plain English, the coherence measures the largest correlation between any two elements of [math]\Phi[/math] and [math]\Psi[/math]. If [math]\Phi[/math] and [math]\Psi[/math] contain correlated elements, the coherence is large. Otherwise, it is small.

Ideally, we would like to measure all the [math]n[/math] coefficients of [math]f[/math], but we only get to observe a subset of these and collect the data .But, to do this, we must answer three broad questions:

  • How do we pick [math]\Theta[/math]?
  • How can we do the optimization, and how can we be sure it will work?
  • Is this algorithm practical? How does it work with noise?

Choosing a Compressive Sensing Matrix [math]\Theta[/math]

To guarantee that that the compressive sensing matrix is stable, the authors derive the Restricted Isometry Property (RIP) in <ref name="CandesTao2">E. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Inform. Theory, vol. 51, no. 12, Dec. 2005.</ref>. Mathematically this property can be stated as,

[math](1-\delta_k)\|x\|_{\ell_2}^2 \le \|A x\|_{\ell_2}^2 \le (1+\delta_k)\|x\|_{\ell_2}^2[/math]

This formula implies that [math]A[/math] must be a distance preserving transformation for all k-sparse vectors [math]x[/math], bounded by some constant [math]\delta_k[/math], known as the restricted isometry constant. [math]A[/math] satisfies the RIP of order [math]k[/math] if [math]\delta_k[/math] is not too close to 1. When this property holds, all [math]k[/math]-subsets of the columns of [math]A[/math] are nearly orthogonal. If this property does not hold then it is possible for a [math]k[/math]-sparse signal to be in the null space of [math]A[/math] and in this case it may be impossible to reconstruct these vectors.

Evaluating the RIP property is itself intractable (NP-hard), so while it serves as a theoretical foundation for compressive sensing, it still does not reveal a great deal in telling us how to build the compressive sensing matrix.

To do this, let's have a look at how [math]\Theta[/math] is constructed.

[math]\Theta = \Phi \Psi[/math]

Is there something about these bases, or about their relationship with each other, that will guarantee RIP? Bases which are maximally incoherent will satisfy this restricted isometry property, so we can use this to guide us as we choose an appropriate compressive sensing matrix. Mathematically, coherence is defined as,

[math]\mu(\Psi, \Phi) = \sqrt{n} \max_{1 \le k, j \le n} |\langle \psi_k,\phi_j \rangle|[/math]

Note that this function takes on values between 1 and [math]\sqrt{n}[/math]. So, to find a suitable [math]\Theta[/math], we'll need to find two bases which are incoherent.

We know [math]\Phi[/math] will model the sampling phase and [math]\Psi[/math] the compression, so we must at least select matrices with these properties.

Based on the definition of coherence, spikes and sinusoids will be incoherent, so we could choose [math]\Phi[/math] as a matrix of impulses and [math]\Psi[/math] as a compression basis of sinusoids. Another option could be noiselets and wavelets for the sensing and compression, respectively. But must we be so restricted?

It turns out that random noise is incoherent with just about everything. Namely, in most cases, taking iid samples from some random distribution (i.e. a Gaussian with mean of zero or uniformally from a unit sphere in [math]\mathbb{R}^m[/math]) will result in a matrix violating the RIP property with exponentially small (in m) probability. Thus, using these sorts of sensing matrices (or others than we can prove have the same properties) are ideal to use our sampling matrix.

Now, we know how to construct a stable compressive sensing matrix [math]\Theta[/math], but how do we actually solve the under-determined system?


We can start with the standard approach---least squares. It's a convenient choice since it has a closed-form solution, but it turns out to give very bad results in this context.

What about the [math]\ell_0[/math] norm? Intuitively, this counts the number of zeros in the vector - that's exactly what we want to minimize! But, it turns out that we'd have to try every [math]{n \choose k}[/math] combination of zeros to find the solution, which is NP-hard, and thus intractable. The authors discovered that we can solve this using the [math]\ell_1[/math] norm and obtain exact results. They concluded that the reconstruction is nearly as good as what an oracle can provide us by a perfect knowledge of the object.

Figure 1: illustrates the exact result given by the [math]\ell_1[/math] compared to the inaccuracy of, for example, the [math]\ell_2[/math] norm.

So the problem we are trying to solve can be stated formally as:

[math]\min_{s \in \mathbb{R}^n} \|s\|_1 \quad \mbox{subject to} \quad y = \Theta s[/math].

This is a well-established problem known as basis pursuit. Basis pursuit problems can easily be transformed into a linear programming problem. Suppose we let every [math]s_i = s_i^+-s_i^-[/math] where both [math]s_i^+[/math] and [math]s_i^-[/math] are non-negative. Then [math]|s_i| = s_i^++s_i^-[/math] and the optimization problem can be rewritten as

[math]\min_{s^+,s^- \in \mathbb{R}^n} \sum_i s_i^++s_i^ \quad \mbox{subject to} \quad y = \Theta s^+ + \Theta s^- \mbox{ and } s^+, s^- \geq 0[/math].

This problem can be solved using the simplex method (linear programming), and has a worst-case complexity of [math]O(n^3)[/math].

Now, we know how to construct and solve this system. But, how well does it work? And, more specifically, how well does it work with noise?

Performance and Robustness

First, let's address the question of how well the technique works. The result matches intuition---the number of samples required to reconstruct a signal is related to the incoherence of bases, sparsity and the dimension of the signal we wish to reconstruct. The minimum bound of [math]m[/math], the number of samples, is given by,

[math]m \geq C K \log(\frac{N}{K})[/math]

if the bases are maximally incoherent, for example if we use random projections. In practice, the typical number of samples has been found to be approximately [math]m \ge 4k[/math].

However, even if [math]m[/math] is sufficiently small, this isn't practical if the method is not robust. Let's consider our basic formulation, and add some error,

[math]y = \Theta s + z,[/math]

where [math]z[/math] is some random noise, for example Gaussian white noise [math]\,z \sim N(0,1/m)[/math]. To solve this, let's relax the original [math]l_1[/math] formulation to

[math]\min_\tilde{s} \|\tilde{s}\|_{l_1} \quad \mbox{subject to} \quad \| \Theta \tilde{s} - y \| \le \epsilon[/math],

where [math]\epsilon[/math] is the error bound. This is closely related to the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm. Like the original [math]l_1[/math] formulation, it aims to minimize the sum of elements of [math]s[/math], making it appropriate for compressed sensing.

With noise considered, the authors report an error rate of,

[math]\|\tilde{s} - s \|_{l_2} \le C_0 \| s - s_K \|_{l_1} / \sqrt{k} + C_1 \epsilon,[/math]

where [math]s[/math] is the true signal, [math]s_k[/math] is the true signal where all but the [math]k[/math] largest non-zero entries are included and [math]\tilde{s}[/math] is the estimated values of [math]s[/math]. Intuitively, this means the total error is bounded by the error one would obtain without noise, and the error due to the noise itself. The authors report [math]C_0[/math] and [math]C_1[/math] to be typically small. This means the noise levels output by the system are directly proportional to the noise levels input. This is a very good result, since a bit of noise does not cause the algorithm to perform any worse.


The paper provides an idea of some ways compressive sensing can be used in practice:

  • Data Compression - if [math]\Psi[/math] is unknown or impractical for encoding (and used only for decoding), then a randomly generated [math]\Phi[/math] (i.e. that is non-RIP with exponentially small probability as discussed above) can be used for encoding.
  • Acquiring Data - It might be desirable (or the only practical option) to get a full set of discrete samples from a signal. CS can guide the creation of hardware to solve this problem.


We have seen that we can reconstruct signals with far fewer measurements than were previously thought possible. Now, the number of measurements is largely governed by the sparsity of the signal. Ultimately, this has huge implications for the foundations of electronic measurement and data acquisition.


<references />