Difference between revisions of "compressed Sensing Reconstruction via Belief Propagation"

Introduction

One of the key theorem in digital signal processing is Shannon/Nyquist theorem. This theorem specifies the conditions on which a band limited signal can be reconstructed uniquely from its discrete samples. This property of band limited signals made signal processing viable on natural analog signals. However, in some applications even the sampled signal lays in a extremely high dimensional vector space. To make signal processing algorithms computationally tractable, many researchers work on compressiblity of signals. Here, it is assumed that most of the information content of a signal lays in a few samples with large magnitude. This lead to study and investigation on a class of signals, known as compressible signals.

Compressible signals can be stored efficiently by ignoring small value coefficients. Naturally it means that during sampling procedure loosing those samples is unimportant. This lead to a natural questions: Can we sample compressible signals in a compressed way? Is there any method to sense only those large value coefficients? In parallel works by Donoho <ref name="R1"> D. Donoho, “Compressed Sensing,” in IEEE Trans. on Info. theory, vol 52, no 4,pp. 1289–1306, Apr. 2006.</ref> and Candes et. al <ref name="R2"> E. Candes, J. Romberg, J.; T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” in IEEE Trans. on Info. theory, vol 52, no 2,pp. 489–509, Feb. 2006.</ref> this question was answered. They showed that this procedure can be done by measuring a small number of random linear projection of the source signal. They also provided a decoding scheme for reconstructing the source signal using those measurements and provided theoretical proofs for different aspects of this topic.

Tow important questions in CS theory is: 1. How many measurement do we need to be able to reconstruct a specific signal in stable and robust way? 2. What is the decoding algorithm complexity? These are vital questions which must be answered in order to make CS theory an alternative procedure for ordinary sampling. These questions have been answered by various researchers. Possibly the most important limitation of CS theory is the complexity of the current decoding algorithms. Those algorithms with strong theoretical background which can be applied to different applications are computationally expensive and can not be used for large scale problems. For this reason one of the active research topics in this field, is to develop tractable CS decoding algorithms. A recently published paper <ref name="R3"> D. Baron, S. Sarvotham, and R.G. Baraniuk, “Compressive Sensing Via Belief Propagation ,” in IEEE Trans. on Signal Processing, vol 58, no 1,pp. 269–280, Jan. 2010.</ref> tries to provide a decoding algorithms via belief propagation (BP). Authors connect CS to graphical models in an interesting approach and then use BP on the corresponding graph to provide a decoding algorithm. In the current paper summary we will follow their approach which indicates the importance of graphical models and the wide range of problems that these models can be used on. It is shown that this approach leads to a decoding algorithm with much less computational load.

Compressed Sensing

Compressive sensing is a method for sampling and representing compressible signals at below the Nyquist rate measurement rates. Let $\ x$ is a $N \times 1$ column vector in $\mathbb{R}^N$ that only has $K \le N$ non-zero entries. To measure this source signal we measure a small number of linear combinations of its elements, $\ M$, as follows

$\ y=\psi x ,$ (1)

where $\ y \in \mathbb{R}^M$ is the vector containing the measurements and $\psi \in \mathbb{R}^{M \times N}$ is a matrix that carries the coordinate basis we use in our application. The goal of CS theory is, given $\psi$ and $\ y$,to recover the source signal in a stable and robust approach.

It is obvious that this problem is ill-posed and infinitely many solutions satisfy (1). Since we assume that our source signal is sparse we can take advantage of this knowledge and assert sparsity as prior to limit the feasible region of solutions. Interestingly it has been proven that under this priority if the source signal is sparse enough then it can be recovered as the solution of optimization problem <ref name="R1"/> and <ref name="R2"/>:

$\ \hat{x}=\textrm{argmin} ||x||_0 , \textrm{s.t.} y=\psi x$ (2)

where $\ ||.||_0$ denotes $\ l_0$ norm that measures the number of non-zero entries. Unfortunately $\ l_0$ optimization is NP hard and needs combinatorial enumeration and thus is intractable. Interestingly, it has been proven that if the sensing matrix satisfies restricted isometry property (RIP), then (2) can be equivalently solves as:

$\ \hat{x}=\textrm{argmin} ||x||_1 , \textrm{s.t.} y=\psi x$ (3)

Here we have replaced $\ l_0$ norm by $\ l_1$ norm which is convex and robust towards additive noise. This optimization can be solved efficiently using linear programming and reconstruction complexity is $\ O(N^3)$ . It has been shown that matrices with independently identical distributed Gaussian or Bernoulli entries satisfy RIP condition and can be used effectively as sensing matrix. In such cases the linear programming decoder requires $\ K log(1+N/K)$ measurements for signal reconstruction.

Compressed Sensing Reconstruction Algorithms

Linear programming is the core approach for designing CS decoding algorithms. This approach is tractable with complexity of $\ O(N^3)$, however this complexity is intractable for large scale problems. For example a $\ 256\times 256$ image will need $\ 2^{48}$ operations for decoding! One the other hand the encoding stage will need $\ O(NM)$ computations for a dense sensing matrix. For this reason one important direction in this filed is to design and simplify the encoding and decoding algorithms.

One alternative decoding approach is iterative greedy algorithm. This approach needs slightly more measurements but it decreases computational complexity significantly. Perhaps the most well know algorithms in this category are Orthogonal Matching Pursuit <ref name="R4"> J. Tropp and A. C. Gilbert, “Signal recovery from partial information via orthogonal matching pursuit,” Apr. 2005, Preprint.</ref> and Matching Pursuit (MP) <ref name="R5"> EM. F. Duarte, M. B. Wakin, and R. G. Baraniuk, “Fast reconstruction of piece-wise smooth signals from random projections ,” in Proc. SPARS05, France, Nov. 2005.</ref>. OMP requires $\ 2K ln(N)$ measurements and the computational complexity is $\ O(KN^2)$. Even though it is better but it still is intractable for large $\ N$ and $\ K$. More advanced versions like StOMP can recover the signal in $\ Nlog(N)$ complexity. It is much faster and can be used for larger class of signals.

The common assumption in these approaches is that sensing matrix is assumed to be dense. Recently some researchers are trying to use sparse sensing matrix and use it to simplify the decoding algorithms. For example chaining pursuit uses this assumption and recovers the signal with complexity $\ K log^2(N) log^2(K)$. Table. 1 summaries CS schemes including their complexity as well as required measurements for each method. These results indicates that we can reduce the complexity only at expense of having more measurements. The last row indicates the algorithm of the current paper. This shows superiority of this algorithm. Later we will see that this superiority comes from using BP, though some simplifying assumption has been made and the method is not general proposed. (Click on the image for table enlargement)

Table.1 CS schemes.

Lessons from LDPC codes applied to CS

At first glance it may seem impossible or at lease complicated to connect CS and a graphical model. The main idea for this procedure comes from the world of error control coding. Sparse error control coding schemes such as LDPC strongly suggests the use of sparse compressed sensing matrices. Can we use sparse CS matrices to encode sparse signals efficiently? Can we take advantage of the sparse structure to reduce the encoding and decoding complexity? To answer these questions we will have a brief review on LDPC code which will clarify how we can assign a graphical model to a CS problem.

An LDPC code is a block code that has a sparse parity check matrix $\ H$. The parity check matrix of an LDPC code can be represented by a biparitite graph (see Fig. 1).

File:graph111.png
Fig.1 CS The graph representation of a CS proble.

There are two class of nodes in the graph. The first class represents the transmitted data and the second class represents the check bits. A check bit is resulted as the linear combination of transmitted data. An edge exists in the graph if a transmitted bit appears in linear combination which produces an arbitrary check bit. This graph representation is not a special characteristics of LDPC codes. What makes it special is that for these codes the graph is sparse.: the number of edges in the graph grows linearly with $\ N$ rather than $\ N^2$. Since the graph is sparse decoding can be done efficiently using BP algorithm on the corresponding graph. This the key idea that suggests that CS decoding algorithms may develop using BP. Though our graph can have cycles but experiments show that even for cyclic graphs BP provides an acceptable approximation of the real answer.

Connecting CS decoding to graph decoding algorithms

Similar to error codes, a sensing matrix can be represented by a biparitie graph.The neighbors of a given measurement node is the set of coefficient nodes that are used to generate that measurement. In order to follow the LDPC scheme we need to choose sensing matrix to be sparse. As the first step in using graphical models in CS we assume the sensing matrix to be very simple: we assume the entries are only belonging to the set $\ \{0,1,-1\}$. Hence the encoding procedure is very simple and only involves summation and difference. This assumption make the corresponding graph to be sparse and hence we can use message passing algorithms for inference (note message passing for a general graph, possibly not sparse, is NP hard problem). We assume that each row of $\ \psi$ contains exactly $\ L$ non zero elements. This number is chosen based on properties of the source signal.

Encoding complexity

If the row weight of $\ \psi$ is equal to $\ L$ and we have $\ M$ measurements then the encoding complexity will be $\ O(ML)$. Choosing $\ M=Nlog(N)/L$ experimental seems to be appropriate and thus $\ O(ML)= O(Nlog(N))$

CS-LDPC decoding of sparse signals

In this section we focus on approximately sparse signals. Similar to other inference problems which use BP we need to know probability distributions. Here we will assume a prior on source signal and then use it in BP procedure.

Source model

Since we want to use graphical model techniques it is necessary to assert a prior on the source signal. Here we will model the source signal by mixture of Gaussians model. This model is simple and captures prior knowledge about the source sparsity.

Let $\ X=[X(1),...,X(N)]$ be a random vector in $\mathbb{R}^N$ and $\ x=[x(1),...,x(N)]$ be an instance of $\ X$. An approximately sparse signal has a few large magnitude coefficients and a lot of small magnitude coefficients. One can assume that we have two states for coefficients. One state is "high" which corresponds to large magnitude coefficients and the other state is "low" which corresponds to small magnitude coefficients. We denote the state of the j'th coefficient by $\ q(j)$, where it is either equal to 1 (high) or 0 (low). The vector $\ q=[q(1),...,q(N)$ which contains these state variables determines the source state configuration. We assume that this vector is an outcome of the state random variable $\ Q=[Q(1),...,Q(N)]$ which is simply assumed to have multinomial distribution. For each $\ x(j)$ we associate for the "high" state an outcome of a high variance Gaussian distribution and for the "low" state an outcome of a low variance Gaussian distribution to model the coefficients. Thus:

$\ X(j)|Q(j)=1 \approx N(0,\sigma_1^2) \textrm{and} X(j)|Q(j)=0 \approx N(0,\sigma_0^2)$ (4)

We assume that coefficients are distributed independently. To have about $\ K$ large magnitude coefficients we choose the Bernoulli parameter of the $\ Q$ equal to $\ S=K/N$. Fig. 2 illustrates the mixture of Gaussian model PDF and Fig. 3 is an outcome of such a distribution for $\ N=50, K=5, \sigma_1^2=30, \sigma_0^2=1$.

File:fig111.png
Fig.2 The mixture of Gaussian model PDF .
File:fig112.png
Fig.3 sample outcome of the mixture of Gaussian model.
Decoding via statistical inference

The framework that was set transforms the original CS problem to a Bayesian inference problem. As we have seen in the course we can used message passing BP algorithm to solve these problems over factor graphs.

As we know (1) is undetermined linear system and assuming $\ \phi$ to be full rank the solution lie in a hyperplane of dimension $\ N-M$. We use MAP and MMSE estimations to for the CS reconstruction. Note that MMSE and MAP estimates are identical for multivariate Gaussian rv's but we need both of them to derive the solution.

$\ \hat{x}_{MMSE}= argmin_{x'}E(X-x') \textrm{s.t}|: y=\phi x'$ (5)
$\ \hat{x}_{MAP}= argmin_{x'}P(X=x') \textrm{s.t} :y=\phi x'$ (6)

First we derive the exact MMSE estimate of the source in our framework. Unfortunately it has exponential complexity and is intractable for large scale systems. Due to sparse structure assumption of $\phi$ we can use message passing scheme to estimate $\ \phi$ approximately but in a computationally robust and stable approach.

Exact solution to CS statistical inference

To obtain MAP and MMSE estimates first we will assume $\ q$ is known and will derive the MMSE estimate given $\ q$. Later we use this estimate to derive MMSE estimate. Let covariance matrix of $\ X$ conditioned on $\ q$ is denoted by $\ \Sigma_q$. It is a diagonal matrix with the j'th diagonal entry equal to $\ \sigma_0^2$ if the state is low and $\ \sigma_1^2$ if the state is high.

Theorem 1

Given $\ y=\psi x$, then the MMSE and the MAP estimate of $\ x$ conditioned on knowing the state vector is given:

$\ \hat{x}_{MMSE,q}=\hat{x}_{MAP,q}=\Sigma_q \Phi^T (\Phi\Sigma_q \Phi^T)^{-1}y$ (7)

Proof: If $\ q$ is known then we have:

$\ P[X=x|Q=q]=\frac{1}{(2\pi)^{N/2}det(\Sigma_q)^{1/2}} e^{-\frac{1}{2}x^T\Sigma_q^{-1}x}$ (8)

We do change of variable $\ \theta=\Sigma_q^{-1/2} x'$. Then we have $\ \theta_{MAP}=argmin_{\theta}{\theta^T\theta}$ such that $\ y=\Phi \Sigma_q^{1/2}\theta$. This is a simple least square problem and the solution can be computed using pseudo inverse: $\ |theta_{MAP}=\Sigma_q^{1/2} \Phi^T (\Phi\Sigma_q \Phi^T)^{-1}y$. Inverse change of the variable gives the result. Finally as mentioned MMSE and MAP estimates are the same for multivariate Gaussian random variables.

Now we use this conditional solution to result in the general solution.

Theorem 2

The MMSE estimate of $\ x$ when the state configuration is unknown is given by

$\ \hat{x}_{MMSE}=\Sigma_{q\in[0,1]^N}P[Y=y|Q=q]P[Q=q]\hat{x,q}_{MMSE}$ (8)

The expected value of MMSE error obeys the equalities:

$\ \hat{x}_{MMSE}=E[X|Y=y]=\Sigma_{q\in[0,1]^N}E[X,Q=q|Y=y]=\Sigma_{q\in[0,1]^N}P[Q=q|Y=y]E[X|Q=q,Y=y]$ $\ =\Sigma_{q\in[0,1]^N}P[Q=q|Y=y]\hat{x}_{MMSE,q}=\Sigma_{q\in[0,1]^N}P[Y=y|Q=q]P[Q=q]\hat{x}_{MMSE,q}$

which completes the proof.

Since the summation is done on all outcomes of $\ q$, it is intractable for large $\ N$. However sparse structure of sensing matrix makes it possible to use message passing algorithms such as BP to approximate the solution. It is tried to use a message passing approach to compute the approximate marginals for $\ [X(j)|Y]$ and thus evaluate the MMSE and MAP estimates for each coefficient.

Approximate solution to CS statistical inference via message passing

We employ BP over the model factor graph. The factor graph shown in Fig. 4 captures the relationship between the signal coefficients, their states, and the observed CS measurements. We have two types of nodes: variable nodes (black) and constraint nodes (white). All edge connects a variable node to a constraint node. We have three types of variables: states $\ Q(j)$, coefficients $\ X(j)$ and the measurements $\ Y(i)$ and three types of constraint nodes: prior constraint nodes which impose Bernoulli prior on state variables, constraint nodes which connects the state variables and the coefficient variables which impose the conditional distribution of the coefficient value given the state and the third type connect one measurements to the corresponding coefficients. We employ BP to infer the probability distributions of the coefficient and state variables conditioned on state variables, i.e. $\ P[X(j)|Y=y]$,$\ P[Q(j)|Y=y]$. Then we can derive MAP/MMSE estimate for each coefficients. We review some interesting details: 1. We have only two types of message passing:multiplication of beliefs at the variable nodes and convolution at the constraint node. 2. We propose two strategies to encode the beliefs:

• Approximating the distribution by a mixture of Gaussian with a given number of components and we use the parameters of the mixture of the Gaussian as the messages.
• We sample PDF's and use the samples as the messages.
File:cor.png
Fig.4 Factor graph for a CS reconstruction problem.
Decoding complexity

Consider the sampled PDF approach first. Let each message be a vector of $\ p$ samples. At a variable node we multiply $\ R+1$ vectors of length $\ p$ so for one iteration we have $\ O(NRp)$ complexity. At a constraint node we perform convolution in DFT domain so for each node the computational complexity is $\ O(Lp+Lplog(p))=O(Lplog(p))$ and overall $\ O(MLplog(p))$ for $\ M$ nodes. For $\ log(N)$ iterations the overall complexity is $\ O(MLplog(p)log(N))$. As a rule of thumb we chose $\ p=\sigma_1/\sigma_0$ and hence the complexity can be expressed as $\ O(ML\frac{\sigma_1}{\sigma_0}log(\frac{\sigma_1}{\sigma_0}))=O(Npolylog(N))$. In the second scenario let $\ m$ be the maximum model order that we deal with. The complexity per iteration for each coefficient nose is $\ O(m^2R^2)$ and $\ O(Nm^2R^2)$ for $\ N$ nodes. Similarly the complexity for $\ M$ constraint nodes is given by $\ O(m^2L^2M)$. Thus the over all complexity for $\ log(N)$ iterations is: $\ O(m^2LM(R+L)log(N))$. Assuming $\ L=O(N/K)$, $\ R=O(log(N/K))$, $\ M=O(Klog(N/K))$, the overall complexity is $\ O(m^2log(1/S)log(N)(N/S))$.

Numerical results

The results are taken from the reference paper. The performance is studied for several level of sparsity and $\ \sigma_1/\sigma_0$ ratios. Fig. 5 shows results for this purpose ($\ N=6000, \sigma_1/\sigma_0=30$ ). Note as $\ L$ is increased we require less measurements to obtain the same accuracy.

File:res.png
Fig.5 Reconstruction Results.

Conclusions

In this paper BP was employed successfully for problem of CS reconstruction. Though some assumptions are made which limits the algorithm in practice but it lead to promising results. The proposed method is much faster than previous methods and requires less measurements. One good research direction for future is to try to propose non parametric algorithms for CS problem.

Donoho's compressed sensing

In the introduction, the work proposed earlier by Donoho was referred to. Here we try to give an overview of such work. The model supposes the following, assuming $x$ is a signal that requires $m$ samples to be reconstructed, and that it is known a priori that $x$ is compressible by transform coding with a known transform, and it is possible to acquire data of $x$ by measuring $n$ general linear functions, rather than the ordinary pixels. If the collection of linear functions is well chosen, and a certain degree of reconstruction error is allowed, the size of $n$ can be massively smaller than $m$. Therefore, certain natural classes of signals or images with $m$ pixels need only $n = O(m^{\frac{1}{4}}log^{\frac{2}{5}}(m))$ non-adaptive non-pixel samples for faithful recovery, as opposed to the ordinary $m$-pixel samples.

Donoho supposes that the object has a sparse representation in some orthonormal basis, i.e. the coefficients belong to an $l^p$ ball for $0 \lt p \leq 1$. This implies that the $N$ most important coefficients in the expansion allow a reconstruction with $l^2$ error $O(N^{\frac{1}{2}-\frac{1}{p}})$. It is possible to design $n=O(N*log(m))$ non-adaptive measurements which contain the information necessary to reconstruct any such object with accuracy comparable to the accuracy which would be possible if the $N$ most important coefficients of that object were directly observable. In addition, a good approximation to those $N$ important coefficients may be extracted from the $n$ measurements by solving a suitable linear program called Basis Pursuit. The non-adaptive measurements own the character of random linear combinations of basis elements.

Experiments showed that most subspaces are near-optimal, and that convex optimization can be used for processing information derived from such near-optimal subspaces.

<references />