markov Random Fields for Super-Resolution: Difference between revisions

From statwiki
Jump to navigation Jump to search
(Created page with "<center> A Summary on <br /> '''Markov Networks for Super-Resolution''' <br /> by <br /> W. T. Freeman and E. C. Pasztor <br /> </center> == Introduction == There are some appl...")
 
No edit summary
Line 104: Line 104:
According to Freeman, in this method they collect many pairs of high resolution images and their corresponding low-resolution images as the training set. Given low-resolution image y a typical bicubic interpolation algorithm is employed to create a high resolution image which is interpreted as the “image” in Markov network. Of course this image does not look suitable, and thus they try to estimate original high resolution image which in we may call it “scene” image based on the definitions provided above. In order to do that, the images in training set and also the low-resolution image is divided into patches so that each patch represents the Markov network node (figure1). Therefore, <math>y_i</math>’s in figure are observed, and thus should be shaded. Subsequently, for each patch in y 10 or 20 nearest patches from the training database are selected using Euclidian distance. The ultimate job is to find the best patch in the candidate set for each patch in y using MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.
According to Freeman, in this method they collect many pairs of high resolution images and their corresponding low-resolution images as the training set. Given low-resolution image y a typical bicubic interpolation algorithm is employed to create a high resolution image which is interpreted as the “image” in Markov network. Of course this image does not look suitable, and thus they try to estimate original high resolution image which in we may call it “scene” image based on the definitions provided above. In order to do that, the images in training set and also the low-resolution image is divided into patches so that each patch represents the Markov network node (figure1). Therefore, <math>y_i</math>’s in figure are observed, and thus should be shaded. Subsequently, for each patch in y 10 or 20 nearest patches from the training database are selected using Euclidian distance. The ultimate job is to find the best patch in the candidate set for each patch in y using MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.


== Implementation ==
== Implementation of low-level vision problems using Markov network ==


The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, the principle component analysi (PCA) is applied for each patch to find a set of lower dimentional basis function. Moreover, potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over joint spaces <math>x_i*x_j </math> and <math> x_j*x_k </math>;however, it is very difficult. The authors prefer a discrete representation where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch.  
The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, the principle component analysi (PCA) is applied for each patch to find a set of lower dimentional basis function. Moreover, potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over joint spaces <math>x_i*x_j </math> and <math> x_j*x_k </math>;however, it is very difficult. The authors prefer a discrete representation where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch.  


For each patch in "image" y a set of 10 or 20 “scene” candidates from the training set are chosen. Figure.. illustrates an example of each patch in y and the associated "scene" candidates.
For each patch in "image" y a set of 10 or 20 "scene" candidates from the training set are chosen. Figure.. illustrates an example of each patch in y and the associated "scene" candidates.
 
=== Learning the Potential (Compatibility) Functions ===
 
The potential functions are defined arbitrary, but they have to be introduced wisely. In this paper, a simple way is used to find potential functions. They assume “scene” patches have overlap shown is figure….
Therefore, the scene patches themselves may be use to define potential functions <math>\psi</math> between nodes <math>x_i</math>’s. Recall for node <math>x_i</math> and its neighbor <math>x_j</math> there are two sets of candidate patches. Let assume the lth candidate in node j and the mth candidate in node k have some overlap. Also, we can think that the pixels in the overlapped region in <math>x_j</math> (<math>d^l_{kj}</math>) and their correspondent in <math>x_k</math> (<math>d^m_{jk}</math>) are some variation of each other , and thus eventually the potential function between node <math>x_j</math> and node <math>x_k</math> are given by:
 
<center><math>
\psi(x^l_k,x^m_j) = exp \frac{-|d^l_{jk}-d^m_{kj}|}{2\sigma^2_s}
</math></center>
 
Where <math>\sigma_s</math>  has to be determined. The authors assume that the image and scene training samples differ from the “ideal” or original high resolution image by Gaussian noise of covariance \sigma_i and \sigma_s, respectively. Threfore, \psi function between node j and k can be represented by a matrix whose lth row and mth column is \psi(x^l_k, x^m_j).
Potential function \phi which is defined between “scene” node x and “image” node y is determined based on another intuitive assumption. We say that a “scene” candidate xkl is compatible with an observed image patch y0 if the image patch, ykl, associated with the scene candidate xkl in the training database matches y0. Ofcourse, it will not exactly match, but we may suppose that the training data is “noisy” version of original image resulting in:
Eq(23)

Revision as of 08:17, 12 November 2011

A Summary on
Markov Networks for Super-Resolution
by
W. T. Freeman and E. C. Pasztor

Introduction

There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.

Markov Networks for low-level vision problems

A common graphical model for low-level vision problems is Markov networks. For a given "image", [math]\displaystyle{ y }[/math], the underlying "scene" [math]\displaystyle{ x }[/math] should be estimated. The posterior probability, [math]\displaystyle{ P(x|y)= cP(x,y) }[/math] is calculated considering the fact that the parameter [math]\displaystyle{ c = 1/P(y) }[/math] is a constant over [math]\displaystyle{ x }[/math]. The best scene estimate [math]\displaystyle{ \hat{x} }[/math] is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the [math]\displaystyle{ \hat{x} }[/math] is difficult to compute. Therefore the “image” and “scene” are divided into patches and one node of the Markov network is assigned to each patch. Figure File:1 depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors.

To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” [math]\displaystyle{ x }[/math] and the “image” [math]\displaystyle{ y }[/math] is given by:

[math]\displaystyle{ P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1) }[/math]

where [math]\displaystyle{ \psi }[/math] and [math]\displaystyle{ \phi }[/math] are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:

[math]\displaystyle{ \hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2) }[/math]
[math]\displaystyle{ \hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3) }[/math]

For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network wich are trees or chains.

Inference in Networks without loops

For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate. For example for the network in Fig.3 the MAP estimation for node [math]\displaystyle{ j }[/math] is determined by:

[math]\displaystyle{ \begin{matrix} \hat{x}_{MAP} & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\ & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x2,x3)\\ & = & \arg\max_{x_1} \phi(x_1,y_1) (max_{x_2} \psi(x_1,x_2) \phi(x_2,y2)) (max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)\\ \end{matrix} }[/math]

The similar expressions for [math]\displaystyle{ x_{2MAP} }[/math] and [math]\displaystyle{ x_{3MAP} }[/math] can be used. Equations (3) and (2) can be computed by iterating the following steps. The MAP estimate at node j is

[math]\displaystyle{ \hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5) }[/math]

Where k runs over all “scene” node neighbors of node j, and [math]\displaystyle{ M^k_j }[/math] is the message from node k to node j. The [math]\displaystyle{ M^k_j }[/math] message is calculated by:

[math]\displaystyle{ M^k_j = max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{i!=j} \hat{M}^l_k (6) }[/math]

where [math]\displaystyle{ \hat{M}^l_k }[/math] is [math]\displaystyle{ M^k_l }[/math] from the previous iteration. The initial [math]\displaystyle{ \hat{M}^k_j }[/math]'s are set to column vector of 1’s, with the same dimension as [math]\displaystyle{ x_j }[/math]. Because the initial messages are 1’s at the first iteration, all the message in the network are:

[math]\displaystyle{ M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7) }[/math]
[math]\displaystyle{ M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8) }[/math]
[math]\displaystyle{ M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9) }[/math]
[math]\displaystyle{ M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10) }[/math]

The second iteration uses the messages above as the [math]\displaystyle{ \hat{M} }[/math] variables in Eq(6) :

[math]\displaystyle{ M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11) }[/math]
[math]\displaystyle{ M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12) }[/math]
[math]\displaystyle{ M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13) }[/math]
[math]\displaystyle{ M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14) }[/math]

And thus

[math]\displaystyle{ M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) * max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15) }[/math]

Eventually the MAP estimates for [math]\displaystyle{ x_1 }[/math] becomes:

[math]\displaystyle{ \hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16) }[/math]

The MMSE estimate Eq(3) has analogous formulate with the [math]\displaystyle{ max_{x_k} }[/math] of Eq(8) replaced by [math]\displaystyle{ \sum_{x_k} }[/math] and [math]\displaystyle{ \arg\max_{x_j} }[/math] replaced by [math]\displaystyle{ \sum_{x_j} x_j }[/math].

Networks with loops

Although the belief propagation algorithm is derived for the networks without loops, (Weiss, 1998, Weiss and Freeman 1999;Yedidia et al. 2000) demonstrate applying the propagation rules works even in the network with loops figure...


General view of the paper

Basically this paper aims to develop a new super-resolution scheme utilizing Markov random fields by which given low resolution image is resized to required high resolution size reproducing high frequency details. Typical interpolation techniques such as bilinear, nearest neighbor and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and in some cases blocking artifact.

According to Freeman, in this method they collect many pairs of high resolution images and their corresponding low-resolution images as the training set. Given low-resolution image y a typical bicubic interpolation algorithm is employed to create a high resolution image which is interpreted as the “image” in Markov network. Of course this image does not look suitable, and thus they try to estimate original high resolution image which in we may call it “scene” image based on the definitions provided above. In order to do that, the images in training set and also the low-resolution image is divided into patches so that each patch represents the Markov network node (figure1). Therefore, [math]\displaystyle{ y_i }[/math]’s in figure are observed, and thus should be shaded. Subsequently, for each patch in y 10 or 20 nearest patches from the training database are selected using Euclidian distance. The ultimate job is to find the best patch in the candidate set for each patch in y using MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.

Implementation of low-level vision problems using Markov network

The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, the principle component analysi (PCA) is applied for each patch to find a set of lower dimentional basis function. Moreover, potential functions [math]\displaystyle{ \psi }[/math] and [math]\displaystyle{ \phi }[/math] should be determined. A nice idea is to define Gaussian mixtures over joint spaces [math]\displaystyle{ x_i*x_j }[/math] and [math]\displaystyle{ x_j*x_k }[/math];however, it is very difficult. The authors prefer a discrete representation where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch.

For each patch in "image" y a set of 10 or 20 "scene" candidates from the training set are chosen. Figure.. illustrates an example of each patch in y and the associated "scene" candidates.

Learning the Potential (Compatibility) Functions

The potential functions are defined arbitrary, but they have to be introduced wisely. In this paper, a simple way is used to find potential functions. They assume “scene” patches have overlap shown is figure…. Therefore, the scene patches themselves may be use to define potential functions [math]\displaystyle{ \psi }[/math] between nodes [math]\displaystyle{ x_i }[/math]’s. Recall for node [math]\displaystyle{ x_i }[/math] and its neighbor [math]\displaystyle{ x_j }[/math] there are two sets of candidate patches. Let assume the lth candidate in node j and the mth candidate in node k have some overlap. Also, we can think that the pixels in the overlapped region in [math]\displaystyle{ x_j }[/math] ([math]\displaystyle{ d^l_{kj} }[/math]) and their correspondent in [math]\displaystyle{ x_k }[/math] ([math]\displaystyle{ d^m_{jk} }[/math]) are some variation of each other , and thus eventually the potential function between node [math]\displaystyle{ x_j }[/math] and node [math]\displaystyle{ x_k }[/math] are given by:

[math]\displaystyle{ \psi(x^l_k,x^m_j) = exp \frac{-|d^l_{jk}-d^m_{kj}|}{2\sigma^2_s} }[/math]

Where [math]\displaystyle{ \sigma_s }[/math] has to be determined. The authors assume that the image and scene training samples differ from the “ideal” or original high resolution image by Gaussian noise of covariance \sigma_i and \sigma_s, respectively. Threfore, \psi function between node j and k can be represented by a matrix whose lth row and mth column is \psi(x^l_k, x^m_j). Potential function \phi which is defined between “scene” node x and “image” node y is determined based on another intuitive assumption. We say that a “scene” candidate xkl is compatible with an observed image patch y0 if the image patch, ykl, associated with the scene candidate xkl in the training database matches y0. Ofcourse, it will not exactly match, but we may suppose that the training data is “noisy” version of original image resulting in: Eq(23)