# markov Random Fields for Super-Resolution

A Summary on
Markov Networks for Super-Resolution
by
W. T. Freeman and E. C. Pasztor

## Introduction

There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.

## Markov Networks for low-level vision problems

A common graphical model for low-level vision problems is Markov networks. For a given "image", $y$, the underlying "scene" $x$ should be estimated. The posterior probability, $P(x|y)= cP(x,y)$ is calculated considering the fact that the parameter $c = 1/P(y)$ is a constant over $x$. The best scene estimate $\hat{x}$ is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the $\hat{x}$ is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure
Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.
depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors.

To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” $x$ and the “image” $y$ is given by:

$P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)$

where $\psi$ and $\phi$ are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j[/itex] by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:

$\hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2)$
$\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)$

For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network which are trees or chains.

### Inference in Networks without loops

For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate. For example for the network in Fig.2
Fig.2 Example Markov network without any loop, used for belief propagation example described in text.
the MAP estimation for node $j$ is determined by:
$\begin{matrix} \hat{x}_{MAP} & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\ & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x2,x3)\\ & = & \arg\max_{x_1} \phi(x_1,y_1) (max_{x_2} \psi(x_1,x_2) \phi(x_2,y2)) (max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)\\ \end{matrix}$

The similar expressions for $x_{2MAP}$ and $x_{3MAP}$ can be used. Equations (3) and (2) can be computed by iterating the following steps. The MAP estimate at node j is

$\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)$

Where k runs over all “scene” node neighbors of node j, and $M^k_j$ is the message from node k to node j. The $M^k_j$ message is calculated by:

$M^k_j = max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{i!=j} \hat{M}^l_k (6)$

where $\hat{M}^l_k$ is $M^k_l$ from the previous iteration. The initial $\hat{M}^k_j$'s are set to column vector of 1’s, with the same dimension as $x_j$. Because the initial messages are 1’s at the first iteration, all the message in the network are:

$M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)$
$M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)$
$M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)$
$M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)$

The second iteration uses the messages above as the $\hat{M}$ variables in Eq(6) :

$M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)$
$M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)$
$M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)$
$M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)$

And thus

$M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) * max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15)$

Eventually the MAP estimates for $x_1$ becomes:

$\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)$

The MMSE estimate Eq(3) has analogous formulate with the $max_{x_k}$ of Eq(8) replaced by $\sum_{x_k}$ and $\arg\max_{x_j}$ replaced by $\sum_{x_j} x_j$.

### Networks with loops

Although the belief propagation algorithm is derived for the networks without loops, (Weiss, 1998, Weiss and Freeman 1999;Yedidia et al. 2000) demonstrate applying the propagation rules works even in the network with loops Fig.3
Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.

## General view of the paper

Basically this paper aims to develop a new super-resolution scheme utilizing Markov random fields by which given low resolution image is resized to required high resolution size reproducing high frequency details. Typical interpolation techniques such as bilinear, nearest neighbor and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and in some cases blocking artifact.

According to Freeman, in this method they collect many pairs of high resolution images and their corresponding low-resolution images as the training set. Given low-resolution image y a typical bicubic interpolation algorithm is employed to create a high resolution image which is interpreted as the “image” in Markov network. Of course this image does not look suitable, and thus they try to estimate original high resolution image which in we may call it “scene” image based on the definitions provided above. In order to do that, the images in training set and also the low-resolution image is divided into patches so that each patch represents the Markov network node (figure1). Therefore, $y_i$’s in figure are observed, and thus should be shaded. Subsequently, for each patch in y 10 or 20 nearest patches from the training database are selected using Euclidean distance. The ultimate job is to find the best patch in the candidate set for each patch in y using MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.

## Implementation of low-level vision problems using Markov network

The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, the principle component analysis (PCA) is applied for each patch to find a set of lower dimensional basis function. Moreover, potential functions $\psi$ and $\phi$ should be determined. A nice idea is to define Gaussian mixtures over joint spaces $x_i*x_j$ and $x_j*x_k$;however, it is very difficult. The authors prefer a discrete representation where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch.

For each patch in "image" y a set of 10 or 20 "scene" candidates from the training set are chosen. Figure4
Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.
illustrates an example of each patch in y and the associated "scene" candidates.

### Learning the Potential (Compatibility) Functions

The potential functions are defined arbitrary, but they have to be introduced wisely. In this paper, a simple way is used to find potential functions. They assume “scene” patches have overlap shown is Figure 5
Fig.5 The compatibility between candidate scene explanations at neighboring nodes is determined by their values in their region of overlap.

Therefore, the scene patches themselves may be use to define potential functions $\psi$ between nodes $x_i$’s. Recall for node $x_i$ and its neighbor $x_j$ there are two sets of candidate patches. Let assume the lth candidate in node j and the mth candidate in node k have some overlap. Also, we can think that the pixels in the overlapped region in $x_j$ ($d^l_{kj}$) and their correspondent in $x_k$ ($d^m_{jk}$) are some variation of each other , and thus eventually the potential function between node $x_j$ and node $x_k$ are given by:

$\psi(x^l_k,x^m_j) = exp \frac{-|d^l_{jk}-d^m_{kj}|}{2\sigma^2_s}$

Where $\sigma_s$ has to be determined. The authors assume that the image and scene training samples differ from the "ideal" or original high resolution image by Gaussian noise with covariance <math\sigma_i[/itex] and $\sigma_s$, respectively. Therefore, $\psi$ function between node j and k can be represented by a matrix whose lth row and mth column is $\psi(x^l_k, x^m_j)$. Potential function $\phi$ which is defined between "scene" node x and "image" node y is determined based on another intuitive assumption. We say that a "scene" candidate $x^k_l$ is compatible with an observed image patch $y_0$ if the image patch, $y^k_l$, associated with the scene candidate $x^k_l$ in the training database matches $y_0$. Of course, it will not exactly match, but we may suppose that the training data is “noisy” version of original image resulting in:

$\phi(x^l_k, y_k) = exp \frac{-|y^l_k - y_0|^2}{2\sigma^2_s}$

## Super-Resolution

For the super-resolution problem, the input is a low-resolution image, and thus the “scene” to be estimated is its high resolution version. At the first glance, the task may seem impossible since the high resolution data is missing. However, the human eye is able to identify edges and sharp details in low resolution image and we know these structural information should remain at higher resolution level. The authors attempt to solve this problem using aforementioned Markov model and they name the method VISTA. There are some preprocessing steps in order to increase the efficiency of the training set. First, consider three scales Laplacian pyramid decomposition. The first sub-band, H, represents the detail in high frequency while the second and the third sub-bands indicate the middle, M, and the low, L, frequency components. The assumption is that high frequency band, H, is conditionally independent of the lower frequency bands, given the middle frequency band, M, yielding:

$P(H|M,L) = P(H|M)$

Hence, to predict the high frequency components, we will need the middle frequency details, M, not the low frequency band, L. This hypothesis greatly reduces the computation costs. Second, The reaserchers in this paper assume that statistical relationships between image bands are independent of image contrast. Furthermore, they take the absolute value of the mid-frequency band, and then pass it through a lowpass filter resulting in a normalized mid frequency band. They also do the same procedure for high-frequency band. Figure .. and … show the original “scene” and the input “image” and also contrast normalized version of the high frequency components. The “image” and “scene” images are divided into local patches. The patch size is a difficult task since choosing small size gives very little information for estimating the underlying “scene” patches. On the other hand, large patches would make the learning processs of $\phi$’s very complex. The authors use 7*7 patch size for low frequency band and 3*3 for high frequency components.

Figure5