Graphical models for structured classification, with an application to interpreting images of protein subcellular location patterns

From statwiki
Revision as of 15:18, 22 November 2011 by Fewzee

Background

In standard supervised classification problems, the label of each test instance is assumed to be independent of the labels of all other instances. In some problems, however, we receive multiple test instances at a time, along with side information about dependencies among their labels. For example, if each instance is a handwritten character, the side information might be that the string of characters forms a common English word; or, if each instance is a microscope image of a cell with a certain protein tagged, the side information might be that several cells share the same tagged protein. In such scenarios it is useful to incorporate the side information into the classifier to improve its accuracy. To solve such a structured classification problem in practice, we need both an expressive way to represent our beliefs about the structure and an efficient probabilistic inference algorithm for classifying new groups of instances. The aim of structured classification is to break a complex problem down into a number of simpler ones and to redefine the set of classification labels accordingly, so that the resulting problem can be solved more efficiently. There is, however, a direct conflict between expressive models and efficient inference: while graphical models such as factor graphs can represent arbitrary dependencies among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models used.

The paper proposes a new class of potential functions, called decomposable k-way potentials, and provides efficient algorithms for computing messages from these potentials during belief propagation. These potentials are intended to balance expressive power against efficient inference in practical structured classification problems. Three instances of decomposable potentials are discussed: the associative Markov network potential, the nested junction tree, and the voting potential. The authors report that the new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy.

Factor Graphs

The factor graph representation of a probability distribution describes the relationships among a set of variables [math]\displaystyle{ x_i }[/math] using local factors or potentials [math]\displaystyle{ \phi_j }[/math]. Factor graphs are useful because they can represent both directed and undirected models within a single framework. Each factor depends on only a subset of the variables, and the overall probability distribution is the product of the local factors, together with a normalizing constant [math]\displaystyle{ Z }[/math]:

[math]\displaystyle{ P(x) = \frac{1}{Z} \prod_{\text{factors } j} \phi_j(x_{V(j)}) }[/math]

Here [math]\displaystyle{ V(j) }[/math] is the set of variables that are arguments to factor [math]\displaystyle{ j }[/math]; for example, if [math]\displaystyle{ \phi_j }[/math] depends on [math]\displaystyle{ x_1, x_3 }[/math], and [math]\displaystyle{ x_4 }[/math], then [math]\displaystyle{ V(j) = \{1,3,4\} }[/math] and [math]\displaystyle{ x_{V(j)} = (x_1, x_3, x_4) }[/math].

Each variable [math]\displaystyle{ x_i }[/math] or factor [math]\displaystyle{ \phi_j }[/math] corresponds to a node in the factor graph. Fig. 1 shows an example: the large nodes represent variables, with shaded circles for observed variables and open circles for unobserved ones. The small square nodes represent factors, and there is an edge between a variable [math]\displaystyle{ x_i }[/math] and a factor [math]\displaystyle{ \phi_j }[/math] if and only if [math]\displaystyle{ \phi_j }[/math] depends on [math]\displaystyle{ x_i }[/math], that is, when [math]\displaystyle{ i \in V(j) }[/math]. (By convention the graph only shows factors with two or more arguments. Factors with just a single argument are not explicitly represented, but are implicitly allowed to be present at each variable node.) The notion of dependency is very general: it can model causal relations in directed graphs as well as symmetric relations in undirected models. The inference task in a factor graph is to combine the evidence from all of the factors to compute properties of the distribution over [math]\displaystyle{ x }[/math] represented by the graph. Naively, we can do inference by enumerating all possible values of [math]\displaystyle{ x }[/math], multiplying together all of the factors, and summing to compute the normalizing constant. Unfortunately, the total number of terms in the sum is exponential in the number of random variables in the graph. A better way to perform inference is usually a message-passing algorithm called belief propagation (BP). BP gives exact results on graphs that do not contain loops. On graphs with loops it can be applied iteratively; this often gives an acceptable approximation and has been reported to converge in reasonable time in numerous applications, although the theoretical justification behind loopy BP is still under study.
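The naive enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `brute_force_marginals`, the toy variables `x1`, `x2`, `x3`, and the factor values are hypothetical and do not come from the paper.

```python
from itertools import product

def brute_force_marginals(domains, factors):
    """Exact inference by enumerating every joint assignment.

    domains: dict mapping a variable name to its list of possible values.
    factors: list of (variables, function) pairs; each function takes the
             values of its variables and returns a non-negative number.
    Returns (Z, marginals) with marginals[v][value] = P(v = value).
    """
    names = list(domains)
    Z = 0.0
    unnorm = {v: {val: 0.0 for val in domains[v]} for v in names}
    # The number of joint assignments is exponential in len(names).
    for values in product(*(domains[v] for v in names)):
        x = dict(zip(names, values))
        weight = 1.0
        for variables, phi in factors:
            weight *= phi(*(x[v] for v in variables))
        Z += weight
        for v in names:
            unnorm[v][x[v]] += weight
    marginals = {v: {val: w / Z for val, w in unnorm[v].items()} for v in names}
    return Z, marginals

# Hypothetical toy model: three binary labels in a chain, two pairwise
# factors that favor agreement, and local evidence on x1 only.
domains = {"x1": [0, 1], "x2": [0, 1], "x3": [0, 1]}
agree = lambda a, b: 2.0 if a == b else 1.0
factors = [
    (("x1",), lambda a: 0.9 if a == 1 else 0.1),
    (("x1", "x2"), agree),
    (("x2", "x3"), agree),
]
Z, marginals = brute_force_marginals(domains, factors)
```

With three binary variables the loop visits only eight assignments, but the count doubles with every added variable, which is the motivation for message passing.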

Belief Propagation

To stay with the authors' notation, let [math]\displaystyle{ \phi_i^{loc}(x_i) }[/math] be the one-argument factor that represents the local evidence on [math]\displaystyle{ x_i }[/math]. Figure 1 shows the notation they use for graphs: the small squares denote potential functions and, as usual, the shaded and unshaded circles represent observed and unobserved variables respectively.

Fig.1: A probability distribution represented as a factor graph

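Using this notation, and numbering the factors adjacent to [math]\displaystyle{ x_i }[/math] as [math]\displaystyle{ \phi_1,...,\phi_k }[/math], the message sent from the variable [math]\displaystyle{ x_i }[/math] to the potential function [math]\displaystyle{ \phi_k }[/math] is the product of the local evidence and the messages arriving from the other factors: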

[math]\displaystyle{ m_{i \rightarrow k}(x_i)=\phi_i^{loc}(x_i)\prod_{j=1}^{k-1}m_{j \rightarrow i}(x_i)\text{ }(1) }[/math]

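Similarly, numbering the arguments of a potential function [math]\displaystyle{ \phi_j }[/math] as [math]\displaystyle{ x_1,...,x_k }[/math], the message from [math]\displaystyle{ \phi_j }[/math] to [math]\displaystyle{ x_k }[/math] is computed by multiplying the factor by the messages from its other arguments and summing those arguments out: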

[math]\displaystyle{ m_{j \rightarrow k}(x_k)=\sum_{x_1}\sum_{x_2}...\sum_{x_{k-1}}\phi_j(x_1,...,x_k)\prod_{i=1}^{k-1}m_{i \rightarrow j}(x_i)\text{ }(2) }[/math]
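Equation (2) can be written out directly as a brute-force sum. The following Python sketch is illustrative only; the function name `factor_to_variable_message`, the two labels, and the example numbers are assumptions made for the example, not part of the paper.

```python
from itertools import product

def factor_to_variable_message(phi, incoming, domains):
    """Message from a k-argument factor to its last argument x_k (equation 2).

    phi:      function of k label values.
    incoming: list of k-1 dicts; incoming[i][v] is the message from the
              (i+1)-th argument of phi to the factor, evaluated at label v.
    domains:  list of k label lists, one per argument of phi.
    """
    message = {}
    for xk in domains[-1]:
        total = 0.0
        # Sum over all joint settings of x_1..x_{k-1}: exponential in k.
        for others in product(*domains[:-1]):
            term = phi(*others, xk)
            for m, value in zip(incoming, others):
                term *= m[value]
            total += term
        message[xk] = total
    return message

# Hypothetical two-argument factor that rewards equal labels.
labels = ["A", "B"]
same = lambda a, b: 3.0 if a == b else 1.0
msg = factor_to_variable_message(same, [{"A": 0.8, "B": 0.2}], [labels, labels])
```

The nested summation is what makes large factors expensive: the inner loop runs once for every joint setting of the other arguments.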

General graphs

The above is easily applied when the graph is tree-shaped. For graphs with loops there are generally two alternatives. The first is to collapse groups of variable nodes into combined nodes, which can turn the graph into a tree and make it feasible to run exact belief propagation (BP). When a set of variable nodes is combined, the new node represents all possible joint settings of the original nodes. For example, if we collapse a variable [math]\displaystyle{ x_1 }[/math] that has settings [math]\displaystyle{ T;F }[/math] with a variable [math]\displaystyle{ x_2 }[/math] that has settings [math]\displaystyle{ A;B;C }[/math], then the combined variable [math]\displaystyle{ x_{12} }[/math] has settings [math]\displaystyle{ TA;TB;TC;FA;FB;FC }[/math]. The second alternative is to run an approximate inference algorithm that does not require a tree-shaped graph. The two techniques can also be combined. As an example, consider deriving a tree-shaped graph from the graph shown in Figure 1. Combining variables [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] yields the graph in Figure 2. The potentials [math]\displaystyle{ \phi_{23} }[/math] and [math]\displaystyle{ \phi_{123} }[/math] from the original graph have the same set of neighbors in the new graph, and so can be combined into one factor node. Similarly, the local potentials [math]\displaystyle{ \phi_1^{loc} }[/math] and [math]\displaystyle{ \phi_2^{loc} }[/math] can be combined with the factor [math]\displaystyle{ \phi_{12} }[/math] to form a new local potential at the collapsed node [math]\displaystyle{ x_{12} }[/math]. Notice that the new factor graph is tree-shaped, even though the original one had loops.

Fig.2: A tree-shaped factor graph representing the graph in Fig.1
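The collapsing step can be illustrated with a short Python sketch. The helper names `collapse` and `combined_local_potential` and all numeric values are hypothetical; only the settings T;F and A;B;C are taken from the example in the text.

```python
from itertools import product

def collapse(domain_a, domain_b):
    """Settings of the combined node: every pair of original settings."""
    return [a + b for a, b in product(domain_a, domain_b)]

def combined_local_potential(loc1, loc2, phi12, domain_a, domain_b):
    """Local potential at the collapsed node: the two original local
    potentials multiplied by the pairwise factor that joined the nodes."""
    return {a + b: loc1[a] * loc2[b] * phi12(a, b)
            for a, b in product(domain_a, domain_b)}

x1 = ["T", "F"]
x2 = ["A", "B", "C"]
x12 = collapse(x1, x2)

# Hypothetical potentials, chosen only to make the arithmetic easy to follow.
loc1 = {"T": 2.0, "F": 1.0}
loc2 = {"A": 1.0, "B": 3.0, "C": 1.0}
phi12 = lambda a, b: 2.0 if (a, b) == ("T", "A") else 1.0
loc12 = combined_local_potential(loc1, loc2, phi12, x1, x2)
```

The combined node has as many settings as the product of the original domain sizes, so collapsing trades loops for larger variables.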

Reducing a factor graph all the way down to a tree yields the equivalent junction tree. An important property of junction trees is the running intersection property: if a variable is present at two nodes A and B, then it is also present at every node along the path between A and B. Because the graph is a tree, the path between any two nodes is unique, so this condition is well defined for every pair of nodes.

Loopy Belief Propagation (LBP)

If a graph is collapsed all the way to a tree, inference can be done with the exact version of BP described above. If some loops remain, loopy belief propagation (LBP) is used instead. In LBP (as in BP), an arbitrary node is chosen to be the root and equations (1) and (2) are used. However, each message may have to be updated repeatedly before the marginals converge. Inference with LBP is approximate because it can double-count evidence: messages to a node [math]\displaystyle{ i }[/math] from two nodes [math]\displaystyle{ j }[/math] and [math]\displaystyle{ k }[/math] can both contain information from a common neighbor [math]\displaystyle{ l }[/math] of [math]\displaystyle{ j }[/math] and [math]\displaystyle{ k }[/math]. At first glance it is therefore not obvious that LBP should work at all, and its usefulness is supported mainly by experiments. If LBP oscillates between states and does not converge, the process can be stopped after some number of iterations. Oscillations can often be reduced by using momentum, which replaces the messages that were sent at time [math]\displaystyle{ t }[/math] with a weighted average of the messages at times [math]\displaystyle{ t }[/math] and [math]\displaystyle{ t-1 }[/math]. For either exact or loopy BP, the run time of each pass over the factor graph is exponential in the number of distinct original variables included in the largest factor. Therefore, inference can become prohibitively expensive if the factors are too large.
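The momentum update can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `damp`, the weight `alpha`, and the renormalization step are all assumptions.

```python
def damp(new_message, old_message, alpha=0.5):
    """Weighted average of the messages at times t and t-1.

    alpha is a hypothetical mixing weight: alpha = 1 keeps only the new
    message, smaller values move more slowly. Renormalizing to sum to 1
    is an added convention for numerical stability.
    """
    mixed = {v: alpha * new_message[v] + (1.0 - alpha) * old_message[v]
             for v in new_message}
    total = sum(mixed.values())
    return {v: w / total for v, w in mixed.items()}

# Two messages that flip back and forth are pulled toward each other.
previous = {"A": 0.9, "B": 0.1}
current = {"A": 0.1, "B": 0.9}
smoothed = damp(current, previous, alpha=0.5)
```

Averaging reduces the size of each step, which tends to dampen oscillation at the cost of slower convergence.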

Constructing factor graphs for structured classification

Factor graphs that encode "likely" label vectors are constructed in two steps. First, domain-specific heuristics are used to identify pairs of examples whose labels are likely to be the same, and these pairs are used to build a similarity graph with an edge between each such pair. Second, the similarity graph is used to decide which potentials to add to the factor graph. Given the similarity graph of the protein subcellular location pattern classification problem, factor graphs built using different types of potentials are compared in the following sections.

The Potts potential

The Potts potential is a two-argument factor which encourages two nodes [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] to have the same label:

[math]\displaystyle{ \phi(x_i,x_j)= \begin{cases} \omega & \text{if } x_i=x_j\\ 1 & \text{otherwise} \end{cases} \text{ }(3) }[/math]

where [math]\displaystyle{ \omega \gt 1 }[/math] is an arbitrary parameter expressing how strongly [math]\displaystyle{ x_i }[/math] and [math]\displaystyle{ x_j }[/math] are believed to have the same label. If the Potts potential is used for each edge in the similarity graph, the overall probability of a vector of labels [math]\displaystyle{ x }[/math] is as follows:

[math]\displaystyle{ P(x)=\frac{1}{Z}\prod_{\text{nodes } i}P(x_i)\prod_{\text{edges } i,j}\phi(x_i,x_j)\text{ }(4) }[/math]

where [math]\displaystyle{ Z }[/math] is a normalizing constant and [math]\displaystyle{ P(x_i) }[/math] represents the probability which the base classifier assigns to label [math]\displaystyle{ x_i }[/math] for node [math]\displaystyle{ i }[/math]. A model of the form of equation (4) is known as a Potts model. However, each Potts factor couples only two labels at a time, so it does not directly take into account the joint evidence provided by the labels of all of a node's neighbors.
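As a rough illustration of equation (4), the unnormalized probability of a label vector can be computed as below. The function name `potts_score`, the class names `er` and `golgi`, the three-node chain, and the base-classifier probabilities are hypothetical values chosen for the example.

```python
def potts_score(labels, base_probs, edges, omega):
    """Unnormalized probability of a label vector under equation (4).

    labels:     list with one label per node.
    base_probs: list of dicts; base_probs[i][label] is the base
                classifier's probability of that label for node i.
    edges:      list of (i, j) pairs from the similarity graph.
    omega:      Potts parameter, assumed to be greater than 1.
    """
    score = 1.0
    for i, label in enumerate(labels):
        score *= base_probs[i][label]
    for i, j in edges:
        score *= omega if labels[i] == labels[j] else 1.0
    return score

base_probs = [
    {"er": 0.7, "golgi": 0.3},
    {"er": 0.5, "golgi": 0.5},
    {"er": 0.4, "golgi": 0.6},
]
edges = [(0, 1), (1, 2)]
all_er = potts_score(["er", "er", "er"], base_probs, edges, omega=2.0)
mixed = potts_score(["er", "er", "golgi"], base_probs, edges, omega=2.0)
```

In this toy case the base classifier alone prefers `golgi` for the third node, but the agreement bonus on the second edge makes the all-`er` labeling score higher. Dividing by the sum of the scores of all label vectors would give the normalized probability.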

The Voting potential

The voting potential has one argument called the center, while the remaining arguments are called voters. The key idea is that the potential counts the votes of the neighboring nodes, that is, how many voters carry the same label as the center, and this count in turn affects the classification of the center. In this paper each node serves as the center of its own voting potential, and the voters are the nodes adjacent to it in the similarity graph. Assuming that [math]\displaystyle{ N(j) }[/math] is the set of similarity graph neighbors of cell [math]\displaystyle{ j }[/math], let's write the group of cells [math]\displaystyle{ V(j)=\{j\}\cup{N(j)} }[/math]. The voting potential is then defined as follows:

[math]\displaystyle{ \phi_j(x_{V(j)})=\frac{\lambda/n+\sum_{i\in N(j)}I(x_i,x_j)}{|N(j)|+\lambda}\text{ }(5) }[/math]

where [math]\displaystyle{ n }[/math] is the number of classes, [math]\displaystyle{ \lambda }[/math] is a smoothing parameter, and [math]\displaystyle{ I }[/math] is an indicator function:

[math]\displaystyle{ I(x_i,x_j)= \begin{cases} 1 & \text{if } x_i=x_j\\ 0 & \text{otherwise} \end{cases} }[/math]
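Equation (5) translates almost directly into code. The sketch below is illustrative; the function name `voting_potential`, the class names, and the parameter values are assumptions.

```python
def voting_potential(center_label, voter_labels, n_classes, lam):
    """Voting potential of equation (5).

    center_label: label of the center node x_j.
    voter_labels: labels of the neighbors N(j) in the similarity graph.
    n_classes:    number of classes n.
    lam:          smoothing parameter lambda.
    """
    votes = sum(1 for label in voter_labels if label == center_label)
    return (lam / n_classes + votes) / (len(voter_labels) + lam)

# Hypothetical example: three classes, four voters, three of them voting "a".
voters = ["a", "a", "b", "a"]
values = {c: voting_potential(c, voters, n_classes=3, lam=1.0)
          for c in ["a", "b", "c"]}
```

For fixed voter labels the values over all center labels sum to 1, so the potential behaves like a smoothed distribution of the votes; the smoothing term keeps a class with no votes (here `c`) from receiving zero weight.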

The AMN (Associative Markov Network) potential

An associative Markov network (AMN) is defined on a weighted graph and models the joint distribution of the label variables conditioned on the observed features. Each node and each edge of the graph is associated with a potential function. The AMN potential is defined to be:

[math]\displaystyle{ \phi(x_1,...,x_k)=1+\sum_{y=1}^n(\omega_y-1)I(x_1=x_2=...=x_k=y)\text{ }(6) }[/math]

for parameters [math]\displaystyle{ \omega_y\gt 1 }[/math] where I(predicate) is defined to be [math]\displaystyle{ 1 }[/math] if the predicate is true and [math]\displaystyle{ 0 }[/math] if it is false. Therefore, the AMN potential is constant unless all the variables [math]\displaystyle{ x_1...x_k }[/math] are assigned to the same class [math]\displaystyle{ y }[/math].

Decomposable potentials

While k-way factors can lead to more accurate inference, they can also slow down belief propagation: for a general k-way factor, computing a message takes time exponential in k. For specific k-way potentials, though, it is possible to take advantage of special structure to design a fast inference algorithm. In particular, for many potential functions, it is possible to write down an algorithm which efficiently performs sums of the form required for message computation:

[math]\displaystyle{ \sum_{x_1}\sum_{x_2}...\sum_{x_{k-1}}\phi_j^*(x_1,...,x_k)\text{ }(7) }[/math]

[math]\displaystyle{ \phi_j^*(x_1,...,x_k)=m_1(x_1)m_2(x_2)...m_{k-1}(x_{k-1})\phi_j(x_1,...,x_k)\text{ }(8) }[/math]

where [math]\displaystyle{ m_i(x_i) }[/math] is the message to factor [math]\displaystyle{ j }[/math] from variable [math]\displaystyle{ x_i }[/math]. If variables have been collapsed to remove loops from the factor graph, equation (8) includes only a subset of the above messages, and the messages of the collapsed variables are gathered into a single message.

Equations (7) and (8) can be computed quickly in two cases. In the first case, [math]\displaystyle{ \phi_j }[/math] is a sum of terms [math]\displaystyle{ \sum_l\psi_{jl} }[/math], where each term [math]\displaystyle{ \psi_{jl} }[/math] depends only on a small subset of the arguments [math]\displaystyle{ x_1...x_k }[/math]; we then say that [math]\displaystyle{ \phi_j }[/math] is a sum of low-arity terms. In the second case, [math]\displaystyle{ \phi_j }[/math] is constant except at a small number of input vectors [math]\displaystyle{ (x_1,...,x_k) }[/math]; we then say that [math]\displaystyle{ \phi_j }[/math] is sparse. In these cases the function [math]\displaystyle{ \phi_j^* }[/math] of equation (8) can be written as a sum of products of low-arity functions. Writing [math]\displaystyle{ \psi_{jl} }[/math] for a generic term in the sum and [math]\displaystyle{ \xi_{jlm} }[/math] for a generic factor of [math]\displaystyle{ \psi_{jl} }[/math]:

[math]\displaystyle{ \phi_j^*(x_1,...,x_k)=\sum_{l=1}^{L_j}\psi_{jl}(x_1,...,x_k)=\sum_{l=1}^{L_j}\prod_{m=1}^{M_{jl}}\xi_{jlm}(x_{V(j,l,m)})\text{ }(9) }[/math]

Using this equation, the paper shows in detail how BP or LBP can exploit decomposable potentials to accelerate the computation of the belief messages. The same decomposable message-passing scheme is applied to both the voting potential and the AMN potential.
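To make the benefit of sparsity concrete, the sketch below computes the message of equations (7) and (8) for the AMN potential of equation (6) in two ways: by direct summation, and by a shortcut that uses the fact that the potential equals 1 except when all arguments agree. This is an illustration derived from equation (6) under stated assumptions, not the paper's algorithm; the function names and the numeric values are hypothetical.

```python
import math
from itertools import product

def amn_potential(labels, omega):
    """Equation (6): omega[y] if every argument equals y, otherwise 1."""
    first = labels[0]
    return omega[first] if all(l == first for l in labels) else 1.0

def amn_message_brute(incoming, omega, classes):
    """Direct summation: n^(k-1) terms for each label of x_k."""
    message = {}
    for xk in classes:
        total = 0.0
        for others in product(classes, repeat=len(incoming)):
            term = amn_potential(list(others) + [xk], omega)
            for m, value in zip(incoming, others):
                term *= m[value]
            total += term
        message[xk] = total
    return message

def amn_message_fast(incoming, omega, classes):
    """Same message in time linear in k, using the sparsity of the potential."""
    # Constant part: where the potential is 1, the sum factorizes into a
    # product of per-variable sums.
    constant = math.prod(sum(m[c] for c in classes) for m in incoming)
    message = {}
    for xk in classes:
        # Correction for the single input vector where all labels equal xk.
        all_agree = math.prod(m[xk] for m in incoming)
        message[xk] = constant + (omega[xk] - 1.0) * all_agree
    return message

classes = ["a", "b", "c"]
omega = {"a": 2.0, "b": 3.0, "c": 1.5}
incoming = [
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.2, "b": 0.5, "c": 0.3},
    {"a": 0.4, "b": 0.4, "c": 0.2},
]
slow = amn_message_brute(incoming, omega, classes)
fast = amn_message_fast(incoming, omega, classes)
```

Both functions return the same values up to rounding, but the fast version touches each incoming message only a constant number of times per label instead of enumerating every joint setting.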

Prior updating

The idea of prior updating is based on the expectation that messages from a factor [math]\displaystyle{ \phi_j }[/math] to a non-center variable [math]\displaystyle{ x_i }[/math] (where [math]\displaystyle{ i \ne c_j }[/math], with [math]\displaystyle{ c_j }[/math] denoting the center of factor [math]\displaystyle{ j }[/math]) are fairly weak: the overall vote of all of [math]\displaystyle{ x_{c_j} }[/math]'s neighbors will not be influenced very much by [math]\displaystyle{ x_i }[/math]'s single vote. Therefore, there will not be a strong penalty if [math]\displaystyle{ x_i }[/math] votes the wrong way. Prior updating suggests running LBP but ignoring all of the messages from factors to non-center variables.

Experimental results and evaluation

Experiments were conducted to determine the effect of the above potential functions and inference algorithms on classification accuracy in structured classification problems, and to compare the proposed approximate algorithms with their exact counterparts. The following conclusions were drawn from the results:

  • Better classification accuracy can be achieved by moving from the Potts model with its two-way potentials towards models that contain k-way potentials for [math]\displaystyle{ k\gt 2 }[/math].
  • Of the k-way potentials tested, the voting potential is the best for a range of problem types.
  • For small networks where exact inference is feasible, the proposed approximate inference algorithms yield results similar to exact inference at a fraction of the computational cost.
  • For larger networks where exact inference is intractable, the proposed approximate algorithms are still feasible, and structured classification with approximate inference makes it possible to take advantage of the similarity graph to improve classification accuracy.
  • One can reduce the time required to calculate belief messages if the graph is factored.
  • A possibility for future work is a loopy message calculation algorithm with two loops, in which belief messages are approximated in the inner loop before they are given as input to the outer loop.
