graphical models for structured classification, with an application to interpreting images of protein subcellular location patterns (statwiki summary; last edited by Hyeganeh, 2011-11-21)
<hr />
<div>==Background==<br />
<br />
In standard supervised classification problems, the label of each test instance is predicted independently of the labels of all other instances. In some problems, however, we may receive multiple test instances<br />
at a time, along with side information about dependencies among the labels of these instances. For example, if each instance is a handwritten character, the side information might be that the string of<br />
characters forms a common English word; or, if each instance is a microscope image of a cell with a certain protein tagged, the side information might be that several cells share the same tagged protein.<br />
To solve such a structured classification problem in practice, we need both an expressive way to represent our beliefs about the structure, as well as an efficient probabilistic inference algorithm for<br />
classifying new groups of instances.<br />
In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as factor graphs can represent arbitrary dependencies among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models used. A new class of potential functions is proposed, called decomposable k-way potentials, together with efficient algorithms for computing messages from these potentials during belief propagation. These new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. Three instances of decomposable potentials are discussed: the associative Markov network potential, the nested junction tree, and the voting potential. The new representation and algorithms lead to substantial improvements in both inference speed and classification accuracy.<br />
<br />
== Factor Graphs ==<br />
The factor graph representation of a probability distribution describes the relationships among a set of variables <math>x_i</math> using local factors or potentials <math>\phi_j</math>. Each factor depends on only a subset of the variables, and the overall probability distribution is the product of the local factors, together with a normalizing constant Z:<br />
<center><math> P(x) = \frac{1}{Z} \prod_{\text{factors } j} \phi_j(x_{V(j)})</math></center><br />
Here <math>V(j)</math> is the set of variables that are arguments to factor <math>j</math>; for example, if <math>\phi_j</math> depends on <math>x_1, x_3</math>, and <math>x_4</math>, then <math>V(j) = \{1,3,4\}</math> and <math>x_{V(j)} = (x_1, x_3, x_4)</math>.<br />
<br />
Each variable <math>x_i</math> or factor <math>\phi_j</math> corresponds to a node in the factor graph. Fig. 1 shows an example:<br />
the large nodes represent variables, with shaded circles for observed variables and open circles for<br />
unobserved ones. The small square nodes represent factors, and there is an edge between a variable<br />
<math>x_i</math> and a factor <math>\phi_j</math> if and only if <math>\phi_j</math> depends on <math>x_i</math>, that is, when <math>i \in V(j)</math>. (By convention the<br />
graph only shows factors with two or more arguments. Factors with just a single argument are not<br />
explicitly represented, but are implicitly allowed to be present at each variable node.)<br />
The inference task in a factor graph is to combine the evidence from all of the factors to compute<br />
properties of the distribution over <math>x</math> represented by the graph. Naively, we can do inference by<br />
enumerating all possible values of <math>x</math>, multiplying together all of the factors, and summing to compute<br />
the normalizing constant. Unfortunately, the total number of terms in the sum is exponential in the<br />
number of random variables in the graph. So, usually, a better way to perform inference is via a<br />
message-passing algorithm called belief propagation (BP). <br />
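As a concrete sketch of this naive approach (the three binary variables and the factor values below are invented for illustration, not taken from the paper), the following Python snippet multiplies the factors over every joint assignment and sums to obtain <math>Z</math> and a marginal:

```python
from itertools import product

# Two invented pairwise factors over three binary variables x1, x2, x3.
def phi12(x1, x2):
    return 2.0 if x1 == x2 else 1.0

def phi23(x2, x3):
    return 3.0 if x2 == x3 else 1.0

def unnormalized(x):
    x1, x2, x3 = x
    return phi12(x1, x2) * phi23(x2, x3)

# Naive inference: enumerate all 2^3 joint assignments (exponential in
# the number of variables in general) and sum to get Z.
Z = sum(unnormalized(x) for x in product([0, 1], repeat=3))

# A marginal is a ratio of two such sums.
p_x1_0 = sum(unnormalized(x) for x in product([0, 1], repeat=3)
             if x[0] == 0) / Z
print(Z, p_x1_0)  # 24.0 0.5
```

With three binary variables the sum has only 8 terms, but it doubles with every added variable, which is what motivates belief propagation.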
<br />
==Belief Propagation==<br />
To stay consistent with the authors' notation, let <math>\phi_i^{loc}(x_i)</math> denote the one-argument factor that represents the local evidence on <math>x_i</math>. Figure 1 illustrates the notation used in the graphs: the small squares denote potential functions, and, as usual, the shaded and unshaded circles represent observed and unobserved variables respectively.<br />
<br />
[[File:sum_fig1.JPG|center|frame|Fig.1: A probability distribution represented as a factor graph]]<br />
<br />
Using this notation, the message sent from a variable <math>x_i</math> to a potential function <math>\phi_k</math> is computed as:<br />
<br />
<center><math>m_{i \rightarrow k}(x_i)=\phi_i^{loc}(x_i)\prod_{j=1}^{k-1}m_{j \rightarrow i}(x_i)\text{ }(1)</math></center><br />
<br />
Similarly, a message from a potential function <math>\phi_j</math> to <math>x_k</math> can be computed as:<br />
<br />
<center><math>m_{j \rightarrow k}(x_k)=\sum_{x_1}\sum_{x_2}...\sum_{x_{k-1}}\phi_j(x_1,...,x_k)\prod_{i=1}^{k-1}m_{i \rightarrow j}(x_i)\text{ }(2)</math></center><br />
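The two message equations can be sketched in Python for a single pairwise factor between two binary variables; the factor, the local-evidence values, and the variable domains below are all illustrative assumptions:

```python
# Invented pairwise factor and local evidence on binary variables.
def phi(x1, x2):
    return 2.0 if x1 == x2 else 1.0

phi_loc = {0: 0.9, 1: 0.1}  # local evidence on x1 (assumed values)

# Eq. (1): the message from variable x1 to factor phi is its local
# factor times the messages from x1's other neighboring factors
# (there are none in this tiny example).
m_var_to_factor = dict(phi_loc)

# Eq. (2): the message from factor phi to x2 sums x1 out of the factor
# times the incoming variable-to-factor message.
m_factor_to_var = {
    v2: sum(phi(v1, v2) * m_var_to_factor[v1] for v1 in (0, 1))
    for v2 in (0, 1)
}
print(m_factor_to_var)  # approximately {0: 1.9, 1: 1.1}
```

The outgoing message is stronger for the setting that agrees with the local evidence on <math>x_1</math>, which is how evidence propagates through the graph.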
<br />
==General graphs==<br />
The above is easily applied when the graph is tree-shaped. For graphs with loops, there are generally two alternatives. The first is to collapse groups of variable nodes into combined nodes, which can turn the graph into a tree and make it feasible to run belief propagation (BP). When a set of variable nodes is combined, the new node represents all possible settings of all<br />
of the original nodes. For example, if we collapse a variable <math>x_1</math> that has settings <math>T;F</math> with a variable <math>x_2</math> that has settings <math>A;B;C,</math> then the combined variable <math>x_{1,2}</math> has settings <math>TA;TB;TC;FA;FB;FC</math>. The second alternative is to run an approximate inference algorithm that does not require a tree-shaped graph. A further option is to combine both techniques. As an example, a tree-shaped graph can be derived from the graph shown in Figure 1: combining variables <math>x_1</math> and <math>x_2</math> yields the graph in Figure 2. The potentials <math>\phi_{23}</math> and <math>\phi_{123}</math> from the original graph have the same set of neighbors in the new graph, and so can be combined into one factor node. Similarly, the local potentials <math>\phi_{1}^{loc}</math><br />
and <math>\phi_2^{loc}</math> can be combined with the factor <math>\phi_{12}</math> to form a new local potential at the collapsed node <math>x_{12}</math>. Notice that the new factor graph is tree-shaped, even<br />
though the original one had loops.<br />
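The collapsing step for the <math>x_1, x_2</math> example above can be sketched directly:

```python
from itertools import product

# Collapse x1 (settings T, F) and x2 (settings A, B, C) into one
# combined variable. Its domain is the Cartesian product of the two
# original domains, so collapsing many variables at once makes the
# combined domain exponentially large.
x1_settings = ['T', 'F']
x2_settings = ['A', 'B', 'C']
x12_settings = [a + b for a, b in product(x1_settings, x2_settings)]
print(x12_settings)  # ['TA', 'TB', 'TC', 'FA', 'FB', 'FC']
```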
<br />
[[File:sum_fig2.JPG|center|frame|Fig.2: A tree-shaped factor graph representing the graph in Fig.1]]<br />
<br />
==Loopy Belief Propagation (LBP)==<br />
If a graph is collapsed all the way to a tree, inference can be done with the exact version of BP as above. If some loops remain, loopy belief propagation (LBP) is used instead. In LBP (as in BP), an arbitrary node is chosen as the root and formulas (1) and (2) are applied; however, each message may have to be updated repeatedly before the marginals converge. Inference with LBP is approximate because it can double-count evidence: messages to a node <math>i</math> from two nodes <math>j</math> and <math>k</math> can both contain information from a common neighbor <math>l</math> of <math>j</math> and <math>k</math>. If LBP oscillates between some steady states and does not converge, the process can be stopped after a fixed number of iterations. Oscillations can be damped by using momentum, which replaces the messages sent at time <math>t</math> with a weighted average of the messages at times <math>t</math> and <math>t-1</math>.<br />
For either exact or loopy BP, the run time for each pass over the factor graph is exponential in the number of distinct original variables included in the largest factor. Inference can therefore become prohibitively expensive if the factors are too large.<br />
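A minimal sketch of the momentum idea, assuming messages are stored as dictionaries over variable settings (the blending weight and the message values are illustrative):

```python
def damp(m_new, m_old, alpha=0.5):
    """Weighted average of this iteration's and last iteration's message.

    alpha = 1 recovers plain LBP; smaller alpha damps oscillations.
    """
    return {v: alpha * m_new[v] + (1 - alpha) * m_old[v] for v in m_new}

m_old = {0: 1.0, 1: 0.0}   # message sent at time t-1
m_new = {0: 0.0, 1: 1.0}   # freshly computed message at time t (oscillating)
print(damp(m_new, m_old))  # {0: 0.5, 1: 0.5}
```

A message that would flip back and forth between two extremes is pulled toward a fixed point instead.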
<br />
==Constructing factor graphs for structured classification==<br />
To construct factor graphs that encode "likely" label vectors, two steps are performed. First, domain-specific heuristics are used to identify pairs of examples whose labels are likely to be the same, and these pairs are used to build a similarity graph with an edge between each such pair. Second, the similarity graph is used to decide which potentials to add to the factor graph. Given the similarity graph for the protein subcellular location pattern classification problem, factor graphs built using different types of potentials are compared in the following sections. <br />
<br />
===The Potts potential===<br />
The Potts potential is a two-argument factor which encourages two nodes <math>x_i</math> and <math>x_j</math> to have the same label:<br />
<br />
<center><math>\phi(x_i,x_j)= \begin{cases}<br />
\omega & \text{if } x_i=x_j\\<br />
1 & \text{otherwise}<br />
\end{cases} \text{ }(3)<br />
</math></center><br />
<br />
where <math>\omega>1</math> is an arbitrary parameter expressing how strongly <math>x_i</math> and <math>x_j</math> are believed to have the same label. If the Potts potential is used for each edge in the similarity graph, the overall probability of a vector of labels <math>x</math> is as follows:<br />
<br />
<center><math>P(x)=\frac{1}{Z}\prod_{\text{nodes } i}P(x_i)\prod_{\text{edges } i,j}\phi(x_i,x_j)\text{ }(4)</math></center><br />
<br />
where <math>Z</math> is a normalizing constant and <math>P(x_i)</math> represents the probability which the base<br />
classifier assigns to label <math>x_i</math> for node <math>i</math>. Equation (4) is known as a Potts model.<br />
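A minimal Potts model in the spirit of equations (3) and (4) can be sketched as follows; the three-node chain, the base-classifier probabilities, and <math>\omega=2</math> are invented for illustration:

```python
from itertools import product

omega = 2.0  # assumed agreement strength (> 1)
# Invented base-classifier probabilities P(x_i) for three binary labels.
base = [{0: 0.8, 1: 0.2}, {0: 0.5, 1: 0.5}, {0: 0.3, 1: 0.7}]
edges = [(0, 1), (1, 2)]  # similarity-graph edges on a 3-node chain

def potts(xi, xj):
    """Equation (3): omega if the labels agree, 1 otherwise."""
    return omega if xi == xj else 1.0

def unnormalized(x):
    """The product in equation (4), before dividing by Z."""
    p = 1.0
    for i, b in enumerate(base):
        p *= b[x[i]]
    for i, j in edges:
        p *= potts(x[i], x[j])
    return p

Z = sum(unnormalized(x) for x in product([0, 1], repeat=3))

def P(x):
    return unnormalized(x) / Z

print(P((0, 0, 0)) > P((0, 1, 0)))  # True: agreeing labelings are favored
```

The model shifts probability mass toward label vectors in which linked nodes agree, while the base-classifier terms keep each node anchored to its own evidence.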
<br />
===The Voting potential===<br />
The voting potential has one argument called the center, while the remaining arguments are called voters. The key idea is that each neighboring node casts a vote for the center node's label, so the classification of an object is influenced by the labels of its neighbors. In this paper, the center for a node is the node itself, while the voters are the nodes adjacent to it in the similarity graph. Letting <math>N(j)</math> be the set of similarity-graph neighbors of cell <math>j</math>, write the group of cells <math>V(j)=\{j\}\cup N(j)</math>. The voting potential is then defined as follows:<br />
<br />
<center><math>\phi_j(x_{V(j)})=\frac{\lambda/n+\sum_{i\in N(j)}I(x_i,x_j)}{|N(j)|+\lambda}\text{ }(5)</math></center><br />
<br />
where <math>n</math> is the number of classes, <math>\lambda</math> is a smoothing parameter, and <math>I</math> is an indicator function:<br />
<br />
<center><math>I(x_i,x_j)= \begin{cases}<br />
1 & \text{if } x_i=x_j\\<br />
0 & \text{otherwise}<br />
\end{cases}<br />
</math></center><br />
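Equation (5) translates directly into code; the class count, smoothing parameter, and label values below are illustrative:

```python
def voting_potential(center_label, voter_labels, n=2, lam=1.0):
    """Equation (5): fraction of voters agreeing with the center,
    smoothed by lam toward the uniform value 1/n."""
    agree = sum(1 for v in voter_labels if v == center_label)
    return (lam / n + agree) / (len(voter_labels) + lam)

# Three of four voters agree with the center's label:
print(voting_potential(0, [0, 0, 0, 1]))  # (0.5 + 3) / (4 + 1) = 0.7
```

The potential grows smoothly with the number of agreeing voters, unlike the AMN potential below, which only rewards unanimous agreement.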
<br />
===The AMN (Associative Markov Network) potential===<br />
The AMN potential is defined on a weighted graph that encodes the joint distribution of the label variables conditioned on observed features; each node and edge of the graph carries a potential function. The AMN potential is defined to be:<br />
<br />
<center><math>\phi(x_1,...,x_k)=1+\sum_{y=1}^n(\omega_y-1)I(x_1=x_2=...=x_k=y)\text{ }(6)</math></center><br />
<br />
for parameters <math>\omega_y>1</math>, where <math>I(\text{predicate})</math> is defined to be <math>1</math> if the predicate is true and <math>0</math> if it is false. The AMN potential is therefore constant unless all the variables <math>x_1,...,x_k</math> are assigned to the same class <math>y</math>.<br />
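Equation (6) can be sketched as follows; the per-class weights <math>\omega_y</math> are invented for illustration:

```python
def amn_potential(labels, omega=(2.0, 3.0)):
    """Equation (6): omega_y if all labels equal the same class y, else 1."""
    first = labels[0]
    if all(x == first for x in labels):
        return omega[first]  # unanimous agreement on class `first`
    return 1.0               # constant everywhere else

print(amn_potential([1, 1, 1]))  # 3.0: all variables share class 1
print(amn_potential([0, 1, 1]))  # 1.0: constant when any label disagrees
```

Because the potential is constant except at the <math>n</math> unanimous assignments, it is an example of a sparse potential in the sense of the next section.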
<br />
==Decomposable potentials==<br />
While k-way factors can lead to more accurate inference, they can also slow down belief propagation: for a general k-way factor, computing a message takes time exponential in <math>k</math>. For specific k-way potentials, though, it is possible to take advantage of special structure to design a fast inference algorithm. In particular, for many potential functions, it is possible to write down an algorithm which efficiently performs sums of the form required for message computation:<br />
<br />
<center><math>\sum_{x_1}\sum_{x_2}...\sum_{x_{k-1}}\phi_j^*(x_1,...,x_k)\text{ }(7)</math></center><br />
<br />
<center><math>\phi_j^*(x_1,...,x_k)=m_1(x_1)m_2(x_2)...m_{k-1}(x_{k-1})\phi_j(x_1,...,x_k)\text{ }(8)</math></center><br />
<br />
where <math>m_i(x_i)</math> is the message to factor <math>j</math> from variable <math>x_i</math>. If loops are removed from the factor graph, equation (8) would include only a subset of the above messages and the messages of the collapsed variables would be gathered in one message.<br />
<br />
Equations (7) and (8) can be computed quickly if <math>\phi_j</math> is a sum of terms <math>\sum_l\psi_{jl}</math> where each term <math>\psi_{jl}</math> depends only on a small subset of the arguments <math>x_1,...,x_k</math>. A second condition that allows the equations to be computed rapidly is when <math>\phi_j</math> is constant except at a small number of input vectors <math>(x_1,...,x_k)</math>. In the first case we say that <math>\phi_j</math> is a sum of low-arity terms <math>\psi_{jl}</math>, and in the second case we say that <math>\phi_j</math> is sparse. Equation (7) can then be written as a sum of products of low-arity functions, writing <math>\psi_{jl}</math> for a generic term in the sum and <math>\xi_{jlm}</math> for a generic factor of <math>\psi_{jl}</math>:<br />
<br />
<center><math>\phi_j^*(x_1,...,x_k)=\sum_{l=1}^{L_j}\psi_{jl}(x_1,...,x_k)=\sum_{l=1}^{L_j}\prod_{m=1}^{M_{jl}}\xi_{jlm}(x_{V(j,l,m)})\text{ }(9)</math></center><br />
<br />
Using this decomposition, the paper shows in detail how BP or LBP can exploit decomposable potentials to accelerate the computation of belief messages. Specialized message-passing algorithms are derived for both the voting potential and the AMN potential.<br />
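To see why decomposability helps, the following sketch compares naive and factored evaluation for a toy potential that is a sum of single-variable terms (a special case of equation (9)); the term values and messages are invented, and for simplicity the sum runs over all <math>k</math> variables:

```python
import math
from itertools import product

k = 6
# phi(x_1..x_k) = sum_l psi_l(x_l): a sum of single-variable
# ("low-arity") terms. Term values and incoming messages are invented.
psi = [{0: 1.0, 1: 2.0 + l} for l in range(k)]
m = [{0: 0.3, 1: 0.7} for _ in range(k)]

def phi(x):
    return sum(psi[l][x[l]] for l in range(k))

# Naive: enumerate all 2^k joint assignments.
naive = sum(phi(x) * math.prod(m[i][x[i]] for i in range(k))
            for x in product([0, 1], repeat=k))

# Fast: distribute the sums through the decomposition,
#   sum_x prod_i m_i(x_i) sum_l psi_l(x_l)
#     = sum_l [sum_{x_l} m_l(x_l) psi_l(x_l)] prod_{i != l} [sum_{x_i} m_i(x_i)]
s = [m[i][0] + m[i][1] for i in range(k)]                      # message masses
t = [m[i][0] * psi[i][0] + m[i][1] * psi[i][1] for i in range(k)]
fast = sum(t[l] * math.prod(s[i] for i in range(k) if i != l)
           for l in range(k))

print(abs(naive - fast) < 1e-9)  # True: same value, O(k^2) vs O(2^k) work
```

The factored evaluation touches each variable a constant number of times per term, so large factors no longer force exponential message computations.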
<br />
==Prior updating==<br />
The idea of prior updating is based on the expectation that messages from a factor <math>\phi_j</math> to a non-center variable <math>x_i</math> (where <math>i \ne c_j</math>) will be fairly weak: the overall vote of all of <math>x_{c_j}</math>'s neighbors will not be influenced much by <math>x_i</math>'s single vote, so there is no strong penalty if <math>x_i</math> votes the wrong way. Prior updating therefore runs LBP but ignores all of the messages from factors to non-center variables.<br />
<br />
==Experimental results and evaluation==<br />
After conducting experiments to determine the effect of the above-mentioned potential functions and inference algorithms on classification accuracy in structured classification problems, and after comparing the proposed approximate algorithms to their exact counterparts, the following conclusions were drawn:<br />
<br />
* Better classification accuracy can be achieved by moving from the Potts model with its two-way potentials towards models that contain k-way potentials for <math>k>2</math>.<br />
* Of the k-way potentials tested, the voting potential is the best for a range of problem types. <br />
* For small networks where exact inference is feasible, the proposed approximate inference algorithms yield results similar to exact inference at a fraction of the computational cost.<br />
* For larger networks where exact inference is intractable, the proposed approximate algorithms are still feasible, and structured classification with approximate inference makes it possible to take advantage of the similarity graph to improve classification accuracy.<br />
* One can reduce the time required to calculate belief messages if the graph is factored.<br />
* Another future possibility is using loopy message calculation algorithm which has two loops. Belief messages are approximated in the inner loop before they are given as input to the outer loop.</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=graphical_models_for_structured_classification,_with_an_application_to_interpreting_images_of_protein_subcellular_location_patterns&diff=15084graphical models for structured classification, with an application to interpreting images of protein subcellular location patterns2011-11-21T19:32:25Z<p>Hyeganeh: /* The Potts potential */</p>
<hr />
<div>==Background==<br />
<br />
In standard supervised classification problems, the label of each unknown class is independent of the labels of all other instances. In some problems, however, we may receive multiple test instances<br />
at a time, along with side information about dependencies among the labels of these instances. For example, if each instance is a handwritten character, the side information might be that the string of<br />
characters forms a common English word; or, if each instance is a microscope image of a cell with a certain protein tagged, the side information might be that several cells share the same tagged protein.<br />
To solve such a structured classification problem in practice, we need both an expressive way to represent our beliefs about the structure, as well as an efficient probabilistic inference algorithm for<br />
classifying new groups of instances.<br />
In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as factor graphs can represent arbitrary dependencies among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models used. A new class of potential functions is proposed, which is called decomposable k-way potentials. It provides efficient algorithms for computing messages from these potentials during belief propagation. These new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. Three instances of decomposable potentials are discussed: the associative Markov network potential, the nested junction tree, and the voting potential. The new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy.<br />
<br />
== Factor Graphs ==<br />
The factor graph representation of a probability distribution describes the relationships among a set of variables <math>x_i</math> using local factors or potentials <math>\phi_j</math>. Each factor depends on only a subset of the variables, and the overall probability distribution is the product of the local factors, together with a normalizing constant Z:<br />
<center><math> P(x) = \frac{1}{Z} \prod_{factors j} \phi_j(x_{V_j})</math></center><br />
Here <math>V(j)</math> is the set of variables that are arguments to factor <math>j</math>; for example, if <math>\phi_j</math> depends on <math>x_1, x_3</math>,and <math>x_4</math>, then <math>V(j) = \{1,3,4\}</math> and <math>x_{V(j)} = (x_1, x_3, x_4)</math>.<br />
<br />
Each variable <math>x_i</math> or factor <math>\phi_j</math> corresponds to a node in the factor graph. Fig. 1 shows an example:<br />
the large nodes represent variables, with shaded circles for observed variables and open circles for<br />
unobserved ones. The small square nodes represent factors, and there is an edge between a variable<br />
<math>x_i</math> and a factor <math>\phi_j</math> if and only if <math>\phi_j</math> depends on <math>x_i</math>, that is, when <math>i \in V(j)</math>. (By convention the<br />
graph only shows factors with two or more arguments. Factors with just a single argument are not<br />
explicitly represented, but are implicitly allowed to be present at each variable node.)<br />
The inference task in a factor graph is to combine the evidence from all of the factors to compute<br />
properties of the distribution over <math>x</math> represented by the graph. Naively, we can do inference by<br />
enumerating all possible values of <math>x</math>, multiplying together all of the factors, and summing to compute<br />
the normalizing constant. Unfortunately, the total number of terms in the sum is exponential in the<br />
number of random variables in the graph. So, usually, a better way to perform inference is via a<br />
message-passing algorithm called belief propagation (BP). <br />
<br />
==Belief Propagation==<br />
Just to stick with the notion of the authors, let's assume that <math>\phi_i^{loc}(x_i)</math> is the one-argument factor that represents the local evidence on <math>x_i</math>. Moreover, Figure 1 shows the notion they use in graphs. The small squares denote potential functions, and, as usual, the shaded and unshaded circles represent observed and unobserved variables respectively.<br />
<br />
[[File:sum_fig1.JPG|center|frame|Fig.1: A probability distribution represented as a factor graph]]<br />
<br />
Using such notion, the message sent from a variable <math>x_i</math> to a potential function <math>\phi_k</math> as:<br />
<br />
<center><math>m_{i \rightarrow k}(x_i)=\phi_i^{loc}(x_i)\prod_{j=1}^{k-1}m_{j \rightarrow i}(x_i)\text{ }(1)</math></center><br />
<br />
Similarly, a message from a potential function <math>\phi_j</math> to <math>x_k</math> can be computed as:<br />
<br />
<center><math>m_{j \rightarrow k}(x_k)=\sum_{x_1}\sum_{x_2}...\sum_{x_{k-1}}\phi_j(x_1,...,x_k)\prod_{i=1}^{k-1}m_{i \rightarrow j}(x_i)\text{ }(2)</math></center><br />
<br />
==General graphs==<br />
The above is easily applied when the graph is tree-shaped. For graphs with loops, there are generally two alternatives, the first is to collapse groups of variable nodes together into combined nodes, which could turn the graph into a tree and makes it feasible to run Belief Propagation (BP). When a set of variable nodes are combined, the new node represents all possible settings of all<br />
of the original nodes. For example, if we collapse a variable <math>x_1</math> that has settings <math>T;F</math> with a variable <math>x_2</math> that has settings <math>A;B;C,</math> then the combined variable <math>x_{1,2}</math> has settings <math>TA;TB;TC;FA;FB;FC</math>. The second is to run an approximate inference algorithm that doesn't require a tree-shaped graph. One further solution is to combine both techniques. An example is to derive a tree-shaped graph for the graph shown in Figure 1. Figure 2 combines variables <math>x_1</math> and <math>x_2</math> to form the graph in Figure 2. The potentials <math>\phi_{23}</math> and <math>\phi_{123}</math> from the original graph have the same set of neighbors in the new graph, and so can be combined into one factor node. Similarly, the local potentials <math>\phi_{1}^{local}</math><br />
and <math>\phi_2^{loc}</math> can be combined with the factor <math>\phi_{12}</math> to form a new local potential at the collapsed node <math>x_{12}</math>. Notice that the new factor graph is tree-shaped, even<br />
though the original one had loops.<br />
<br />
[[File:sum_fig2.JPG|center|frame|Fig.2: A tree-shaped factor graph representing the graph in Fig.1]]<br />
<br />
==Loopy Belief Propagation (LBP)==<br />
If a graph is collapsed all the way to a tree, inference can be done with the exact version of BP as above. If there are still some loops left, it's LBP that should be used. In LBP (as in BP), an arbitrary node is chosen to be the root and formulas 1 & 2 are used. However, each message may have to be updated repeatedly before the marginals converge. Inference with LBP is approximate because it can double-count evidence; messages to a node <math>i</math> from two nodes <math>j</math> and <math>k</math> can both contain information from a common neighbor <math>l</math> of <math>j</math> and <math>k</math>. If LBP oscillates between some steady states and does not converge, the process could be stopped after some number of iterations. Oscillations can be avoided by using momentum, which replaces the messages that were sent at time <math>t</math> with a weighted average of the messages at times <math>t</math> and <math>t-1</math>.<br />
For either exact or loopy BP, run time for each path over the factor graph is exponential in the number of distinct original variables included in the largest factor. Therefore, inference can become prohibitively expensive if the factors are too large.<br />
<br />
==Constructing factor graphs for structured classification==<br />
To construct factor graphs that encode "likely" label vectors, two steps are performed. First, domain specific heuristics are used to identify pairs of examples whose labels are likely to be the same in order to use such pairs to build a similarity graph with an edge between each pair of examples. The second step is to use this similarity graph to decide which potentials to add to the factor graph. Given the similarity graph of the protein subcellular location pattern classification problem, factor graphs built using different types of potentials are compared as we will see in the following sections. <br />
<br />
===The Potts potential===<br />
The Potts potential is a two-argument factor which encourages two nodes <math>x_i</math> and <math>x_j</math> to have the same label:<br />
<br />
<center><math>\phi(x_i,x_j)= \begin{cases}<br />
\omega & \text{ }x_i=x_j\\<br />
1 & \text{ }otherwise\\<br />
\end{cases} \text{ }(3)<br />
</math></center><br />
<br />
whereas <math>\omega>1</math> is an arbitrary parameter expressing how strongly <math>x_i</math> and <math>x_j</math> are believed to have the same label. If the Potts potential is used for each edge in the similarity graph, the overall probability of a vector of labels x is as follows:<br />
<br />
<center><math>P(x)=\frac{1}{z}\prod_{nodes\text{ }i}P(x_i)\prod_{edges\text{ }i,j}\phi(x_i,x_j)\text{ }(4)</math></center><br />
<br />
where <math>Z</math> is a normalizing constant and <math>P(xi)</math> represents the probability which the base<br />
classifier assigns to label <math>x_i<math> for node <math>i</math>. The equation 4 is known as a Potts model.<br />
<br />
===The Voting potential===<br />
The voting potential has an argument called the center, while the remaining arguments are called voters. In this paper, the center for a node is the node itself while the voters are the nodes adjacent to it in the similarity graph. Assuming that <math>N(j)</math> is the set of similarity graph neighbors of cell <math>j</math>, let's write the group of cells <math>V(j)=\{j\}\cup{N(j)}</math>. The voting potential is then defined as follows:<br />
<br />
<center><math>\phi_j(X_{V(j)})=\frac{\lambda/n+\sum_{i\in{N(j)}I(x_i,x_j)}}{|N(j)|+\lambda}\text{ }(5)</math></center><br />
<br />
whereas <math>n</math> is the number of classes, <math>\lambda</math> is a smoothing parameter and <math>I</math> is an indicator function:<br />
<br />
<center><math>I(x_i,x_j)= \begin{cases}<br />
1 & \text{ }if \text{ }x_i=x_j\\<br />
0 & \text{ }otherwise\\<br />
\end{cases}<br />
</math></center><br />
<br />
===The AMN (Associative Markov Network) potential===<br />
AMN potential is defined on a weighted graph that addresses joint distribution of random variables constrained on observed features. Each node and edge on the graph is given by a potential function. AMN potential is defined to be:<br />
<br />
<center><math>\phi(x_1,...,x_k)=1+\sum_{y=1}^n(\omega_y-1)I(x_1=x_2=...=x_k=y)\text{ }(6)</math></center><br />
<br />
for parameters <math>\omega_y>1</math> where I(predicate) is defined to be <math>1</math> if the predicate is true and <math>0</math> if it is false. Therefore, the AMN potential is constant unless all the variables <math>x_1...x_k</math> are assigned to the same class <math>y</math>.<br />
<br />
==Decomposable potentials==<br />
while k-way factors can lead to more accurate inference, they can also slow down belief propagation. For a general k-way factor, it takes time exponential in k. For specific k-way potentials though, it is possible to take advantage of special structure to design a fast inference algorithm. In particular, for many potential functions, it is possible to write down an algorithm which efficiently performs sums of the form required for message computation:<br />
<br />
<center><math>\sum_{x_1}\sum_{x_2}...\sum_{k-1}\phi_j^*(x_1,...,x_k)\text{ }(7)</math></center><br />
<br />
<center><math>\phi_j^*(x_1,...,x_k)=m_1(x_1)m_2(x_2)...m_k(x_{k-1})\phi_j(x_1,...,x_k)\text{ }(8)</math></center><br />
<br />
where <math>m_i(x_i)</math> is the message to factor <math>j</math> from variable <math>x_i</math>. If loops are removed from the factor graph, equation (8) would include only a subset of the above messages and the messages of the collapsed variables would be gathered in one message.<br />
<br />
Equations (7) & (8) can be computed quickly if <math>\phi_j</math> is a sum of terms <math>\sum\psi_{jl}</math> where each term <math>\psi_{jl}</math> depends only on a small subset of its arguments <math>x_1...x_k</math>. There is one more condition that, when found, could cause the above equations to be computed rapidly; that is when <math>\phi_j</math> is a constant except at a small number of input vectors <math>(x_1,...,x_k)</math>. In the first case, let's say that <math>\phi_j</math> is a sum of low-arity terms <math>\psi_jl</math> and in the second case let's say that <math>\phi_j</math> is sparse. Equation (7) can then be written as a sum of products of low-arity functions: writing <math>\psi_{jl}</math> for a generic term in the sum and <math>\xi_{jlm}</math> for a generic factor of <math>\psi_{jl}</math>:<br />
<br />
<center><math>\phi_j^*(x_1,...,x_k)=\sum_{l=1}^{L_j}\psi_{jl}(x_1,...,x_k)=\sum_{l=1}^{L_j}\prod_{m=1}^{M_{jl}}\xi_{jlm}(x_{V(j,l,m)})\text{ }(9)</math></center><br />
<br />
Using this equation, the paper shows in detail how BP or LBP could use the decomposable potentials in order to accelerate the computation of the belief messages. Message passing using Decomposable potentials is used as well with the voting potential, the AMN potential.<br />
<br />
==Prior updating==<br />
The idea of prior updating is based on the expectation that messages from a factor <math>\phi_j</math> to a non-centered variable <math>x_i</math> (where <math>i \ne c_j</math>) to be fairly weak; the overall vote of all of <math>x_{c_j}</math>'s neighbors will not be influenced very much by <math>x_i</math>'s single vote. Therefore, there will not be a strong penalty if <math>x_i</math> votes the wrong way. Prior updating suggests running LBP but ignoring all of the messages from factors to non-centered variables.<br />
<br />
==Experimental results and evaluation==<br />
After conducting experiments to determine the effect of the above mentioned potential functions and inference algorithms on the classification accuracy in structured classification problems, and after comparing the proposed approximate algorithms to their exact counterparts, the following was concluded from the results obtained:<br />
<br />
* Better classification accuracy can be achieved by moving from the Potts model with its two-way potentials towards models that contain k-way potentials for <math>k>2</math>.<br />
* Of the k-way potentials tested, the voting potential is the best for a range of problem types. <br />
* For small networks where exact inference is feasible, the proposed approximate inference algorithms yield results similar to exact inference at a fraction of the computational cost.<br />
* For larger networks where exact inference is intractable, the proposed approximate algorithms are still feasible, and structured classification with approximate inference makes it possible to take advantage of the similarity graph to improve classification accuracy.</div>Hyeganeh
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=an_HDP-HMM_for_Systems_with_State_Persistence&diff=15074an HDP-HMM for Systems with State Persistence2011-11-21T14:25:47Z<p>Hyeganeh: </p>
<hr />
<div>== Introduction == <br />
=== The Big Picture ===<br />
The Hidden Markov Model is one of the most effective and widely used probabilistic models for time-series data. A serious limitation of this model is the need to specify the number of states in advance, a number that cannot easily be determined in real-life applications. The usual way to choose it is to try several different numbers of states and pick the one that gives the best results (a trial-and-error approach). This limitation can be overcome by a class of probabilistic models called “Bayesian nonparametric models”, which approach the problem by defining distributions that can have an infinite number of parameters. The assumption, however, is that even though the distribution may have infinitely many parameters, only a finite number of them is required to explain the observed data.<br />
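The claim that an infinite-parameter distribution still uses only finitely many parameters in practice can be seen directly from the stick-breaking construction of Dirichlet-process weights. In the sketch below (the truncation level and concentration value are illustrative), almost all of the probability mass falls on a handful of components:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0   # concentration parameter (illustrative)
T = 1000      # truncation level for the sketch

# stick breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{l<k}(1 - beta_l)
sticks = rng.beta(1.0, alpha, size=T)
weights = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))

# number of components needed to cover 99% of the probability mass
n_eff = int(np.searchsorted(np.cumsum(np.sort(weights)[::-1]), 0.99) + 1)
```

Even with a truncation of 1000 "parameters", `n_eff` comes out as a small number, which is the sense in which only finitely many parameters are needed for observed data.<br />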
<br />
A huge breakthrough in the possible applications of Bayesian nonparametric methods occurred after the publication of the paper “Hierarchical Dirichlet Processes (HDP)” by Yee Whye Teh, Michael I. Jordan, et al. This paper describes a way to define a nonparametric Bayesian prior that allows atoms (which can be seen as components of mixture models) to be shared between groups of data (draws from mixture models). One application of this model was an extension of the Hidden Markov Model. The new model, named the HDP-HMM, allows the number of states to be infinite.<br />
<br />
The paper reviewed here proposes an augmented version of the HDP-HMM that addresses the problem of state persistence. This issue also arises in the original Hidden Markov Model: in many applications the state tends not to change from one time step to the next, i.e., the probability of staying in the same state is larger than the probability of transitioning to a new state, and a model should capture this tendency. The authors not only provide a solution to this problem, but also give it a full Bayesian treatment.<br />
<br />
=== Speaker Diarization Problem ===<br />
The task of segmenting or annotating an audio recording into temporal segments, where each segment corresponds to a specific event or speaker, is called Speaker Diarization. One example of where Speaker Diarization could be useful is in analyzing a few minutes of recorded audio from a radio broadcast. These few minutes could consist of commercials, intro music, speech by the host, and speech by the guest. The Speaker Diarization problem on such an audio recording could be defined as identifying speech and non-speech segments. Audio recordings of meetings with a known or unknown number of participants are another example of the use of Speaker Diarization. In this case the task is to answer the question "who spoke when". <ref>Tranter, S.E. and Reynolds, D.A., An Overview of Automatic Speaker Diarization Systems, 2006</ref><br />
<br />
[[File:Sd1.jpg]]<br />
<br />
The most common technique for solving the Speaker Diarization problem, which has also shown better results than the alternatives, is the Hidden Markov Model. In this setting, each speaker is associated with a state in the HMM, and transitions among these states represent transitions among speakers. However, such a model suffers from a serious limitation: it requires the number of speakers to be known in advance, which is needed to design the structure of the model. <br />
<br />
<br />
The authors of this paper propose a solution to the Speaker Diarization problem in situations where prior knowledge of the number of participants is unavailable. Their solution is based on a slightly modified version of an interesting Bayesian nonparametric model called the “Hierarchical Dirichlet Process–Hidden Markov Model (HDP-HMM)”. The proposed modified version, named the “Sticky HDP-HMM”, imposes state persistence on the model.<br />
<br />
== Background ==<br />
<br />
=== Nonparametric Bayesian Models ===<br />
The Nonparametric Bayesian Models considered in this discussion can be defined (informally) as follows. The word nonparametric does not mean that these models have no parameters; on the contrary, it means that they have infinite-dimensional parameter spaces. The fact that these models are Bayesian means that we define probability measures over them. <br />
<br />
=== Dirichlet Process ===<br />
The Dirichlet Process is an example of a Nonparametric Bayesian Model. It is a generalization of the Dirichlet Distribution in which the number of parameters can be infinite. The Dirichlet Distribution can be seen as a distribution over distributions on N outcomes; thus, a draw from a Dirichlet Distribution is itself a probability distribution. The Dirichlet Process has the same characteristic: it is a distribution over distributions, but the difference is that the number of outcomes N can grow to infinity. The DP is commonly used as a prior on the parameters of a mixture model of unknown complexity, resulting in a Dirichlet Process Mixture Model (DPMM).<br />
<br />
==== Dirichlet Distribution ====<br />
Let <math>\theta = \{ \theta_1, \theta_2, ..., \theta_m\}</math> such that <math>\theta</math>~<math> Dirichlet(\alpha_1, \alpha_2, ..., \alpha_m)</math>, then the distribution is defined by:<br />
<math>\, P(\theta_1, \theta_2, \theta_3 ,...,\theta_m) = \frac{\Gamma (\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{m} \theta_k^{\alpha_k - 1}</math>. As an example, the Beta distribution is the special case of the Dirichlet distribution in two dimensions. In fact, the Dirichlet distribution is a "distribution over distributions".<br />
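As a quick illustration (code of our own, not from the paper), a draw from a Dirichlet distribution can be simulated by normalizing independent Gamma draws; each sample is itself a probability vector over the m outcomes:

```python
import random

def sample_dirichlet(alphas, rng=random):
    """Draw one sample from Dirichlet(alphas): normalize independent
    Gamma(alpha_k, 1) draws so the components sum to 1."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(0)
theta = sample_dirichlet([1.0, 2.0, 3.0])
print(theta)       # a point on the probability simplex
print(sum(theta))  # components sum to 1
```

Setting m = 2 reduces this to drawing a Beta-distributed pair, matching the remark above.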
<br />
A Dirichlet process <math>DP(H,\alpha)</math> is defined using two parameters. The first parameter, <math>H</math>, is a base distribution. This parameter can be considered as the mean of the Dirichlet Process. The second parameter, <math>\alpha</math>, is called the concentration parameter.<br />
<br />
As described above, a draw from a Dirichlet Process is a discrete probability measure, regardless of whether the base distribution is discrete or continuous. This fact is one of the most important properties of the Dirichlet Process; it assures that values drawn from the Dirichlet Process can be repeated.<br />
<br />
A stick breaking construction of the Dirichlet Process was introduced in (Sethuraman, 1994), as follows: <br />
<br />
<math>\, \beta_k = \beta_{k}^{'} \prod_{l=1}^{k-1}(1-\beta_{l}^{'})</math><br />
<br />
<math>\, \beta_{k}^{'}</math>~<math>\, Beta(1,\gamma)</math><br />
<br />
<math>\, G_0 = \sum_{k=1}^{\infty} \beta_k \delta(\theta - \theta_k)</math><br />
<br />
<math>\, \theta_k</math>~<math>\, H</math><br />
<br />
This specific construction of the weights is usually denoted by:<br />
<math>\, \beta</math>~<math>\, GEM(\gamma)</math><br />
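A minimal sketch of this stick-breaking construction (our own code; it truncates once the leftover stick length is negligible rather than drawing infinitely many weights):

```python
import random

def gem_weights(gamma, tol=1e-10, rng=random):
    """Stick-breaking (GEM) construction: repeatedly break off a
    Beta(1, gamma)-distributed fraction of the remaining stick."""
    weights, stick = [], 1.0
    while stick > tol:
        b = rng.betavariate(1.0, gamma)   # beta'_k ~ Beta(1, gamma)
        weights.append(stick * b)         # beta_k = beta'_k * prod(1 - beta'_l)
        stick *= (1.0 - b)                # length of stick left over
    return weights

random.seed(1)
w = gem_weights(gamma=2.0)
print(len(w), sum(w))  # the weights sum to (almost) 1
```

Since the weights sum to one, pairing each beta_k with an atom theta_k drawn from H gives the discrete measure G_0 above.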
<br />
=== Hierarchical Dirichlet Processes ===<br />
The hierarchical Dirichlet process (HDP) (Teh et al.,2006) extends the DP to cases in which groups of data are produced by related, yet unique, generative processes.<br />
Two Dirichlet Processes can be used together in a recursive way, such that a draw from the first one is used as the base distribution of the second one. Formally, <br />
<br />
<math>\, G </math> ~ <math>\, DP(\alpha, G_0) </math><br />
<br />
<math>\, G_0 </math> ~ <math>\, DP(\gamma, H)</math>.<br />
<br />
Doing so allows us to share atoms. The analogy here is that we are modeling grouped data, with each group modeled by its own mixture model, while the mixture components (atoms) are shared among the groups.<br />
<br />
<br />
<br />
== Sticky HDP-HMM ==<br />
<br />
=== HMM ===<br />
One way to look at the HMM is as a doubly stochastic Markov chain. It consists of a set of state variables, each usually defined as a multinomial random variable, linked together by a transition matrix. The observations are independent of each other, each conditioned only on the state variable to which it belongs.<br />
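A toy simulation (transition and emission numbers of our own choosing) of this doubly stochastic structure; the hidden state evolves via the transition matrix, and each observation is drawn conditioned only on the current state:

```python
import random

def sample_hmm(pi0, trans, emit, T, rng=random):
    """Simulate a doubly stochastic chain: hidden states follow a Markov
    chain (rows of `trans`); each observation depends only on its state."""
    def draw(probs):
        u, c = rng.random(), 0.0
        for i, p in enumerate(probs):
            c += p
            if u < c:
                return i
        return len(probs) - 1

    s = draw(pi0)
    states, obs = [], []
    for _ in range(T):
        states.append(s)
        obs.append(draw(emit[s]))   # observation given current state
        s = draw(trans[s])          # next state given current state
    return states, obs

random.seed(2)
states, obs = sample_hmm([0.5, 0.5],
                         [[0.9, 0.1], [0.2, 0.8]],   # persistent transitions
                         [[0.8, 0.2], [0.3, 0.7]], T=20)
print(states)
```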
<br />
=== HDP-HMM ===<br />
In [Ref::Teh05], an extension to the HMM was introduced. This extension allows the number of states to be infinite, using the Hierarchical Dirichlet Process. It works as follows. Let <math>\, S_t</math> denote the state of the HMM at time t. As this variable follows a Markov chain, its value is drawn from a distribution conditioned on the value of the previous state: <math>\, S_t </math> ~ <math>\, \pi_{S_{t-1}}</math>. The value of <math>\, S_t</math> is then used to draw an observation, i.e. <math>\, y_t</math> ~ <math>\, F(\theta_{S_t})</math>. Now, by defining <math>\, \pi_{k} </math> as <math>\, \pi_{k} </math> ~ <math>\, HDP(.)</math>, we can have infinitely many states shared among all the state variables. <ref><br />
Teh et al., 2006. Hierarchical Dirichlet Processes<br />
</ref><br />
<br />
=== Sticky HDP-HMM ===<br />
Although substituting the distribution of <math>\, \pi_{k} </math> with a hierarchical nonparametric Bayesian prior provides a way to have an infinite number of states, the problem of state persistence described above still exists. It becomes even more serious under this setting, as the HDP will generate more states to explain all the transitions: instead of assigning a larger probability to staying in the same state, new redundant states are generated.<br />
The next figure shows an example of this problem. The right side of the figure shows the true state sequence, in which only three states (labeled using different colors) explain the observed data. The left side, on the other hand, shows the inferred state sequence for the same data using the HDP-HMM. In this case, the model split some of the true states into more than one state and rapidly switches between them. <br />
<br />
[[File:Hdp-hmm_problem.jpg]]<br />
<br />
To overcome this problem, the authors proposed a modification to the transition distribution to be as follows: <br />
<br />
<math>\, \pi_j</math> ~ <math>\, DP\left( \alpha + k, \frac{\alpha \beta + k \delta_j}{\alpha + k}\right)</math><br />
<br />
The new term <math>\, k \delta_j</math> adds the amount k to the jth component of the base distribution, increasing the prior probability of a self-transition from state j to itself. Note that the original HDP-HMM is recovered from this model by setting k = 0.<br />
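The effect of the extra self-transition mass can be sketched with the truncated ("weak-limit") approximation of the DP, an approximation not covered in this summary: with L states, beta ~ Dir(gamma/L, ..., gamma/L) and each row pi_j ~ Dir(alpha*beta + k*delta_j). A small Python sketch (our own code, toy hyperparameters; `kappa` plays the role of k in the text):

```python
import random

def dirichlet(alphas, rng=random):
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def sticky_transition_rows(gamma, alpha, kappa, L, rng=random):
    """Weak-limit sketch: beta ~ Dir(gamma/L, ..., gamma/L), then
    pi_j ~ Dir(alpha*beta + kappa*delta_j); kappa boosts self-transitions."""
    beta = dirichlet([gamma / L] * L, rng)
    rows = []
    for j in range(L):
        conc = [alpha * b + (kappa if k == j else 0.0)
                for k, b in enumerate(beta)]
        rows.append(dirichlet(conc, rng))
    return rows

random.seed(3)
pi = sticky_transition_rows(gamma=1.0, alpha=1.0, kappa=10.0, L=4)
# with a large kappa the diagonal (self-transition) entries dominate
print([round(pi[j][j], 2) for j in range(4)])
```

Setting `kappa=0.0` recovers the plain weak-limit HDP-HMM rows, mirroring the remark above.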
<br />
== References ==<br />
<references/></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14770markov Random Fields for Super-Resolution2011-11-13T15:38:32Z<p>Hyeganeh: </p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer an unobserved image, called the “scene”, from the observed “image”. Typically, estimating the entire “scene” at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate an initial guess for the “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. Freeman et al. first introduced a probabilistic approach <ref name="R1"> W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, "Learning Low-Level Vision", International Journal of Computer Vision, 40(1), pp. 25-47, 2000.</ref> <ref name="R2"> W. Freeman and E. Pasztor, "Markov Networks for Super-Resolution", in Proc. of 34th Annual Conference on Information Sciences and Systems, Princeton University, March 2000.</ref> in which they exploit a training method using “image”/“scene” pairs and apply Bayesian inference on graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on the super-resolution application, where the problem is to estimate high-resolution details from low-resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is the Markov network. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability is <math> P(x|y)= cP(x,y)</math>, where the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error (MMSE) or the maximum a posteriori (MAP) estimate. Without any approximation, <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches, and one node of the Markov network is assigned to each patch. Figure 1 [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for this problem, where nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks, the unknown parameters should be learned from training data in a learning phase; then, in an inference phase, the “scene” estimate can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k)<br />
</math> (1)</center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and they are learned from training data. In this paper the authors prefer to call these functions compatibility functions. One can then write the MAP and the MMSE estimates for <math>\hat{x}_j</math> by taking the maximum over, or marginalizing out, all other variables in the posterior probability, respectively. For discrete variables the expressions are:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} x_j \sum_{x_i, i \neq j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)<br />
</math> (2)</center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \left( \max_{x_i, i \neq j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)\right)<br />
</math> (3)</center><br />
<br />
For large networks, Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for networks which are trees or chains.<br />
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loops, inference reduces to a simple “message-passing” rule which enables us to compute the MAP and MMSE estimates <ref name="R3"> Jordan, M.I. (Ed.). 1998. Learning in Graphical Models. MIT Press: Cambridge, MA </ref>. For example, for the network in Figure 2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimate for node 1 is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{1MAP} & = & \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3)\\<br />
& = & \arg\max_{x_1} \max_{x_2} \max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x_2,x_3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) \max_{x_2} \left( \psi(x_1,x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)\right)\\<br />
\end{matrix}</math> (4)</center><br />
<br />
Similar expressions hold for <math>x_{2MAP}</math> and <math>x_{3MAP}</math>. Equations (2) and (3) can be computed by iterating the following steps. The MAP estimate at node j is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j<br />
</math> (5)</center><br />
<br />
where k runs over all “scene” node neighbors of node j, and <math> M^k_j </math> is the message from node k to node j. The message <math>M^k_j</math> is calculated by:<br />
<br />
<center><math><br />
M^k_j = \max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{l \neq j} \hat{M}^l_k <br />
</math> (6)</center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^l_k</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vectors of 1’s, with the same dimension as <math>x_j</math>. Because the initial messages are all 1’s, at the first iteration the messages in the network are:<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)<br />
</math> (7)</center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)<br />
</math> (8)</center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1)<br />
</math> (9)</center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)<br />
</math> (10)</center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6) :<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2<br />
</math> (11)</center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)<br />
</math> (12)</center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 <br />
</math> (13)</center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1)<br />
</math> (14)</center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) * max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3)<br />
</math> (15)</center><br />
<br />
Eventually the MAP estimate for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1<br />
</math> (16)</center><br />
<br />
The MMSE estimate, Eq(2), has analogous formulas, with the <math>\max_{x_k}</math> of Eq(6) replaced by <math>\sum_{x_k}</math> and the <math>\arg\max_{x_j}</math> of Eq(5) replaced by <math>\sum_{x_j} x_j</math>.<br />
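As a sanity check (a toy example of our own, with binary states and fixed numeric compatibilities), the max-product messages of Eqs. (5)-(6) recover the brute-force MAP assignment on a three-node chain like the one in Figure 2:

```python
from itertools import product

# toy 3-node chain: psi rewards neighboring states that agree,
# phi[node][state] encodes the local evidence (y is fixed)
psi = {(a, b): [[1.0, 0.2], [0.2, 1.0]][a][b] for a in (0, 1) for b in (0, 1)}
phi = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]

# brute force: maximize the product of all compatibilities (Eq. 1)
def joint(x):
    p = phi[0][x[0]] * phi[1][x[1]] * phi[2][x[2]]
    return p * psi[(x[0], x[1])] * psi[(x[1], x[2])]
x_map = max(product((0, 1), repeat=3), key=joint)

# message passing along the chain: M^3_2 first, then M^2_1 (Eq. 6)
M32 = [max(psi[(x2, x3)] * phi[2][x3] for x3 in (0, 1)) for x2 in (0, 1)]
M21 = [max(psi[(x1, x2)] * phi[1][x2] * M32[x2] for x2 in (0, 1))
       for x1 in (0, 1)]
# local estimate at node 1 (Eq. 5)
x1_bp = max((0, 1), key=lambda x1: phi[0][x1] * M21[x1])

print(x_map, x1_bp)   # x1 from message passing agrees with brute-force MAP
```

On a chain the two computations always agree; the question of what happens on loopy graphs is taken up in the next subsection.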
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for networks without loops, Weiss and Freeman demonstrate that the propagation rules work even in networks with loops <ref name="R4"> Weiss, Y. and Freeman, W.T. 1999. Correctness of belief propagation<br />
in Gaussian graphical models of arbitrary topology. Technical Report UCB.CSD-99-1046, Berkeley. </ref>. A summary of their results regarding belief propagation after convergence is provided in Figure 3. [[File:Mrf3.jpg||center|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]]<br />
<br />
<br />
== General view of the paper ==<br />
<br />
Basically, this paper aims to develop a new super-resolution scheme utilizing Markov random fields, by which a given low-resolution image is resized to the required high-resolution size while reproducing high-frequency details. Typical interpolation techniques such as bilinear, nearest neighbor, and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and, in some cases, blocking artifacts.<br />
<br />
According to Freeman, in this method they collect many pairs of high-resolution images and their corresponding low-resolution images as the training set. Given a low-resolution image y, a typical bicubic interpolation algorithm is employed to create a high-resolution image, which is interpreted as the “image” in the Markov network. Of course, this image does not look sharp, and thus they try to estimate the original high-resolution image, which we may call the “scene” based on the definitions provided above. To do so, the images in the training set and also the low-resolution image are divided into patches, so that each patch represents a Markov network node (Figure 1). Therefore, the <math>y_i</math>’s in the figure are observed, and thus should be shaded. Subsequently, for each patch in y, the 10 or 20 nearest patches from the training database are selected using Euclidean distance. The ultimate job is to find the best patch in the candidate set for each patch in y using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
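The candidate-selection step described above can be sketched as follows (function names and the nested-list patch representation are our own assumptions, not the paper's implementation):

```python
import math

def patchify(img, size):
    """Split a 2-D list `img` into non-overlapping size x size patches
    (assumes the dimensions are divisible by `size`)."""
    h, w = len(img), len(img[0])
    return [[row[c:c + size] for row in img[r:r + size]]
            for r in range(0, h, size) for c in range(0, w, size)]

def dist(p, q):
    """Euclidean distance between two patches of equal shape."""
    return math.sqrt(sum((a - b) ** 2
                         for pr, qr in zip(p, q) for a, b in zip(pr, qr)))

def candidates(image_patch, training_pairs, k=10):
    """Return the k training (image_patch, scene_patch) pairs whose
    image patch is closest to `image_patch` in Euclidean distance."""
    ranked = sorted(training_pairs, key=lambda pair: dist(pair[0], image_patch))
    return ranked[:k]

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
patches = patchify(img, 2)
train = [([[0, 0], [0, 0]], "dark scene"), ([[9, 9], [9, 9]], "bright scene")]
best = candidates([[1, 0], [0, 1]], train, k=1)
print(len(patches), best[0][1])
```

The k candidates returned for each node then become the discrete states among which the compatibility functions choose.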
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, so the complete representation is cumbersome. In this research, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over the joint spaces <math>x_i \times x_j </math> and <math> x_j \times x_k </math>; however, this is very difficult. The authors prefer a discrete representation, where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
<br />
For each patch in the "image" y a set of 10 or 20 "scene" candidates from the training set is chosen. Figure 4 [[File:MRF4.jpg|thumb|right|Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]] illustrates an example of a patch in y and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions can be defined arbitrarily, but they have to be chosen wisely. In this paper, a simple way is used to find the potential functions. They assume the “scene” patches overlap, as shown in Figure 5 [[File:MRF5.jpg|thumb|right|Fig.5 The compatibility between candidate scene explanations at neighboring nodes is determined by their values in their region of overlap.]]<br />
Therefore, the scene patches themselves may be used to define the potential functions <math>\psi</math> between the nodes <math>x_i</math>. Recall that for node <math>x_j</math> and its neighbor <math>x_k</math> there are two sets of candidate patches. Assume the lth candidate at node k and the mth candidate at node j overlap. We can think of the pixels of the lth candidate in the overlap region (<math>d^l_{jk}</math>) and their correspondents in the mth candidate (<math>d^m_{kj}</math>) as noisy versions of each other, and thus the potential function between node <math>x_j</math> and node <math>x_k</math> is given by:<br />
<br />
<center><math><br />
\psi(x^l_k,x^m_j) = \exp \left( \frac{-|d^l_{jk}-d^m_{kj}|^2}{2\sigma^2_s} \right)<br />
</math></center><br />
<br />
where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" or original high-resolution image by Gaussian noise with covariances <math>\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, the <math>\psi</math> function between nodes j and k can be represented by a matrix whose lth row and mth column is <math>\psi(x^l_k, x^m_j)</math>.<br />
The potential function <math>\phi</math>, which is defined between a "scene" node x and an "image" node y, is determined based on another intuitive assumption. We say that a "scene" candidate <math>x^l_k</math> is compatible with an observed image patch <math>y_0</math> if the image patch <math>y^l_k</math>, associated with the scene candidate <math>x^l_k</math> in the training database, matches <math>y_0</math>. Of course, it will not match exactly, but we may suppose that the training data is a “noisy” version of the original image, resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = \exp \left( \frac{-|y^l_k - y_0|^2}{2\sigma^2_i} \right)<br />
</math></center><br />
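Both compatibility functions share the same Gaussian form; a minimal sketch (our own helper, with patches flattened to lists of pixel values):

```python
import math

def compatibility(d_a, d_b, sigma):
    """Gaussian compatibility between two pixel vectors: either two
    candidates' values in their region of overlap (psi) or a candidate's
    training image patch versus the observed patch y_0 (phi)."""
    sq = sum((a - b) ** 2 for a, b in zip(d_a, d_b))
    return math.exp(-sq / (2.0 * sigma ** 2))

# identical vectors are maximally compatible
print(compatibility([1.0, 2.0], [1.0, 2.0], sigma=1.0))   # 1.0
# a squared difference of 4 with sigma = 1 gives exp(-2)
print(compatibility([1.0, 2.0], [3.0, 2.0], sigma=1.0))
```

The matrix of these values over all candidate pairs is exactly the tabular representation of psi described above.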
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and the “scene” to be estimated is its high-resolution version. At first glance the task may seem impossible, since the high-resolution data is missing. However, the human eye is able to identify edges and sharp details in a low-resolution image, and we know this structural information should remain at the higher resolution level. The authors attempt to solve this problem using the aforementioned Markov model, and they name the method VISTA. <br />
There are some preprocessing steps that increase the efficiency of the training set. First, consider a three-scale Laplacian pyramid decomposition. The first sub-band, H, represents the high-frequency detail, while the second and third sub-bands contain the middle, M, and low, L, frequency components. The assumption is that the high-frequency band, H, is conditionally independent of the lower frequency bands given the middle frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high-frequency components, we need the middle-frequency details, M, not the low-frequency band, L. This hypothesis greatly reduces the computational cost. Second, the researchers assume that the statistical relationships between image bands are independent of image contrast. Furthermore, they take the absolute value of the mid-frequency band and pass it through a lowpass filter, resulting in a normalized mid-frequency band; the same procedure is applied to the high-frequency band. Figure 5 (a) shows the low-resolution image, which is then expanded to the same size as the desired high-resolution image using a typical interpolation algorithm such as the bicubic method (Figure 5 (b)). The image in (c) is the original high-resolution image. The images in Figure 5 (d) and (e) are the first level of the Laplacian pyramid decomposition for the "image" and the “scene”, respectively. In other words, the high-frequency component in Figure 5 (e) should be estimated using the mid-frequency component in Figure 5 (d). <br />
<br />
<br />
Figure 5 [[File:MRF6.jpg|center|Fig.5 The low-resolution input, its bicubic expansion, the original high-resolution image, and the Laplacian pyramid components of the "image" and "scene", as described in the text.]]<br />
<br />
In order to utilize a Markov network for this problem, the "image" and "scene" are divided into local patches as shown in Figure 6, and the final estimate for the "scene" image, x, is the collection of patches which maximizes the probability <math> P(x|y) </math> using Equation 1. <br />
<br />
Figure 6 [[File:Mrf9.jpg|center|Fig.6 The "image" and "scene" divided into local patches.]]<br />
<br />
Figure 7 illustrates an example of a given patch in y and its corresponding "scene" patches. Choosing the patch size is a difficult task, since a small size gives very little information for estimating the underlying “scene” patches. On the other hand, large patches would make the learning process for the <math>\phi</math>’s very complex. The authors in some papers use a 7x7 patch size for the low-frequency band and 3x3 for the high-frequency components, but in ref they use 7x7 for the "image" and 5x5 for the "scene". <br />
<br />
Figure 7 [[File:Mrf7.jpg|center|Fig.7 A patch in y and its candidate "scene" patches.]]<br />
<br />
Figure 8 compares the results of the approach proposed in this paper with different super-resolution schemes. An interesting point is the effect of the training set on the final result: because the estimated "scene" patch is always chosen from the training database, the final result resembles the training set in some manner. <br />
Figure 8 [[File:results.jpg|center|Fig.8 Comparison of super-resolution results.]]<br />
<br />
== References ==<br />
<references /></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Mrf3.jpg&diff=14769File:Mrf3.jpg2011-11-13T15:31:55Z<p>Hyeganeh: uploaded a new version of &quot;File:Mrf3.jpg&quot;</p>
<hr />
<div></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14768markov Random Fields for Super-Resolution2011-11-13T15:30:03Z<p>Hyeganeh: </p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. Freeman first introduced a probabilistic approach <ref name="R1"> W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, "Learning Low-Level Vision", International Journal of Computer Vision, 40(1), pp. 25-47, 2000.</ref> <ref name="R2"> W. Freeman and E. Pasztor, "Markov Networks for Super-Resolution", in Proc. of 34th Annual Conference on Information Sciences and Systems, Princeton University, March 2000.</ref> in which they try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k)<br />
</math> (1)</center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)<br />
</math> (2)</center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network which are trees or chains.<br />
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate <ref name="R3"> Jordan, M.I. (Ed.). 1998. Learning in Graphical Models. MIT Press: Cambridge, MA </ref>. For example for the network in Figure 2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimation for node <math>j</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{1\,MAP} & = & \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) \\<br />
& = & \arg\max_{x_1} \max_{x_2} \max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x_2,x_3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) \max_{x_2} \psi(x_1,x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3)\\<br />
\end{matrix}</math> (4)</center><br />
<br />
Similar expressions hold for <math>\hat{x}_{2\,MAP}</math> and <math>\hat{x}_{3\,MAP}</math>. Equations (2) and (3) can be computed by iterating the following steps. The MAP estimate at node <math>j</math> is <br />
<br />
<center><math><br />
\hat{x}_{j\,MAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j<br />
</math> (5)</center><br />
<br />
where <math>k</math> runs over all “scene” node neighbors of node <math>j</math>, and <math>M^k_j</math> is the message from node <math>k</math> to node <math>j</math>. The message <math>M^k_j</math> is calculated by:<br />
<br />
<center><math><br />
M^k_j = \max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{l \neq j} \hat{M}^l_k<br />
</math> (6)</center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^l_k</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vectors of 1’s with the same dimension as <math>x_j</math>. Because the initial messages are all 1’s, at the first iteration the messages in the network are:<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)<br />
</math> (7)</center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)<br />
</math> (8)</center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1)<br />
</math> (9)</center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)<br />
</math> (10)</center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6):<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2<br />
</math> (11)</center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)<br />
</math> (12)</center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2<br />
</math> (13)</center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1)<br />
</math> (14)</center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3)<br />
</math> (15)</center><br />
<br />
Eventually the MAP estimate for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1\,MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1<br />
</math> (16)</center><br />
<br />
The MMSE estimate, Eq(2), has an analogous formula, with the <math>\max_{x_k}</math> of Eq(6) replaced by <math>\sum_{x_k}</math> and the <math>\arg\max_{x_j}</math> of Eq(5) replaced by <math>\sum_{x_j} x_j</math>.<br />
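The two iterations above can be sketched as runnable code for the three-node chain of Figure 2. The binary potential tables below are hypothetical illustrative numbers; the node-wise estimates from Eq (5) are checked against brute-force maximization of the joint:

```python
import itertools

# Max-product belief propagation on the 3-node chain of Fig. 2 (a sketch with
# made-up binary potentials; the same psi table is assumed on both edges).
psi = [[1.0, 0.3], [0.3, 1.0]]
phi = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]

def msg(frm, to, incoming):
    """Eq (6): M^frm_to[x_to] = max_{x_frm} psi * phi * product of incoming messages."""
    return [max(psi[x_to][x_frm] * phi[frm][x_frm] * incoming[x_frm]
                for x_frm in (0, 1))
            for x_to in (0, 1)]

ones = [1.0, 1.0]
# Two sweeps suffice for a 3-node chain; the first pass uses all-ones messages.
m32 = msg(2, 1, ones)   # Eq (8)
m21 = msg(1, 0, m32)    # Eq (11): uses M^3_2 from the previous pass
m12 = msg(0, 1, ones)   # Eq (9)
m23 = msg(1, 2, m12)    # Eq (13): uses M^1_2 from the previous pass

# Eq (5): local MAP estimate at each node from phi and the incoming messages.
x1 = max((0, 1), key=lambda x: phi[0][x] * m21[x])
x2 = max((0, 1), key=lambda x: phi[1][x] * m12[x] * m32[x])
x3 = max((0, 1), key=lambda x: phi[2][x] * m23[x])

# Check against brute-force maximization of the joint of Eq (1).
def joint(x):
    return (phi[0][x[0]] * phi[1][x[1]] * phi[2][x[2]]
            * psi[x[0]][x[1]] * psi[x[1]][x[2]])

best = max(itertools.product((0, 1), repeat=3), key=joint)
assert (x1, x2, x3) == best
```

Replacing each `max` over states with a sum (and the final per-node `max` with a posterior mean) gives the MMSE variant, exactly as noted above.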
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for networks without loops, Weiss and Freeman demonstrate that the propagation rules work even in networks with loops <ref name="R4"> Weiss, Y. and Freeman, W.T. 1999. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Technical Report UCB.CSD-99-1046, Berkeley. </ref>. A summary of their results regarding belief propagation after convergence is provided in Figure 3. [[File:Mrf3.jpg|center|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]]<br />
<br />
<br />
== General view of the paper ==<br />
<br />
This paper aims to develop a new super-resolution scheme utilizing Markov random fields, by which a given low-resolution image is resized to the required high-resolution size while reproducing high-frequency details. Typical interpolation techniques such as bilinear, nearest-neighbor, and bicubic expand the size of the low-resolution image, but the result suffers from blurriness and, in some cases, blocking artifacts.<br />
<br />
According to Freeman, in this method many pairs of high-resolution images and their corresponding low-resolution images are collected as the training set. Given a low-resolution image <math>y</math>, a typical bicubic interpolation algorithm is employed to create a high-resolution image, which is interpreted as the “image” in the Markov network. Of course this image does not look satisfactory, so the goal is to estimate the original high-resolution image, which we may call the “scene” image based on the definitions provided above. To do so, the images in the training set and the interpolated low-resolution image are divided into patches, so that each patch corresponds to a Markov network node (Figure 1); the <math>y_i</math>’s in the figure are observed, and thus shaded. Subsequently, for each patch in <math>y</math>, the 10 or 20 nearest patches from the training database are selected using Euclidean distance. The remaining task is to find the best patch in the candidate set for each patch in <math>y</math> using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
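The candidate-selection step can be sketched as follows; the function name, the toy two-pixel patches, and the use of plain lists are illustrative assumptions (the paper works with patch vectors in a PCA-reduced basis, as described below):

```python
# Minimal sketch of candidate selection: for each patch of the interpolated
# input, pick the k training patches with the smallest Euclidean distance.
def k_nearest_patches(patch, training_pairs, k=10):
    """training_pairs: list of (low_res_patch, high_res_patch) from the database.
    Returns the k high-res "scene" candidates whose low-res patch is closest."""
    def dist2(a, b):
        # Squared Euclidean distance between two flattened patch vectors.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    ranked = sorted(training_pairs, key=lambda pair: dist2(pair[0], patch))
    return [hi for _, hi in ranked[:k]]

# Usage with toy 2-pixel patches (labels stand in for high-res patches):
pairs = [([0.0, 0.0], "flat"), ([1.0, 0.0], "v-edge"), ([0.0, 1.0], "h-edge")]
candidates = k_nearest_patches([0.9, 0.1], pairs, k=2)  # closest: "v-edge", then "flat"
```

The returned candidates become the discrete states of the corresponding "scene" node, over which belief propagation then runs.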
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, so the complete representation is cumbersome. In this research, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> should be determined. A natural idea is to define Gaussian mixtures over the joint spaces <math>x_i \times x_j </math> and <math> x_j \times x_k </math>; however, this is very difficult. The authors prefer a discrete representation, where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
<br />
For each patch in the "image" <math>y</math>, a set of 10 or 20 "scene" candidates from the training set is chosen. Figure 4 [[File:MRF4.jpg|thumb|right|Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]] illustrates an example of a patch in <math>y</math> and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions can be defined arbitrarily, but they have to be chosen wisely. In this paper, a simple way is used to find them: the “scene” patches are assumed to overlap, as shown in Figure 5 [[File:MRF5.jpg|thumb|right|Fig.5 The compatibility between candidate scene explanations at neighboring nodes is determined by their values in their region of overlap.]]<br />
Therefore, the scene patches themselves may be used to define the potential functions <math>\psi</math> between the nodes. Recall that for node <math>x_j</math> and its neighbor <math>x_k</math> there are two sets of candidate patches. Assume the lth candidate at node <math>j</math> and the mth candidate at node <math>k</math> overlap. Let <math>d^l_{jk}</math> be the pixels of the lth candidate at node <math>j</math> lying in its region of overlap with node <math>k</math>, and <math>d^m_{kj}</math> the corresponding pixels of the mth candidate at node <math>k</math>. We can think of these overlapping pixels as noisy versions of each other, and thus the potential function between node <math>x_j</math> and node <math>x_k</math> is given by:<br />
<br />
<center><math><br />
\psi(x^l_j,x^m_k) = \exp\left( \frac{-|d^l_{jk}-d^m_{kj}|^2}{2\sigma^2_s} \right)<br />
</math></center><br />
<br />
where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" original high-resolution image by Gaussian noise with covariances <math>\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, the <math>\psi</math> function between nodes <math>j</math> and <math>k</math> can be represented by a matrix whose lth row and mth column is <math>\psi(x^l_j, x^m_k)</math>.<br />
The potential function <math>\phi</math>, defined between a "scene" node <math>x</math> and an "image" node <math>y</math>, is determined from another intuitive assumption. We say that a "scene" candidate <math>x^l_k</math> is compatible with an observed image patch <math>y_0</math> if the image patch <math>y^l_k</math>, associated with the scene candidate <math>x^l_k</math> in the training database, matches <math>y_0</math>. Of course it will not match exactly, but we may suppose that the training data is a “noisy” version of the original image, resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = \exp\left( \frac{-|y^l_k - y_0|^2}{2\sigma^2_i} \right)<br />
</math></center><br />
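The construction of both compatibility tables can be sketched on toy one-dimensional "patches". Only the two exponential forms above are taken from the summary; the overlap extraction, the patch representation, and the sigma values are illustrative assumptions:

```python
import math

# Sketch of building the compatibility tables from candidate patches.
def psi_matrix(cands_j, cands_k, overlap, sigma_s):
    """psi(x_j^l, x_k^m) = exp(-|d_jk^l - d_kj^m|^2 / (2 sigma_s^2)), where the
    d's are each candidate's pixel values in the region where patches overlap.
    Here patches are 1-D lists and node k is assumed to lie to the right of node j."""
    table = []
    for cj in cands_j:
        row = []
        for ck in cands_k:
            d_jk = cj[-overlap:]      # right edge of the candidate at node j
            d_kj = ck[:overlap]       # left edge of the candidate at node k
            sq = sum((a - b) ** 2 for a, b in zip(d_jk, d_kj))
            row.append(math.exp(-sq / (2 * sigma_s ** 2)))
        table.append(row)
    return table

def phi_value(y_cand, y_obs, sigma_i):
    """phi(x_k^l, y_k) from the match between the candidate's training image
    patch y_k^l and the observed image patch y_0."""
    sq = sum((a - b) ** 2 for a, b in zip(y_cand, y_obs))
    return math.exp(-sq / (2 * sigma_i ** 2))

# Candidates whose overlapping pixels agree exactly get compatibility 1.
table = psi_matrix([[0.0, 1.0, 1.0]], [[1.0, 1.0, 0.0]], overlap=2, sigma_s=1.0)
```

In the wiki's notation, `table` is exactly the matrix whose lth row and mth column is <math>\psi(x^l_j, x^m_k)</math>.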
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and the "scene" to be estimated is its high-resolution version. At first glance the task may seem impossible, since the high-resolution data is missing. However, the human eye can identify edges and sharp details in a low-resolution image, and we know that this structural information should persist at the higher resolution. The authors attempt to solve this problem using the aforementioned Markov model, and they name the method VISTA. <br />
There are some preprocessing steps that increase the efficiency of the training set. First, consider a three-scale Laplacian pyramid decomposition. The first sub-band, H, represents the high-frequency detail, while the second and third sub-bands contain the middle, M, and low, L, frequency components. The assumption is that the high-frequency band, H, is conditionally independent of the lower-frequency bands given the middle-frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high-frequency components we only need the middle-frequency details, M, not the low-frequency band, L. This hypothesis greatly reduces the computational cost. Second, the researchers assume that the statistical relationships between image bands are independent of image contrast. Accordingly, they take the absolute value of the mid-frequency band and then pass it through a lowpass filter, resulting in a normalized mid-frequency band; the same procedure is applied to the high-frequency band. Figure 5 (a) shows the low-resolution image, which is then expanded to the same size as the desired high-resolution image using a typical interpolation algorithm such as the bicubic method (Figure 5 (b)). The image in (c) is the original high-resolution image. The images in Figure 5 (d) and (e) are the first level of the Laplacian pyramid decomposition of the "image" and “scene” images, respectively. In other words, the high-frequency component in Figure 5 (e) should be estimated from the mid-frequency component in Figure 5 (d). <br />
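The contrast-normalization step admits the following sketch for a one-dimensional signal. This is only one plausible reading of the paragraph above: the 3-tap box filter, the edge padding, and the division of the band by the smoothed absolute value are all assumptions, not details specified in this summary.

```python
# Sketch of contrast normalization: divide a band by a local contrast estimate,
# namely the lowpass-filtered absolute value of the band (eps avoids division by zero).
def normalize_band(band, eps=0.01):
    a = [abs(v) for v in band]
    padded = [a[0]] + a + [a[-1]]      # replicate edge samples
    # 3-tap box lowpass over the absolute values (an assumed, minimal filter).
    smooth = [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
              for i in range(len(a))]
    return [v / (s + eps) for v, s in zip(band, smooth)]
```

The same normalization would be applied to both the mid- and high-frequency bands before patches are matched, making the matching insensitive to local contrast.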
<br />
<br />
Figure5 [[File:MRF6.jpg|center|Fig.5 (a) The low-resolution input; (b) its bicubic interpolation; (c) the original high-resolution image; (d) and (e) the first level of the Laplacian pyramid decomposition of the "image" and "scene" images, respectively.]]<br />
<br />
In order to utilize the Markov network for this problem, the "image" and "scene" images are divided into local patches as shown in Figure 6, and the final estimate for the "scene" image, <math>x</math>, is the collection of patches that maximizes the probability <math> P(x|y) </math> of Equation 1. <br />
<br />
Figure6 [[File:Mrf9.jpg|center|Fig.6 The "image" and "scene" images are divided into local patches.]]<br />
<br />
Figure 7 illustrates an example of a given patch in <math>y</math> and its corresponding candidate "scene" patches. Choosing the patch size is a difficult task: a small size gives very little information for estimating the underlying “scene” patches, while large patches make the learning process for the <math>\phi</math>’s very complex. In some papers the authors use a 7x7 patch size for the low-frequency band and 3x3 for the high-frequency components, but in the cited reference they use 7x7 for the "image" and 5x5 for the "scene". <br />
<br />
Figure7 [[File:Mrf7.jpg|center|Fig.7 A given patch in y and its corresponding candidate "scene" patches.]]<br />
<br />
Figure 8 compares the results of the approach proposed in this paper with those of different super-resolution schemes. An interesting point is the effect of the training set on the final result: because the estimated "scene" patch is always chosen from the training database, the final result resembles the training set in some manner. <br />
Figure8 [[File:results.jpg|center|Fig.8 Comparison of the proposed approach with other super-resolution schemes.]]<br />
<br />
== References ==<br />
<references /></div>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. Freeman first introduced a probabilistic approach <ref name="R1"> W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, "Learning Low-Level Vision", International Journal of Computer Vision, 40(1), pp. 25-47, 2000.</ref> <ref name="R2"> W. Freeman and E. Pasztor, "Markov Networks for Super-Resolution", in Proc. of 34th Annual Conference on Information Sciences and Systems, Princeton University, March 2000.</ref> in which they try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k)<br />
</math> (1)</center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)<br />
</math> (2)</center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network which are trees or chains.<br />
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate <ref name="R3"> Jordan, M.I. (Ed.). 1998. Learning in Graphical Models. MIT Press: Cambridge, MA </ref>. For example for the network in Figure 2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimation for node <math>j</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{MAP} & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\<br />
& = & \arg\max_{x_1} ( max_{x_2} max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x2,x3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) (max_{x_2} \psi(x_1,x_2) \phi(x_2,y2)) (max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)\\<br />
\end{matrix}</math></center><br />
<br />
The similar expressions for <math>x_{2MAP}</math> and <math>x_{3MAP}</math> can be used. Equations (3) and (2) can be computed by iterating the following steps. The MAP estimate at node j is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
Where k runs over all “scene” node neighbors of node j, and <math> M^k_j </math> is the message from node k to node j. The <math>M^k_j</math> message is calculated by:<br />
<br />
<center><math><br />
M^k_j = max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{i!=j} \hat{M}^l_k (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^k_l</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vector of 1’s, with the same dimension as <math>x_j</math>. Because the initial messages are 1’s at the first iteration, all the message in the network are:<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6) :<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) * max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15)<br />
</math></center><br />
<br />
Eventually the MAP estimates for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)<br />
</math></center><br />
<br />
The MMSE estimate Eq(3) has analogous formulate with the <math>max_{x_k}</math> of Eq(8) replaced by <math>\sum_{x_k}</math> and <math>\arg\max_{x_j}</math> replaced by <math>\sum_{x_j} x_j</math>.<br />
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for the networks without loops, (Weiss, 1998, Weiss and Freeman 1999;Yedidia et al. 2000) demonstrate applying the propagation rules works even in the network with loops Fig.3 [[File:Mrf3.jpg|thumb|right|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]]<br />
<br />
== General view of the paper ==<br />
<br />
Basically this paper aims to develop a new super-resolution scheme utilizing Markov random fields by which given low resolution image is resized to required high resolution size reproducing high frequency details. Typical interpolation techniques such as bilinear, nearest neighbor and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and in some cases blocking artifact.<br />
<br />
According to Freeman, in this method they collect many pairs of high resolution images and their corresponding low-resolution images as the training set. Given low-resolution image y a typical bicubic interpolation algorithm is employed to create a high resolution image which is interpreted as the “image” in Markov network. Of course this image does not look suitable, and thus they try to estimate original high resolution image which in we may call it “scene” image based on the definitions provided above. In order to do that, the images in training set and also the low-resolution image is divided into patches so that each patch represents the Markov network node (figure1). Therefore, <math>y_i</math>’s in figure are observed, and thus should be shaded. Subsequently, for each patch in y 10 or 20 nearest patches from the training database are selected using Euclidean distance. The ultimate job is to find the best patch in the candidate set for each patch in y using MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, the principle component analysis (PCA) is applied for each patch to find a set of lower dimensional basis function. Moreover, potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over joint spaces <math>x_i*x_j </math> and <math> x_j*x_k </math>;however, it is very difficult. The authors prefer a discrete representation where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
<br />
For each patch in "image" y a set of 10 or 20 "scene" candidates from the training set are chosen. Figure4 [[File:MRF4.jpg|thumb|right|Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]] illustrates an example of each patch in y and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions are defined arbitrary, but they have to be introduced wisely. In this paper, a simple way is used to find potential functions. They assume “scene” patches have overlap shown is Figure 5 [[File:MRF5.jpg|thumb|right|Fig.5 The compatibility between candidate scene explanations at neighboring nodes is determined by their values in their region of overlap.]]<br />
Therefore, the scene patches themselves may be use to define potential functions <math>\psi</math> between nodes <math>x_i</math>’s. Recall for node <math>x_i</math> and its neighbor <math>x_j</math> there are two sets of candidate patches. Let assume the lth candidate in node j and the mth candidate in node k have some overlap. Also, we can think that the pixels in the overlapped region in <math>x_j</math> (<math>d^l_{kj}</math>) and their correspondent in <math>x_k</math> (<math>d^m_{jk}</math>) are some variation of each other , and thus eventually the potential function between node <math>x_j</math> and node <math>x_k</math> are given by:<br />
<br />
<center><math><br />
\psi(x^l_k,x^m_j) = exp \frac{-|d^l_{jk}-d^m_{kj}|}{2\sigma^2_s}<br />
</math></center><br />
<br />
Where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" or original high resolution image by Gaussian noise with covariance <math\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, <math>\psi</math> function between node j and k can be represented by a matrix whose lth row and mth column is <math>\psi(x^l_k, x^m_j)</math>.<br />
Potential function <math>\phi</math> which is defined between "scene" node x and "image" node y is determined based on another intuitive assumption. We say that a "scene" candidate <math>x^k_l</math> is compatible with an observed image patch <math>y_0</math> if the image patch, <math>y^k_l</math>, associated with the scene candidate <math>x^k_l</math> in the training database matches <math>y_0</math>. Of course, it will not exactly match, but we may suppose that the training data is “noisy” version of original image resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = exp \frac{-|y^l_k - y_0|^2}{2\sigma^2_s}<br />
</math></center><br />
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and the “scene” to be estimated is its high resolution version. At the first glance, the task may seem impossible since the high resolution data is missing. However, the human eye is able to identify edges and sharp details in low resolution image and we know these structural information should remain at higher resolution level. The authors attempt to solve this problem using aforementioned Markov model and they name the method VISTA. <br />
There are some preprocessing steps in order to increase the efficiency of the training set. First, consider three scales Laplacian pyramid decomposition. The first sub-band, H, represents the detail in high frequency while the second and the third sub-bands indicate the middle, M, and the low, L, frequency components. The assumption is that high frequency band, H, is conditionally independent of the lower frequency bands, given the middle frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high frequency components, we will need the middle frequency details, M, not the low frequency band, L. This hypothesis greatly reduces the computation costs. Second, The reaserchers in this paper assume that statistical relationships between image bands are independent of image contrast. Furthermore, they take the absolute value of the mid-frequency band, and then pass it through a lowpass filter resulting in a normalized mid frequency band. They also do the same procedure for high-frequency band. Figure 5 (a) show the low-resolution image which then is expanded to have the same size as the desired high resolution image using a typical interpolation algorithms such as bicubic method (Figure 5 (b)). The image in (c) in the original high resolution image. Images in Figure 5 (d) and (e) are the first level of the Laplacian pyramid decomposition for "image" and “scene” images, respectively. In other words, the high frequency component in Figure 5 (e) should be estimated using frequency component in Figure 5(d). <br />
<br />
<br />
Figure5 [[File:MRF6.jpg|center|Fig.6 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]]<br />
<br />
In order to utilize Markov network for this problem the "image" and "scene" images are divided into local patches as shown in Figure 6 and the final estimate for "scene" image , x,, is the collection which maximizes probability of <math> P(x|y) </math> using Equation 1. <br />
<br />
Figure6 [[File:Mrf9.jpg|center|Fig.6 .]]<br />
<br />
Figure 7 illustrates an example of a given patch in y and its corresponding "scene" patches. The patch size is a difficult task since choosing small size gives very little information for estimating the underlying “scene” patches. On the other hand, large patches would make the learning processs of <math>\phi</math>’s very complex. The authors in some papers use 7*7 patch size for low frequency band and 3*3 for high frequency components, but in ref they use 7*7 for the "image" and 5*5 for the "scene". <br />
<br />
Figure7 [[File:Mrf7.jpg|center|Fig.6 .]]<br />
<br />
Fugure 8 compares the results of the proposed approach in this paper and different super-resolution schemes. The interesting thing is the effect of training set on the final result. Because the estimate "scene" patch is always chosen from the training database the final result resembles the training set in some manner. <br />
Figure8 [[File:results.jpg|center|Fig.6 .]]<br />
<br />
== References ==<br />
<references /></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14764markov Random Fields for Super-Resolution2011-11-13T15:07:31Z<p>Hyeganeh: </p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. Freeman first introduced a probabilistic approach <ref name="R1"> Freeman W. , “Learning Low-level Vision” in IEEE Trans. on Info. theory, vol 52, no 4,pp. 1289–1306, Apr. 2006.</ref> in which they try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.Donoho <ref name="R1"> D. Donoho, “Compressed Sensing,” in IEEE Trans. on Info. theory, vol 52, no 4,pp. 1289–1306, Apr. 2006.</ref> and Candes et. al <ref name="R2"> E. Candes, J. Romberg, J.; T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” in IEEE Trans. on Info. theory, vol 52, no 2,pp. 489–509, Feb. 2006.</ref><br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks, the unknown parameters are learned from training data in a learning phase; then, in an inference phase, the “scene” estimate is made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k)<br />
</math> (1)</center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions learned from training data; in this paper the authors prefer to call them compatibility functions. One can then write the MMSE and MAP estimates for <math>\hat{x}_j</math> by marginalizing or maximizing, respectively, over all other variables in the posterior probability. For discrete variables the expressions are:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} x_j \sum_{x_i, i \neq j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)<br />
</math> (2)</center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \max_{x_i, i \neq j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)<br />
</math> (3)</center><br />
<br />
For large networks, Eq. (2) and Eq. (3) are infeasible to evaluate directly; however, the task is easier for networks which are trees or chains.<br />
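For intuition, Eqs. (1)–(3) can be evaluated by brute force on a toy chain. The following numpy sketch uses hypothetical sizes and random compatibility tables; all names and numbers are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4          # number of discrete states per "scene" patch (illustrative)
N = 3          # three scene nodes in a chain x1 - x2 - x3

# Compatibility tables as in Eq. (1): psi(x_i, x_j) and phi(x_k, y_k),
# with the observations y held fixed.
psi = [rng.random((S, S)) for _ in range(N - 1)]   # pairwise scene-scene
phi = [rng.random(S) for _ in range(N)]            # scene-image

# Unnormalized joint P(x1..x3, y) over all S**N scene configurations.
joint = np.zeros((S,) * N)
for idx in np.ndindex(*joint.shape):
    p = 1.0
    for k in range(N):
        p *= phi[k][idx[k]]
    for i in range(N - 1):
        p *= psi[i][idx[i], idx[i + 1]]
    joint[idx] = p

# Eq. (3): MAP estimate of x1 -- maximize over all other variables.
x1_map = int(np.argmax(joint.max(axis=(1, 2))))

# Eq. (2): MMSE estimate of x1 -- posterior mean of the marginal.
marg1 = joint.sum(axis=(1, 2))
marg1 /= marg1.sum()
x1_mmse = float(np.sum(np.arange(S) * marg1))
```

This direct enumeration costs <math>S^N</math> operations, which is exactly the blow-up that the message-passing scheme of the next subsection avoids on trees and chains.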
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loops, inference reduces to a simple “message-passing” rule that computes the MAP and MMSE estimates. For example, for the network in Fig.2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimate for node <math>x_1</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{1MAP} & = & \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) \\<br />
& = & \arg\max_{x_1} \max_{x_2} \max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x_2,x_3) \\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) \max_{x_2} \left( \psi(x_1,x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) \right) \\<br />
\end{matrix}</math> (4)</center><br />
<br />
Similar expressions can be written for <math>\hat{x}_{2MAP}</math> and <math>\hat{x}_{3MAP}</math>. Equations (2) and (3) can be computed by iterating the following steps. The MAP estimate at node <math>j</math> is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
where <math>k</math> runs over all “scene” node neighbors of node <math>j</math>, and <math> M^k_j </math> is the message from node <math>k</math> to node <math>j</math>, calculated by:<br />
<br />
<center><math><br />
M^k_j = \max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{l \neq j} \hat{M}^l_k<br />
</math> (6)</center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^l_k</math> from the previous iteration. The initial messages <math>\hat{M}^k_j</math> are set to column vectors of 1’s, with the same dimension as <math>x_j</math>. Because the initial messages are all 1’s, at the first iteration the messages in the network are:<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)<br />
</math> (7)</center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)<br />
</math> (8)</center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1)<br />
</math> (9)</center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)<br />
</math> (10)</center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6) :<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \hat{M}^3_2<br />
</math> (11)</center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)<br />
</math> (12)</center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) \hat{M}^1_2<br />
</math> (13)</center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1)<br />
</math> (14)</center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3)<br />
</math> (15)</center><br />
<br />
Eventually the MAP estimate for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1) M^2_1<br />
</math> (16)</center><br />
<br />
The MMSE estimate (Eq. 2) has an analogous formula, with the <math>\max_{x_k}</math> of Eq. (6) replaced by <math>\sum_{x_k}</math> and the <math>\arg\max_{x_j}</math> of Eq. (5) replaced by <math>\sum_{x_j} x_j</math>.<br />
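As a sanity check on these message-passing rules, the following numpy sketch runs max-product messages inward along the three-node chain of Fig. 2 and compares the result of Eq. (16) with brute-force maximization of Eq. (4). Sizes and compatibility values are hypothetical, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 5                          # states per node (illustrative); chain x1 - x2 - x3
psi = rng.random((2, S, S))    # psi[0] = psi(x1, x2), psi[1] = psi(x2, x3)
phi = rng.random((3, S))       # phi[k] = phi(x_k, y_k), with y observed

# One inward sweep of messages (the max-product versions of Eqs. 6-16).
M_3_to_2 = (psi[1] * phi[2]).max(axis=1)               # Eq. (8): max over x3
M_2_to_1 = (psi[0] * (phi[1] * M_3_to_2)).max(axis=1)  # Eq. (11): max over x2
x1_map_bp = int(np.argmax(phi[0] * M_2_to_1))          # Eq. (16)

# Brute-force MAP from Eq. (4) over the full joint table.
joint = (phi[0][:, None, None] * phi[1][None, :, None] * phi[2][None, None, :]
         * psi[0][:, :, None] * psi[1][None, :, :])
x1_map_exact = int(np.argmax(joint.max(axis=(1, 2))))
assert x1_map_bp == x1_map_exact
```

On a chain, one inward sweep of messages reproduces the brute-force MAP estimate exactly, which is why belief propagation is exact (and fast) in loop-free networks.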
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for networks without loops, Weiss (1998), Weiss and Freeman (1999), and Yedidia et al. (2000) demonstrate that applying the same propagation rules often works well even in networks with loops; Fig.3 [[File:Mrf3.jpg|thumb|right|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]] summarizes these results.<br />
<br />
== General view of the paper ==<br />
<br />
This paper aims to develop a new super-resolution scheme based on Markov random fields, by which a given low-resolution image is resized to the required high-resolution size while reproducing high-frequency details. Typical interpolation techniques such as bilinear, nearest-neighbor, and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and, in some cases, blocking artifacts.<br />
<br />
According to Freeman, many pairs of high-resolution images and their corresponding low-resolution images are collected as the training set. Given a low-resolution image <math>y</math>, a typical bicubic interpolation algorithm is employed to create a high-resolution image, which is interpreted as the “image” in the Markov network. Of course this image does not look sharp, so the goal is to estimate the original high-resolution image, which we may call the “scene” based on the definitions provided above. To do so, the images in the training set and the interpolated low-resolution image are divided into patches, so that each patch represents a Markov network node (Figure 1); the <math>y_i</math>’s in the figure are observed, and thus shaded. Subsequently, for each patch in <math>y</math>, the 10 or 20 nearest patches from the training database are selected using Euclidean distance. The remaining job is to find the best patch in the candidate set for each patch in <math>y</math> using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
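The candidate-selection step (nearest training patches under Euclidean distance) can be sketched as follows; the database size, patch sizes, and function names are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical training database of 500 (image patch, scene patch) pairs,
# flattened to vectors; 7x7 and 5x5 patch sizes are illustrative.
train_img = rng.random((500, 49))   # flattened 7x7 "image" patches
train_scn = rng.random((500, 25))   # flattened 5x5 "scene" patches

def candidate_scenes(y_patch, k=10):
    """Return the k training scene patches whose associated image
    patches are nearest to y_patch in Euclidean distance."""
    d = np.linalg.norm(train_img - y_patch, axis=1)
    nearest = np.argsort(d)[:k]
    return train_scn[nearest], nearest

# One observed patch -> 10 scene candidates for its Markov network node.
cands, idx = candidate_scenes(rng.random(49), k=10)
```

Each network node then carries only its short candidate list, so inference searches over 10–20 discrete states per node rather than over all possible pixel values.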
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, so the complete representation is cumbersome. In this research, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> must be determined. A natural idea is to define Gaussian mixtures over the joint spaces <math>x_i \times x_j </math> and <math> x_j \times x_k </math>; however, this is very difficult in practice. The authors prefer a discrete representation, where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
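A minimal sketch of the PCA step, assuming flattened patches and an illustrative number of components (neither the patch count nor the number of retained components is specified here):

```python
import numpy as np

rng = np.random.default_rng(5)
patches = rng.random((200, 49))   # 200 flattened 7x7 patches (illustrative)

# PCA via SVD of the centered data: the rows of Vt are the principal
# directions, i.e. the lower-dimensional basis functions.
mean = patches.mean(axis=0)
U, s, Vt = np.linalg.svd(patches - mean, full_matrices=False)
k = 8                                 # number of components kept (illustrative)
basis = Vt[:k]                        # k basis functions, each of length 49
coords = (patches - mean) @ basis.T   # patch coordinates in the PCA space
recon = coords @ basis + mean         # approximate reconstruction of patches
```

Each patch is thus summarized by <math>k</math> coefficients instead of 49 pixel values, which is what makes the discrete candidate representation tractable.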
<br />
For each patch in the "image" <math>y</math>, a set of 10 or 20 "scene" candidates from the training set is chosen. Figure 4 [[File:MRF4.jpg|thumb|right|Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]] illustrates an example of a patch in <math>y</math> and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions can be defined arbitrarily, but they should be chosen wisely. In this paper, a simple construction is used: the “scene” patches are assumed to overlap, as shown in Figure 5 [[File:MRF5.jpg|thumb|right|Fig.5 The compatibility between candidate scene explanations at neighboring nodes is determined by their values in their region of overlap.]]<br />
Therefore, the scene patches themselves may be used to define the potential functions <math>\psi</math> between the nodes <math>x_i</math>. Recall that for a node <math>x_j</math> and its neighbor <math>x_k</math> there are two sets of candidate patches. Assume the lth candidate at node <math>j</math> and the mth candidate at node <math>k</math> overlap. We can regard the pixels of the lth candidate at node <math>j</math> in the overlap region, <math>d^l_{jk}</math>, and the corresponding pixels of the mth candidate at node <math>k</math>, <math>d^m_{kj}</math>, as noisy versions of each other, and thus the potential function between node <math>x_j</math> and node <math>x_k</math> is given by:<br />
<br />
<center><math><br />
\psi(x^l_j,x^m_k) = \exp \left( \frac{-|d^l_{jk}-d^m_{kj}|^2}{2\sigma^2_s} \right)<br />
</math></center><br />
<br />
where <math>\sigma_s</math> must be determined. The authors assume that the image and scene training samples differ from the "ideal" original high-resolution image by Gaussian noise with parameters <math>\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, the <math>\psi</math> function between nodes <math>j</math> and <math>k</math> can be represented by a matrix whose lth row and mth column is <math>\psi(x^l_j, x^m_k)</math>.<br />
The potential function <math>\phi</math>, defined between a "scene" node <math>x</math> and an "image" node <math>y</math>, is determined by another intuitive assumption. We say that a "scene" candidate <math>x^l_k</math> is compatible with an observed image patch <math>y_0</math> if the image patch <math>y^l_k</math> associated with the scene candidate <math>x^l_k</math> in the training database matches <math>y_0</math>. Of course it will not match exactly, but we may suppose that the training data is a “noisy” version of the original image, resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = \exp \left( \frac{-|y^l_k - y_0|^2}{2\sigma^2_i} \right)<br />
</math></center><br />
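The two compatibility functions above can be tabulated directly from the candidate patches. This sketch assumes hypothetical candidate counts, overlap sizes, and noise parameters (none are taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
S, ov = 10, 15          # 10 candidates per node; 15 overlap pixels (illustrative)
sigma_s = 0.1           # scene noise parameter (illustrative)

# d_jk[l]: overlap pixels of candidate l at node j; d_kj[m]: the same
# region as rendered by candidate m at the neighboring node k.
d_jk = rng.random((S, ov))
d_kj = rng.random((S, ov))

# psi matrix: entry (l, m) scores how well the two candidates agree
# in their region of overlap, via a squared-difference Gaussian.
diff = d_jk[:, None, :] - d_kj[None, :, :]
psi = np.exp(-np.sum(diff**2, axis=2) / (2 * sigma_s**2))

# phi vector: Gaussian match between each candidate's associated training
# image patch and the observed patch y0 (sigma_i, sizes illustrative).
sigma_i, P = 0.1, 25
y_cands = rng.random((S, P))     # training image patches of the S candidates
y0 = rng.random(P)               # observed image patch
phi = np.exp(-np.sum((y_cands - y0)**2, axis=1) / (2 * sigma_i**2))
```

The resulting <math>\psi</math> matrix and <math>\phi</math> vector are exactly the discrete tables that the message-passing equations of the previous section multiply and maximize over.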
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and the “scene” to be estimated is its high-resolution version. At first glance the task may seem impossible, since the high-resolution data is missing. However, the human eye can identify edges and sharp details in a low-resolution image, and we know this structural information should persist at higher resolution. The authors attempt to solve this problem using the aforementioned Markov model, and they name the method VISTA. <br />
There are some preprocessing steps that increase the efficiency of the training set. First, consider a three-scale Laplacian pyramid decomposition. The first sub-band, H, represents the high-frequency detail, while the second and third sub-bands contain the middle-frequency, M, and low-frequency, L, components. The assumption is that the high-frequency band, H, is conditionally independent of the lower-frequency bands given the middle-frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high-frequency components we only need the middle-frequency details, M, not the low-frequency band, L. This hypothesis greatly reduces the computational cost. Second, the researchers assume that the statistical relationships between image bands are independent of image contrast. They therefore take the absolute value of the mid-frequency band and pass it through a lowpass filter, producing a local contrast measure used to normalize the mid-frequency band; the same procedure is applied to the high-frequency band. Figure 5 (a) shows the low-resolution image, which is then expanded to the same size as the desired high-resolution image using a typical interpolation algorithm such as the bicubic method (Figure 5 (b)). The image in (c) is the original high-resolution image. The images in Figure 5 (d) and (e) are the first level of the Laplacian pyramid decomposition for the "image" and “scene”, respectively. In other words, the high-frequency component in Figure 5 (e) should be estimated from the frequency component in Figure 5 (d). <br />
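The contrast-normalization step can be approximated with a simple separable blur. This single-scale numpy sketch stands in for the paper's Laplacian pyramid and uses an illustrative box filter rather than the authors' actual lowpass:

```python
import numpy as np

def box_blur(img, r=2):
    """Cheap separable lowpass, used here as a stand-in for Gaussian blur."""
    k = np.ones(2 * r + 1) / (2 * r + 1)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, out)

rng = np.random.default_rng(4)
img = rng.random((32, 32))   # stand-in image (illustrative size)

# Band-pass "mid frequency" layer: the image minus its lowpass version,
# akin to one Laplacian pyramid level (single-scale stand-in).
mid = img - box_blur(img)

# Contrast normalization: divide by a blurred absolute-value envelope,
# making the band's statistics roughly independent of local contrast.
envelope = box_blur(np.abs(mid)) + 1e-3   # small constant avoids division by 0
mid_norm = mid / envelope
```

After this normalization, high- and mid-frequency patches from training images with very different contrasts can all serve as candidates for the same observed patch.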
<br />
<br />
Figure 5 [[File:MRF6.jpg|center|Fig.5 Preprocessing for super-resolution: (a) low-resolution input, (b) bicubic interpolation, (c) original high-resolution image, (d)-(e) first Laplacian pyramid level for the "image" and "scene".]]<br />
<br />
In order to utilize a Markov network for this problem, the "image" and "scene" images are divided into local patches as shown in Figure 6, and the final estimate for the "scene" image, <math>x</math>, is the collection of patches that maximizes <math> P(x|y) </math> using Equation (1). <br />
<br />
Figure 6 [[File:Mrf9.jpg|center|Fig.6 Division of the "image" and "scene" into local patches.]]<br />
<br />
Figure 7 illustrates an example of a given patch in <math>y</math> and its corresponding "scene" patches. Choosing the patch size is difficult: a small size gives very little information for estimating the underlying “scene” patches, while large patches make the learning process for the <math>\phi</math>’s very complex. The authors in some papers use a 7×7 patch size for the low-frequency band and 3×3 for the high-frequency components, but in ref they use 7×7 for the "image" and 5×5 for the "scene". <br />
<br />
Figure 7 [[File:Mrf7.jpg|center|Fig.7 A patch in y and its candidate "scene" patches.]]<br />
<br />
Figure 8 compares the results of the proposed approach with those of different super-resolution schemes. An interesting point is the effect of the training set on the final result: because the estimated "scene" patch is always chosen from the training database, the final result resembles the training set in some manner. <br />
Figure 8 [[File:results.jpg|center|Fig.8 Comparison of super-resolution results.]]<br />
<br />
== References ==<br />
<references /></div>
<hr />
<div></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14762markov Random Fields for Super-Resolution2011-11-13T14:41:51Z<p>Hyeganeh: </p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)<br />
</math></center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2)<br />
</math></center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network which are trees or chains.<br />
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate. For example for the network in Fig.2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimation for node <math>j</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{MAP} & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\<br />
& = & \arg\max_{x_1} ( max_{x_2} max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x2,x3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) (max_{x_2} \psi(x_1,x_2) \phi(x_2,y2)) (max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)\\<br />
\end{matrix}</math></center><br />
<br />
The similar expressions for <math>x_{2MAP}</math> and <math>x_{3MAP}</math> can be used. Equations (3) and (2) can be computed by iterating the following steps. The MAP estimate at node j is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
Where k runs over all “scene” node neighbors of node j, and <math> M^k_j </math> is the message from node k to node j. The <math>M^k_j</math> message is calculated by:<br />
<br />
<center><math><br />
M^k_j = max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{i!=j} \hat{M}^l_k (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^k_l</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vector of 1’s, with the same dimension as <math>x_j</math>. Because the initial messages are 1’s at the first iteration, all the message in the network are:<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6) :<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) * max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15)<br />
</math></center><br />
<br />
Eventually the MAP estimates for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)<br />
</math></center><br />
<br />
The MMSE estimate Eq(3) has analogous formulate with the <math>max_{x_k}</math> of Eq(8) replaced by <math>\sum_{x_k}</math> and <math>\arg\max_{x_j}</math> replaced by <math>\sum_{x_j} x_j</math>.<br />
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for the networks without loops, (Weiss, 1998, Weiss and Freeman 1999;Yedidia et al. 2000) demonstrate applying the propagation rules works even in the network with loops Fig.3 [[File:Mrf3.jpg|thumb|right|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]]<br />
<br />
== General view of the paper ==<br />
<br />
Basically this paper aims to develop a new super-resolution scheme utilizing Markov random fields by which given low resolution image is resized to required high resolution size reproducing high frequency details. Typical interpolation techniques such as bilinear, nearest neighbor and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and in some cases blocking artifact.<br />
<br />
According to Freeman, in this method they collect many pairs of high resolution images and their corresponding low-resolution images as the training set. Given low-resolution image y a typical bicubic interpolation algorithm is employed to create a high resolution image which is interpreted as the “image” in Markov network. Of course this image does not look suitable, and thus they try to estimate original high resolution image which in we may call it “scene” image based on the definitions provided above. In order to do that, the images in training set and also the low-resolution image is divided into patches so that each patch represents the Markov network node (figure1). Therefore, <math>y_i</math>’s in figure are observed, and thus should be shaded. Subsequently, for each patch in y 10 or 20 nearest patches from the training database are selected using Euclidean distance. The ultimate job is to find the best patch in the candidate set for each patch in y using MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, the principle component analysis (PCA) is applied for each patch to find a set of lower dimensional basis function. Moreover, potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over joint spaces <math>x_i*x_j </math> and <math> x_j*x_k </math>;however, it is very difficult. The authors prefer a discrete representation where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
<br />
For each patch in "image" y a set of 10 or 20 "scene" candidates from the training set are chosen. Figure4 [[File:MRF4.jpg|thumb|right|Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]] illustrates an example of each patch in y and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions are defined arbitrary, but they have to be introduced wisely. In this paper, a simple way is used to find potential functions. They assume “scene” patches have overlap shown is Figure 5 [[File:MRF5.jpg|thumb|right|Fig.5 The compatibility between candidate scene explanations at neighboring nodes is determined by their values in their region of overlap.]]<br />
Therefore, the scene patches themselves may be use to define potential functions <math>\psi</math> between nodes <math>x_i</math>’s. Recall for node <math>x_i</math> and its neighbor <math>x_j</math> there are two sets of candidate patches. Let assume the lth candidate in node j and the mth candidate in node k have some overlap. Also, we can think that the pixels in the overlapped region in <math>x_j</math> (<math>d^l_{kj}</math>) and their correspondent in <math>x_k</math> (<math>d^m_{jk}</math>) are some variation of each other , and thus eventually the potential function between node <math>x_j</math> and node <math>x_k</math> are given by:<br />
<br />
<center><math><br />
\psi(x^l_k,x^m_j) = exp \frac{-|d^l_{jk}-d^m_{kj}|}{2\sigma^2_s}<br />
</math></center><br />
<br />
Where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" or original high resolution image by Gaussian noise with covariance <math\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, <math>\psi</math> function between node j and k can be represented by a matrix whose lth row and mth column is <math>\psi(x^l_k, x^m_j)</math>.<br />
Potential function <math>\phi</math> which is defined between "scene" node x and "image" node y is determined based on another intuitive assumption. We say that a "scene" candidate <math>x^k_l</math> is compatible with an observed image patch <math>y_0</math> if the image patch, <math>y^k_l</math>, associated with the scene candidate <math>x^k_l</math> in the training database matches <math>y_0</math>. Of course, it will not exactly match, but we may suppose that the training data is “noisy” version of original image resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = exp \frac{-|y^l_k - y_0|^2}{2\sigma^2_s}<br />
</math></center><br />
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and thus the “scene” to be estimated is its high resolution version. At the first glance, the task may seem impossible since the high resolution data is missing. However, the human eye is able to identify edges and sharp details in low resolution image and we know these structural information should remain at higher resolution level. The authors attempt to solve this problem using aforementioned Markov model and they name the method VISTA. <br />
There are some preprocessing steps in order to increase the efficiency of the training set. First, consider three scales Laplacian pyramid decomposition. The first sub-band, H, represents the detail in high frequency while the second and the third sub-bands indicate the middle, M, and the low, L, frequency components. The assumption is that high frequency band, H, is conditionally independent of the lower frequency bands, given the middle frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high frequency components, we will need the middle frequency details, M, not the low frequency band, L. This hypothesis greatly reduces the computation costs. Second, The reaserchers in this paper assume that statistical relationships between image bands are independent of image contrast. Furthermore, they take the absolute value of the mid-frequency band, and then pass it through a lowpass filter resulting in a normalized mid frequency band. They also do the same procedure for high-frequency band. Figure 5 (a) show the low-resolution image which then is expanded to have the same size as the desired high resolution image using a typical interpolation algorithms such as bicubic method (Figure 5 (b)). The image in (c) in the original high resolution image. Images in Figure 5 (d) and (e) are the first level of the Laplacian pyramid decomposition for "image" and “scene” images, respectively. In other words, the high frequency component in Figure 5 (e) should be estimated using frequency component in Figure 5(d). <br />
<br />
<br />
Figure5 [[File:MRF6.jpg|center|Fig.6 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]]<br />
<br />
In order to utilize Markov network for this problem the "image" and "scene" images are divided into local patches as shown in Figure 6 and the final estimate for "scene" image , x,, is the collection which maximizes probability of <math> P(x|y) </math> using Equation 1. <br />
<br />
Figure6 [[File:Mrf9.jpg|center|Fig.6 .]]<br />
<br />
Figure 7 illustrates an example of a given patch in y and its corresponding "scene" patches. The patch size is a difficult task since choosing small size gives very little information for estimating the underlying “scene” patches. On the other hand, large patches would make the learning processs of <math>\phi</math>’s very complex. The authors in some papers use 7*7 patch size for low frequency band and 3*3 for high frequency components, but in ref they use 7*7 for the "image" and 5*5 for the "scene". <br />
<br />
Figure7 [[File:Mrf7.jpg|center|Fig.6 .]]</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14761markov Random Fields for Super-Resolution2011-11-13T14:15:55Z<p>Hyeganeh: </p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is the Markov network. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability is <math> P(x|y)= cP(x,y)</math>, where the parameter <math> c = 1/P(y) </math> is constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error (MMSE) or the maximum a posteriori (MAP) estimate. Without any approximation, <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for this problem, where nodes connected by lines have statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as to its neighbors. <br />
<br />
To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)<br />
</math></center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions which are learned from training data. In this paper the authors prefer to call these functions compatibility functions. One can then write the MAP and the MMSE estimates for <math>\hat{x}_j</math> by taking the maximum over, or marginalizing over, all other variables in the posterior probability, respectively. For discrete variables the expressions are:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} x_j \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2)<br />
</math></center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks, Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for networks which are trees or chains.<br />
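To make Eq(2) and Eq(3) concrete, the following sketch (not from the paper) evaluates the joint of Eq(1) by brute force for a three-node chain with binary states; the particular <math>\psi</math> and <math>\phi</math> tables are made-up numbers.<br />

```python
import itertools

# Hypothetical compatibility tables for a 3-node chain with binary states.
# psi favors equal neighboring labels; phi encodes the (fixed) observations y_k.
psi = {(a, b): 2.0 if a == b else 0.5 for a in (0, 1) for b in (0, 1)}
phi = [{0: 0.9, 1: 0.1}, {0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}]

def joint(x):
    """Unnormalized P(x_1,...,x_3, y_1,...,y_3) as in Eq(1)."""
    p = psi[(x[0], x[1])] * psi[(x[1], x[2])]
    for k in range(3):
        p *= phi[k][x[k]]
    return p

states = list(itertools.product((0, 1), repeat=3))
Z = sum(joint(x) for x in states)          # normalizing constant 1/c

# Eq(3): joint MAP configuration by exhaustive maximization.
x_map = max(states, key=joint)

# Eq(2): MMSE estimate at node j = sum over x_j of x_j times its marginal.
def mmse(j):
    return sum(x[j] * joint(x) for x in states) / Z
```

For a chain this small the enumeration is exact; the point is that the number of joint states grows exponentially with the number of nodes, which is why a message-passing scheme is needed.<br />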
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loops, inference reduces to a simple “message-passing” rule which enables us to compute the MAP and MMSE estimates. For example, for the network in Fig.2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimate for node 1 is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{1MAP} & = & \arg\max_{x_1} max_{x_2} max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\<br />
& = & \arg\max_{x_1} max_{x_2} max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x_2,x_3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) max_{x_2} \left( \psi(x_1,x_2) \phi(x_2,y_2) max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) \right)\\<br />
\end{matrix}</math></center><br />
<br />
Similar expressions can be used for <math>\hat{x}_{2MAP}</math> and <math>\hat{x}_{3MAP}</math>. Equations (2) and (3) can be computed by iterating the following steps. The MAP estimate at node j is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
where k runs over all “scene” node neighbors of node j, and <math> M^k_j </math> is the message from node k to node j. The message <math>M^k_j</math> is calculated by:<br />
<br />
<center><math><br />
M^k_j = max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{l != j} \hat{M}^l_k (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^l_k</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vectors of 1’s, with the same dimension as <math>x_j</math>. Because the initial messages are all 1’s, at the first iteration the messages in the network are:<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6) :<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \left( \psi(x_1, x_2) \phi(x_2,y_2) \, max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) \right) (15)<br />
</math></center><br />
<br />
Eventually the MAP estimate for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)<br />
</math></center><br />
<br />
The MMSE estimate, Eq(2), has an analogous formulation, with the <math>max_{x_k}</math> of Eq(6) replaced by <math>\sum_{x_k}</math> and the <math>\arg\max_{x_j}</math> of Eq(5) replaced by <math>\sum_{x_j} x_j</math>.<br />
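The message-passing recipe of Eq(5)–Eq(6) can be sketched as follows; this is an illustrative max-product implementation on a three-node chain with made-up compatibility tables, not the authors' code. On a loop-free chain it must agree with exhaustive maximization of Eq(1).<br />

```python
import itertools
from math import prod

n = 3
neighbors = {0: [1], 1: [0, 2], 2: [1]}            # chain x1 - x2 - x3
psi = {(a, b): 2.0 if a == b else 0.5 for a in (0, 1) for b in (0, 1)}
phi = [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}, {0: 0.2, 1: 0.8}]

# Messages M[(k, j)][x_j] from node k to node j, initialized to 1's as in the text.
M = {(k, j): {0: 1.0, 1: 1.0} for j in range(n) for k in neighbors[j]}

for _ in range(n):                                  # a few parallel sweeps settle a chain
    M_new = {}
    for (k, j) in M:
        # Eq(6): maximize over x_k, folding in previous-iteration messages into k.
        M_new[(k, j)] = {
            xj: max(
                psi[(xj, xk)] * phi[k][xk]
                * prod(M[(l, k)][xk] for l in neighbors[k] if l != j)
                for xk in (0, 1)
            )
            for xj in (0, 1)
        }
    M = M_new

# Eq(5): local MAP estimate at each node from its incoming messages.
x_bp = tuple(
    max((0, 1), key=lambda xj: phi[j][xj] * prod(M[(k, j)][xj] for k in neighbors[j]))
    for j in range(n)
)

# Exhaustive maximization of the joint in Eq(1), for comparison.
def joint(x):
    return (psi[(x[0], x[1])] * psi[(x[1], x[2])]
            * prod(phi[k][x[k]] for k in range(n)))

x_exact = max(itertools.product((0, 1), repeat=n), key=joint)
```

Because the leaf messages are exact after one sweep, the interior messages are exact after two, so `x_bp` matches `x_exact` here.<br />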
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for networks without loops, Weiss (1998), Weiss and Freeman (1999) and Yedidia et al. (2000) demonstrate that applying the propagation rules works even in networks with loops. Fig.3 [[File:Mrf3.jpg|thumb|right|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]]<br />
<br />
== General view of the paper ==<br />
<br />
This paper aims to develop a new super-resolution scheme utilizing Markov random fields, by which a given low resolution image is resized to the required high resolution size while reproducing high frequency details. Typical interpolation techniques such as bilinear, nearest neighbor and bicubic expand the size of the low-resolution image, yet the results suffer from blurriness and, in some cases, blocking artifacts.<br />
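For contrast with the proposed method, plain interpolation can be sketched as follows (a toy bilinear upscaler, not from the paper): every output pixel is a convex combination of input pixels, so no value outside the input range, and hence no new high-frequency detail, can ever be produced.<br />

```python
def bilinear_upscale(img, factor):
    """Upscale a 2-D list of floats by an integer factor with bilinear weights."""
    h, w = len(img), len(img[0])
    H, W = h * factor, w * factor
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Map output coordinates back to (fractional) input coordinates.
            y = min(i / factor, h - 1.0)
            x = min(j / factor, w - 1.0)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            dy, dx = y - y0, x - x0
            # Convex combination of the four surrounding input pixels.
            out[i][j] = (img[y0][x0] * (1 - dy) * (1 - dx)
                         + img[y0][x1] * (1 - dy) * dx
                         + img[y1][x0] * dy * (1 - dx)
                         + img[y1][x1] * dy * dx)
    return out

lowres = [[0.0, 1.0], [1.0, 0.0]]
hi = bilinear_upscale(lowres, 2)
```

Since each output value is a weighted average, the output stays inside the input's value range; the VISTA approach instead pastes in high-frequency content learned from training examples.<br />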
<br />
According to Freeman, the method collects many pairs of high resolution images and their corresponding low-resolution images as the training set. Given a low-resolution image y, a typical bicubic interpolation algorithm is employed to create a high resolution image, which is interpreted as the “image” in the Markov network. Of course, this image does not look satisfactory, and thus the original high resolution image, which we may call the “scene” based on the definitions provided above, is estimated. In order to do that, the images in the training set and also the low-resolution image are divided into patches, so that each patch represents a node of the Markov network (figure1). Therefore, the <math>y_i</math>’s in the figure are observed, and thus should be shaded. Subsequently, for each patch in y, 10 or 20 nearest patches from the training database are selected using Euclidean distance. The final task is to find the best patch in the candidate set for each patch in y using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
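The candidate-selection step can be sketched as a nearest-neighbor search over training patches. The database below is random stand-in data, the dimensions are only for illustration, and numpy is assumed.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training database: N ("image" patch, "scene" patch) pairs,
# stored as flattened vectors (e.g. 7*7 image patches, 5*5 scene patches).
N, img_dim, scene_dim, K = 200, 49, 25, 10
train_img = rng.standard_normal((N, img_dim))
train_scene = rng.standard_normal((N, scene_dim))

def scene_candidates(y_patch, k=K):
    """Return the k training scene patches whose image patches are nearest
    to y_patch in Euclidean distance, together with their indices."""
    d = np.linalg.norm(train_img - y_patch, axis=1)
    idx = np.argsort(d)[:k]
    return train_scene[idx], idx

y0 = rng.standard_normal(img_dim)
cands, idx = scene_candidates(y0)
```

These per-node candidate sets are what the discrete Markov network then chooses among via the MAP estimate.<br />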
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, so the complete representation is cumbersome. In this research, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> must be determined. A nice idea is to define Gaussian mixtures over the joint spaces <math>x_i*x_j </math> and <math> x_j*x_k </math>; however, this is very difficult. The authors prefer a discrete representation, where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
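The PCA step can be sketched with the SVD; this is a generic sketch on random stand-in patches (numpy assumed), not the authors' exact pipeline.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
patches = rng.standard_normal((500, 49))      # 500 hypothetical 7*7 patches, flattened

def pca_basis(X, n_components):
    """Return the mean, the top principal directions (rows of Vt),
    and the projected low-dimensional coefficients."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    comps = Vt[:n_components]                 # orthonormal basis functions
    coeffs = Xc @ comps.T                     # lower-dimensional representation
    return mean, comps, coeffs

mean, comps, coeffs = pca_basis(patches, 8)
recon = coeffs @ comps + mean                 # approximate patch reconstruction
```

Each patch is then represented by a handful of coefficients rather than its full pixel array, which keeps the discrete candidate representation manageable.<br />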
<br />
For each patch in "image" y a set of 10 or 20 "scene" candidates from the training set are chosen. Figure4 [[File:MRF4.jpg|thumb|right|Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]] illustrates an example of each patch in y and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions can be defined arbitrarily, but they have to be chosen wisely. In this paper, a simple way is used to find the potential functions. The “scene” patches are assumed to overlap, as shown in Figure 5 [[File:MRF5.jpg|thumb|right|Fig.5 The compatibility between candidate scene explanations at neighboring nodes is determined by their values in their region of overlap.]]<br />
Therefore, the scene patches themselves may be used to define the potential functions <math>\psi</math> between the nodes. Recall that for node <math>x_j</math> and its neighbor <math>x_k</math> there are two sets of candidate patches. Assume the lth candidate at node j and the mth candidate at node k overlap. We can think of the pixels of the overlap region in <math>x_j</math> (<math>d^l_{jk}</math>) and their correspondents in <math>x_k</math> (<math>d^m_{kj}</math>) as noisy versions of each other, and thus the potential function between node <math>x_j</math> and node <math>x_k</math> is given by:<br />
<br />
<center><math><br />
\psi(x^l_j,x^m_k) = exp \left( \frac{-|d^l_{jk}-d^m_{kj}|^2}{2\sigma^2_s} \right)<br />
</math></center><br />
<br />
where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" original high resolution image by Gaussian noise with covariances <math>\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, the <math>\psi</math> function between nodes j and k can be represented by a matrix whose lth row, mth column entry is <math>\psi(x^l_j, x^m_k)</math>.<br />
The potential function <math>\phi</math>, defined between a "scene" node x and an "image" node y, is determined from another intuitive assumption. We say that a "scene" candidate <math>x^l_k</math> is compatible with an observed image patch <math>y_0</math> if the image patch <math>y^l_k</math>, associated with the scene candidate <math>x^l_k</math> in the training database, matches <math>y_0</math>. Of course, it will not match exactly, but we may suppose that the training data is a “noisy” version of the original image, resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_0) = exp \left( \frac{-|y^l_k - y_0|^2}{2\sigma^2_i} \right)<br />
</math></center><br />
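Putting the two definitions together, the compatibility tables for one pair of neighboring nodes can be sketched as follows; the candidate patches, overlap size, and noise parameters here are toy stand-ins (numpy assumed).<br />

```python
import numpy as np

rng = np.random.default_rng(2)
L, Mc, overlap, img_dim = 10, 10, 10, 49      # candidates per node, overlap pixels
sigma_s, sigma_i = 0.5, 0.5                   # assumed noise standard deviations

# d_jk[l]: pixels of candidate l at node j lying in the overlap with node k;
# d_kj[m]: the corresponding pixels of candidate m at node k.
d_jk = rng.standard_normal((L, overlap))
d_kj = rng.standard_normal((Mc, overlap))

# psi[l, m] = exp(-|d_jk^l - d_kj^m|^2 / (2 sigma_s^2)), an L x Mc table.
diff = d_jk[:, None, :] - d_kj[None, :, :]
psi = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma_s ** 2))

# phi[l] = exp(-|y_k^l - y_0|^2 / (2 sigma_i^2)) for an observed patch y_0.
y_cand = rng.standard_normal((L, img_dim))    # image patches of the scene candidates
y0 = rng.standard_normal(img_dim)
phi = np.exp(-np.sum((y_cand - y0) ** 2, axis=1) / (2 * sigma_i ** 2))
```

These tables play exactly the roles of <math>\psi</math> and <math>\phi</math> in Eq(1), with the discrete state of each node ranging over its candidate set.<br />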
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and the “scene” to be estimated is its high resolution version. At first glance the task may seem impossible, since the high resolution data is missing. However, the human eye can identify edges and sharp details in a low resolution image, and we know this structural information should persist at higher resolution. The authors attempt to solve this problem using the aforementioned Markov model, and they name the method VISTA. <br />
There are some preprocessing steps that increase the efficiency of the training set. First, consider a three-scale Laplacian pyramid decomposition. The first sub-band, H, represents the high frequency detail, while the second and third sub-bands contain the middle, M, and low, L, frequency components. The assumption is that the high frequency band, H, is conditionally independent of the lower frequency bands given the middle frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high frequency components we need only the middle frequency details, M, not the low frequency band, L. This hypothesis greatly reduces the computational cost. Second, the researchers assume that the statistical relationships between image bands are independent of image contrast. Accordingly, they take the absolute value of the mid-frequency band and pass it through a lowpass filter, resulting in a contrast-normalized mid frequency band; the same procedure is applied to the high-frequency band. Figure .. and … show the original “scene”, the input “image”, and the contrast normalized version of the high frequency components. <br />
The “image” and “scene” images are divided into local patches. Choosing the patch size is a difficult task, since small patches give very little information for estimating the underlying “scene” patches. On the other hand, large patches would make the learning process of <math>\phi</math>’s very complex. The authors use a 7*7 patch size for the low frequency band and 3*3 for the high frequency components.<br />
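One plausible reading of the contrast-normalization step is: estimate local contrast as a lowpassed absolute value of the band, then divide the band by it. The sketch below follows that reading, with a simple box filter standing in for the lowpass; both the filter choice and the epsilon are assumptions, not the authors' exact procedure (numpy assumed).<br />

```python
import numpy as np

def box_blur(a, r=2):
    """Simple separable-free box filter as a stand-in lowpass filter."""
    k = 2 * r + 1
    pad = np.pad(a, r, mode="edge")
    out = np.zeros_like(a, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def contrast_normalize(band, eps=0.01):
    """Divide a band by a lowpassed |band|, making its statistics
    (approximately) independent of the local image contrast."""
    local_contrast = box_blur(np.abs(band)) + eps
    return band / local_contrast, local_contrast

rng = np.random.default_rng(3)
mid = rng.standard_normal((32, 32))           # stand-in mid-frequency band
mid_n, c = contrast_normalize(mid)
```

Under this reading, scaling the band by a constant leaves the normalized band (nearly) unchanged, which is the stated goal of making the band statistics contrast independent.<br />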
<br />
Figure 6 [[File:MRF6.jpg|center|Fig.6 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]]<br />
<br />
<br />
Figure5 [[File:Mrf9.jpg|center|Fig.6 .]]<br />
<br />
<br />
Figure5 [[File:Mrf7.jpg|center|Fig.6 .]]</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14737markov Random Fields for Super-Resolution2011-11-12T18:00:58Z<p>Hyeganeh: </p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks, the unknown parameters are learned from training data in the learning phase; then, in the inference phase, the “scene” estimate is made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)<br />
</math></center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions learned from training data; the authors prefer to call them compatibility functions. One can then write the MMSE and MAP estimates for <math>\hat{x}_j</math> by marginalizing or maximizing, respectively, over all other variables in the posterior probability. For discrete variables the expressions are:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} x_j \sum_{x_i, i \neq j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2)<br />
</math></center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \left( \max_{x_i, i \neq j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) \right) (3)<br />
</math></center><br />
<br />
For large networks, Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is much easier for networks which are trees or chains.<br />
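To make the cost of direct evaluation concrete, the following is a minimal brute-force sketch of Eqs. (1)–(3) for a tiny four-node chain with made-up potential tables (the values of <math>\psi</math> and <math>\phi</math> here are illustrative stand-ins, not learned from data). Enumerating every joint state, as done below, is exactly what becomes infeasible for large networks.<br />

```python
import numpy as np
from itertools import product

S = 3                                        # discrete states per scene node
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (S, S))
psi = (A + A.T) / 2                          # symmetric pairwise potential psi(x_i, x_j)
phi = rng.uniform(0.1, 1.0, (4, S))          # phi(x_k, y_k) with the observed y fixed

def joint(x):
    """Unnormalized joint of Eq. (1) for a chain: product of psi and phi terms."""
    p = np.prod([psi[x[i], x[i + 1]] for i in range(len(x) - 1)])
    return p * np.prod([phi[k, xk] for k, xk in enumerate(x)])

states = list(product(range(S), repeat=4))   # all S^4 joint configurations
Z = sum(joint(x) for x in states)            # normalizer 1/c

def mmse(j):
    """Eq. (2): posterior mean of x_j, treating the state index as its value."""
    return sum(x[j] * joint(x) for x in states) / Z

def map_est(j):
    """Eq. (3): maximize the joint over all other nodes, then over x_j."""
    return max(range(S), key=lambda v: max(joint(x) for x in states if x[j] == v))
```

The exponential blow-up is visible in `states`: a network with <math>N</math> nodes needs <math>S^N</math> evaluations.<br />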
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks without loops, inference reduces to a simple “message-passing” rule which enables us to compute the MAP and MMSE estimates. For example, for the network in Fig.2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimate for node 1 is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{1MAP} & = & \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\<br />
& = & \arg\max_{x_1} \max_{x_2} \max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x_2,x_3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) \max_{x_2} \left( \psi(x_1,x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) \right)\\<br />
\end{matrix}</math></center><br />
<br />
Similar expressions hold for <math>\hat{x}_{2MAP}</math> and <math>\hat{x}_{3MAP}</math>. Equations (2) and (3) can be computed by iterating the following steps. The MAP estimate at node <math>j</math> is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
where <math>k</math> runs over all “scene” node neighbors of node <math>j</math>, and <math> M^k_j </math> is the message from node <math>k</math> to node <math>j</math>, calculated by:<br />
<br />
<center><math><br />
M^k_j = \max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{l \neq j} \hat{M}^l_k (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^l_k</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vectors of 1’s with the same dimension as <math>x_j</math>. Because the initial messages are all 1’s, at the first iteration the messages in the network are:<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6):<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15)<br />
</math></center><br />
<br />
Eventually the MAP estimate for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)<br />
</math></center><br />
<br />
The MMSE estimate, Eq(2), has an analogous formulation, with the <math>\max_{x_k}</math> of Eq(6) replaced by <math>\sum_{x_k}</math> and the <math>\arg\max_{x_j}</math> of Eq(5) replaced by <math>\sum_{x_j} x_j</math>.<br />
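The message-passing steps above can be sketched in code. The following is a minimal max-product belief propagation implementation of Eqs. (5)–(6) on the three-node chain of Fig.2, with illustrative (not learned) potential tables, cross-checked against the brute-force maximization of Eq. (3); on a loop-free network the two must agree.<br />

```python
import numpy as np
from itertools import product

S = 2                                        # states per scene node
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, (S, S))
psi = (A + A.T) / 2                          # symmetric pairwise potential psi(x_j, x_k)
phi = rng.uniform(0.1, 1.0, (3, S))          # phi(x_k, y_k) with the observed y fixed

neighbors = {0: [1], 1: [0, 2], 2: [1]}      # chain 0 - 1 - 2
M = {(k, j): np.ones(S) for j in neighbors for k in neighbors[j]}  # messages k -> j, init 1's

for _ in range(3):                           # iterate Eq. (6); the chain converges in 2 steps
    new = {}
    for (k, j) in M:
        msgs = [M[(l, k)] for l in neighbors[k] if l != j]
        incoming = np.prod(msgs, axis=0) if msgs else np.ones(S)
        # M^k_j(x_j) = max_{x_k} psi(x_j, x_k) phi(x_k, y_k) prod_{l != j} M^l_k(x_k)
        new[(k, j)] = np.max(psi * (phi[k] * incoming)[None, :], axis=1)
    M = new

def map_state(j):
    """Eq. (5): local MAP from phi times the product of incoming messages."""
    belief = phi[j] * np.prod([M[(k, j)] for k in neighbors[j]], axis=0)
    return int(np.argmax(belief))

def joint(x):                                # brute-force reference, Eq. (1)
    return psi[x[0], x[1]] * psi[x[1], x[2]] * phi[0, x[0]] * phi[1, x[1]] * phi[2, x[2]]

best = max(product(range(S), repeat=3), key=joint)
```

Replacing `np.max` with `np.sum` in the message update (and taking the belief-weighted mean instead of the argmax) gives the corresponding MMSE version.<br />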
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for networks without loops, several works (Weiss, 1998; Weiss and Freeman, 1999; Yedidia et al., 2000) demonstrate that applying the same propagation rules often works well even in networks with loops; Fig.3 summarizes these results. [[File:Mrf3.jpg|thumb|right|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]]<br />
<br />
== General view of the paper ==<br />
<br />
Basically, this paper develops a new super-resolution scheme utilizing Markov random fields, by which a given low-resolution image is resized to the required high-resolution size while reproducing high-frequency detail. Typical interpolation techniques such as bilinear, nearest-neighbor and bicubic interpolation expand the size of the low-resolution image, but the result suffers from blurriness and, in some cases, blocking artifacts.<br />
<br />
According to Freeman, in this method many pairs of high-resolution images and their corresponding low-resolution versions are collected as the training set. Given a low-resolution image y, a typical bicubic interpolation algorithm is employed to create a high-resolution-sized image, which is interpreted as the “image” in the Markov network. Of course this image lacks true high-frequency detail, so the goal is to estimate the original high-resolution image, which we may call the “scene” based on the definitions above. To do that, the images in the training set, and also the interpolated input image, are divided into patches so that each patch represents a Markov network node (Fig.1). The <math>y_i</math>’s in the figure are observed, and thus should be shaded. Subsequently, for each patch in y, the 10 or 20 nearest patches from the training database are selected using Euclidean distance. The remaining job is to find the best patch in the candidate set for each patch in y using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
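The candidate-selection step described above can be sketched as follows; the random arrays stand in for real flattened patch data, and the patch dimensions are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
training_patches = rng.normal(size=(500, 49))   # 500 flattened 7x7 training patches
input_patch = rng.normal(size=49)               # one flattened patch of the input y

def nearest_candidates(patch, train, K=10):
    """Indices of the K training patches closest to `patch` in Euclidean distance."""
    d = np.linalg.norm(train - patch, axis=1)   # distance to every training patch
    return np.argsort(d)[:K]                    # K best matches, nearest first

idx = nearest_candidates(input_patch, training_patches)
```

In the full method this is run for every patch of the interpolated input, and the returned candidates become the discrete states of the corresponding scene node.<br />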
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, so the complete representation is cumbersome. In this research, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> must be determined. A nice idea is to define Gaussian mixtures over the joint spaces <math>x_i \times x_j</math> and <math>x_j \times x_k</math>; however, this is very difficult. The authors prefer a discrete representation, where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
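A minimal sketch of the PCA step: flattened patches are projected onto the top-d principal components to obtain a compact representation. The patch data and the choice of d below are illustrative, not taken from the paper.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 49))            # 200 flattened 7x7 patches (stand-in data)

def pca_fit(X, d):
    """Mean and top-d principal directions of the patch set, via SVD."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d]                         # rows of Vt[:d] are the basis vectors

def pca_project(X, mean, basis):
    """Low-dimensional coefficients of each patch in the PCA basis."""
    return (X - mean) @ basis.T

mean, basis = pca_fit(patches, d=8)
coeffs = pca_project(patches, mean, basis)
```

The distances and potentials in the following sections can then be computed on these low-dimensional coefficients instead of raw pixels.<br />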
<br />
For each patch in the "image" y, a set of 10 or 20 "scene" candidates from the training set is chosen. Fig.4 [[File:MRF4.jpg|thumb|right|Fig.4 The "image" patch and "scene" are divided into patches. For each "image" patch a collection of candidate scene patches from the training database is chosen. The final task is to find the best patch using inference on Markov networks.]] illustrates an example of a patch in y and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions can be defined arbitrarily, but they should be chosen wisely. In this paper, a simple approach is used. The authors assume that neighboring “scene” patches overlap (shown in figure ….).<br />
Therefore, the scene patches themselves may be used to define the potential functions <math>\psi</math> between the nodes <math>x_i</math>. Recall that for a node <math>x_j</math> and its neighbor <math>x_k</math> there are two sets of candidate patches. Suppose the lth candidate at node j and the mth candidate at node k share an overlap region. We can think of the pixels of the lth candidate in the overlap region, <math>d^l_{jk}</math>, and the corresponding pixels of the mth candidate, <math>d^m_{kj}</math>, as noisy versions of each other, so the potential function between node <math>x_j</math> and node <math>x_k</math> is given by:<br />
<br />
<center><math><br />
\psi(x^l_j,x^m_k) = \exp \left( \frac{-|d^l_{jk}-d^m_{kj}|^2}{2\sigma^2_s} \right)<br />
</math></center><br />
<br />
where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" original high-resolution image by Gaussian noise with covariances <math>\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, the <math>\psi</math> function between nodes j and k can be represented by a matrix whose lth-row, mth-column entry is <math>\psi(x^l_j, x^m_k)</math>.<br />
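The compatibility matrix just described can be computed in one vectorized step; a sketch, with random stand-ins for the overlap-region pixels and an arbitrary choice of <math>\sigma_s</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d_jk = rng.normal(size=(10, 6))     # overlap pixels of node j's 10 candidates (stand-in)
d_kj = rng.normal(size=(10, 6))     # overlap pixels of node k's 10 candidates (stand-in)
sigma_s = 1.0                       # assumed scene-noise parameter

def psi_matrix(d_jk, d_kj, sigma_s):
    """(l, m) entry compares candidate l at node j with candidate m at node k
    on their shared overlap pixels, as in the formula above."""
    diff = d_jk[:, None, :] - d_kj[None, :, :]          # (L, M, P) pairwise differences
    return np.exp(-np.sum(diff**2, axis=2) / (2 * sigma_s**2))

psi = psi_matrix(d_jk, d_kj, sigma_s)
```

Each row l of `psi` is then used (after transposition as needed) in the message updates of Eq. (6).<br />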
The potential function <math>\phi</math>, defined between a "scene" node x and an "image" node y, is determined by another intuitive assumption. We say that a "scene" candidate <math>x^l_k</math> is compatible with an observed image patch <math>y_0</math> if the image patch <math>y^l_k</math> associated with the scene candidate <math>x^l_k</math> in the training database matches <math>y_0</math>. Of course it will not match exactly, but we may suppose that the training data is a “noisy” version of the original image, resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = \exp \left( \frac{-|y^l_k - y_0|^2}{2\sigma^2_i} \right)<br />
</math></center><br />
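A corresponding sketch of the scene-image compatibility: each candidate's associated training image patch is compared with the observed patch <math>y_0</math>. The data and the image-noise parameter <math>\sigma_i</math> are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
y_candidates = rng.normal(size=(10, 49))   # training image patches of the 10 candidates
y0 = rng.normal(size=49)                   # observed patch at this node (stand-in)
sigma_i = 1.0                              # assumed image-noise parameter

def phi_vector(y_candidates, y0, sigma_i):
    """phi value for each scene candidate at one node, per the formula above."""
    d2 = np.sum((y_candidates - y0)**2, axis=1)
    return np.exp(-d2 / (2 * sigma_i**2))

phi = phi_vector(y_candidates, y0, sigma_i)
```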
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and the “scene” to be estimated is its high-resolution version. At first glance the task may seem impossible, since the high-resolution data is missing. However, the human eye can identify edges and sharp detail in a low-resolution image, and we know this structural information should persist at higher resolution. The authors attempt to solve this problem using the aforementioned Markov model, naming the method VISTA. <br />
There are some preprocessing steps that increase the efficiency of the training set. First, consider a three-scale Laplacian pyramid decomposition. The first sub-band, H, represents the high-frequency detail, while the second and third sub-bands contain the middle (M) and low (L) frequency components. The assumption is that the high-frequency band H is conditionally independent of the lower-frequency bands given the middle-frequency band M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high-frequency components we only need the middle-frequency details M, not the low-frequency band L. This hypothesis greatly reduces the computational cost. Second, the researchers assume that the statistical relationships between image bands are independent of image contrast. They therefore take the absolute value of the mid-frequency band and pass it through a lowpass filter, and use the result to contrast-normalize that band; the same procedure is applied to the high-frequency band. Figures .. and … show the original “scene”, the input “image”, and the contrast-normalized versions of the high-frequency components. <br />
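A hedged sketch of this contrast-normalization step: estimate local contrast by lowpass-filtering the absolute band, then divide the band by it. The box-filter size and the small epsilon are our assumptions, not taken from the paper.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
mid_band = rng.normal(size=(32, 32))             # stand-in mid-frequency band

def local_contrast(band, size=5, eps=0.01):
    """Lowpassed absolute value of the band (separable box filter);
    eps keeps the later division well-defined."""
    k = np.ones(size) / size
    e = np.abs(band)
    for axis in (0, 1):                          # filter rows, then columns
        e = np.apply_along_axis(np.convolve, axis, e, k, mode="same")
    return e + eps

normalized = mid_band / local_contrast(mid_band)
```

Dividing by the same contrast estimate at inference time, then multiplying back after prediction, keeps the learned band statistics contrast-independent.<br />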
The “image” and “scene” images are divided into local patches. Choosing the patch size is a difficult trade-off: small patches give very little information for estimating the underlying “scene” patches, while large patches make learning the <math>\phi</math>’s very complex. The authors use a 7×7 patch size for the low-frequency band and 3×3 for the high-frequency components.</div>
<hr />
<div></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:MRF4.jpg&diff=14735File:MRF4.jpg2011-11-12T17:44:27Z<p>Hyeganeh: </p>
<hr />
<div></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14733markov Random Fields for Super-Resolution2011-11-12T17:38:52Z<p>Hyeganeh: </p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)<br />
</math></center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2)<br />
</math></center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network wich are trees or chains.<br />
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate. For example for the network in Fig.2 [[File:MRF2.jpg|thumb|right|Fig.2 Example Markov network without any loop, used for belief propagation example described in text.]] the MAP estimation for node <math>j</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{MAP} & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\<br />
& = & \arg\max_{x_1} ( max_{x_2} max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x2,x3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) (max_{x_2} \psi(x_1,x_2) \phi(x_2,y2)) (max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)\\<br />
\end{matrix}</math></center><br />
<br />
The similar expressions for <math>x_{2MAP}</math> and <math>x_{3MAP}</math> can be used. Equations (3) and (2) can be computed by iterating the following steps. The MAP estimate at node j is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
Where k runs over all “scene” node neighbors of node j, and <math> M^k_j </math> is the message from node k to node j. The <math>M^k_j</math> message is calculated by:<br />
<br />
<center><math><br />
M^k_j = max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{i!=j} \hat{M}^l_k (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^k_l</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vector of 1’s, with the same dimension as <math>x_j</math>. Because the initial messages are 1’s at the first iteration, all the message in the network are:<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6) :<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) * max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15)<br />
</math></center><br />
<br />
Eventually the MAP estimates for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)<br />
</math></center><br />
<br />
The MMSE estimate Eq(3) has analogous formulate with the <math>max_{x_k}</math> of Eq(8) replaced by <math>\sum_{x_k}</math> and <math>\arg\max_{x_j}</math> replaced by <math>\sum_{x_j} x_j</math>.<br />
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for the networks without loops, (Weiss, 1998, Weiss and Freeman 1999;Yedidia et al. 2000) demonstrate applying the propagation rules works even in the network with loops Fig.3 [[File:Mrf3.jpg|thumb|right|Fig.3 Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.]]<br />
<br />
== General view of the paper ==<br />
<br />
Basically this paper aims to develop a new super-resolution scheme utilizing Markov random fields by which given low resolution image is resized to required high resolution size reproducing high frequency details. Typical interpolation techniques such as bilinear, nearest neighbor and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and in some cases blocking artifact.<br />
<br />
According to Freeman, in this method they collect many pairs of high resolution images and their corresponding low-resolution images as the training set. Given low-resolution image y a typical bicubic interpolation algorithm is employed to create a high resolution image which is interpreted as the “image” in Markov network. Of course this image does not look suitable, and thus they try to estimate original high resolution image which in we may call it “scene” image based on the definitions provided above. In order to do that, the images in training set and also the low-resolution image is divided into patches so that each patch represents the Markov network node (figure1). Therefore, <math>y_i</math>’s in figure are observed, and thus should be shaded. Subsequently, for each patch in y 10 or 20 nearest patches from the training database are selected using Euclidian distance. The ultimate job is to find the best patch in the candidate set for each patch in y using MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, the principle component analysi (PCA) is applied for each patch to find a set of lower dimentional basis function. Moreover, potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over joint spaces <math>x_i*x_j </math> and <math> x_j*x_k </math>;however, it is very difficult. The authors prefer a discrete representation where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
<br />
For each patch in "image" y a set of 10 or 20 "scene" candidates from the training set are chosen. Figure.. illustrates an example of each patch in y and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions are defined arbitrary, but they have to be introduced wisely. In this paper, a simple way is used to find potential functions. They assume “scene” patches have overlap shown is figure….<br />
Therefore, the scene patches themselves may be use to define potential functions <math>\psi</math> between nodes <math>x_i</math>’s. Recall for node <math>x_i</math> and its neighbor <math>x_j</math> there are two sets of candidate patches. Let assume the lth candidate in node j and the mth candidate in node k have some overlap. Also, we can think that the pixels in the overlapped region in <math>x_j</math> (<math>d^l_{kj}</math>) and their correspondent in <math>x_k</math> (<math>d^m_{jk}</math>) are some variation of each other , and thus eventually the potential function between node <math>x_j</math> and node <math>x_k</math> are given by:<br />
<br />
<center><math><br />
\psi(x^l_k,x^m_j) = exp \frac{-|d^l_{jk}-d^m_{kj}|}{2\sigma^2_s}<br />
</math></center><br />
<br />
Where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" or original high resolution image by Gaussian noise with covariance <math\sigma_i</math> and <math>\sigma_s</math>, respectively. Threfore, <math>\psi</math> function between node j and k can be represented by a matrix whose lth row and mth column is <math>\psi(x^l_k, x^m_j)</math>.<br />
Potential function <math>\phi</math> which is defined between "scene" node x and "image" node y is determined based on another intuitive assumption. We say that a "scene" candidate <math>x^k_l</math> is compatible with an observed image patch <math>y_0</math> if the image patch, <math>y^k_l</math>, associated with the scene candidate <math>x^k_l</math> in the training database matches <math>y_0</math>. Of course, it will not exactly match, but we may suppose that the training data is “noisy” version of original image resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = exp \frac{-|y^l_k - y_0|^2}{2\sigma^2_s}<br />
</math></center><br />
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and thus the “scene” to be estimated is its high resolution version. At the first glance, the task may seem impossible since the high resolution data is missing. However, the human eye is able to identify edges and sharp details in low resolution image and we know these structural information should remain at higher resolution level. The authors attempt to solve this problem using aforementioned Markov model and they name the method VISTA. <br />
There are some preprocessing steps in order to increase the efficiency of the training set. First, consider three scales Laplacian pyramid decomposition. The first sub-band, H, represents the detail in high frequency while the second and the thirs sub-bands indicate the middle, M, and the low, L, frequency components. The assumption is that high frequency band, H, is conditionally independent of the lower frequency bands, given the middle frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high frequency components, we will need the middle frequency details, M, not the low frequency band, L. This hypothesis greatly reduces the computation costs. Second, The reaserchers in this paper assume that statistical relationships between image bands are independent of image contrast. Furtheremore, they take the absolute value of the mid-frequency band, and then pass it through a lowpass filter resulting in a normalized mid frequency band. They also do the same procedure for high-frequency band. Figure .. and … show the original “scene” and the input “image” and also contrasr normalized version of the high frequency components. <br />
The “image” and “scene” images are divided into local patches. The patch size is a difficult task since choosing small size gives very little information for estimating the underlying “scene” patches. On the other hand, large patches would make the learning processs of <math>\phi</math>’s very complex. The authors use 7*7 patch size for low frequency band and 3*3 for high frequency components.</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Mrf3.jpg&diff=14732File:Mrf3.jpg2011-11-12T17:33:50Z<p>Hyeganeh: </p>
<hr />
<div></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:MRF2.jpg&diff=14731File:MRF2.jpg2011-11-12T17:30:07Z<p>Hyeganeh: </p>
<hr />
<div></div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14729markov Random Fields for Super-Resolution2011-11-12T16:32:06Z<p>Hyeganeh: /* Markov Networks for low-level vision problems */</p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the "image" and "scene" are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:MRF1.jpg|thumb|right|Fig.1 Markov network for vision problems. Each node in the network describes a local patch of image or scene. Observations, y, have underlying scene explanations, x. Lines in the graph indicate statistical dependencies between nodes.]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)<br />
</math></center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2)<br />
</math></center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network wich are trees or chains.<br />
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate. For example for the network in Fig.3 the MAP estimation for node <math>j</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{1MAP} & = & \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) \qquad (4) \\<br />
& = & \arg\max_{x_1} \max_{x_2} \max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x_2,x_3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) \max_{x_2} \left[ \psi(x_1,x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) \right]\\<br />
\end{matrix}</math></center><br />
<br />
Similar expressions hold for <math>\hat{x}_{2MAP}</math> and <math>\hat{x}_{3MAP}</math>. Equations (2) and (3) can be computed by iterating the following steps. The MAP estimate at node <math>j</math> is<br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j \qquad (5)<br />
</math></center><br />
<br />
where <math>k</math> runs over all "scene" node neighbors of node <math>j</math>, and <math> M^k_j </math> is the message from node <math>k</math> to node <math>j</math>, calculated by:<br />
<br />
<center><math><br />
M^k_j = \max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{l \neq j} \hat{M}^l_k \qquad (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^l_k</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vectors of 1's, with the same dimension as <math>x_j</math>. Because the initial messages are all ones, at the first iteration all the messages in the network are:<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \qquad (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) \qquad (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) \qquad (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) \qquad (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq. (6):<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \hat{M}^3_2 \qquad (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) \qquad (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) \hat{M}^1_2 \qquad (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) \qquad (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) \qquad (15)<br />
</math></center><br />
<br />
Eventually the MAP estimate for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1) M^2_1 \qquad (16)<br />
</math></center><br />
<br />
The MMSE estimate, Eq. (2), has analogous formulas, with the <math>\max_{x_k}</math> of Eq. (6) replaced by <math>\sum_{x_k}</math> and the <math>\arg\max_{x_j}</math> of Eq. (5) replaced by <math>\sum_{x_j} x_j</math>.<br />
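The message-passing scheme of Eqs. (5) and (6) can be sketched for the three-node chain as follows; the potentials are made-up illustrative values, and the final assertion checks that, on this loop-free network, the messages reproduce the exhaustive MAP estimate:<br />

```python
import itertools
import math

# Max-product message passing (Eqs. 5-6) on the chain x1 - x2 - x3.
# STATES, psi and phi are illustrative made-up values, not from the paper.
STATES = [0, 1]
psi = {(a, b): 1.0 if a == b else 0.3 for a in STATES for b in STATES}
phi = [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}, {0: 0.2, 1: 0.8}]
nbrs = {0: [1], 1: [0, 2], 2: [1]}  # scene-node neighbours in the chain

# Initial messages are vectors of ones, as in the text.
M = {k: {j: {x: 1.0 for x in STATES} for j in nbrs[k]} for k in nbrs}

for _ in range(3):  # two synchronous sweeps already suffice on this chain
    M = {k: {j: {xj: max(psi[(xj, xk)] * phi[k][xk]            # Eq. (6)
                         * math.prod(M[l][k][xk] for l in nbrs[k] if l != j)
                         for xk in STATES)
                 for xj in STATES}
             for j in nbrs[k]}
         for k in nbrs}

def map_at(j):
    """Eq. (5): arg max over x_j of phi(x_j, y_j) times incoming messages."""
    return max(STATES, key=lambda xj: phi[j][xj]
               * math.prod(M[k][j][xj] for k in nbrs[j]))

def joint(x):
    """Unnormalized joint of Eq. (1) for the chain."""
    return (phi[0][x[0]] * phi[1][x[1]] * phi[2][x[2]]
            * psi[(x[0], x[1])] * psi[(x[1], x[2])])

# On a tree, max-product message passing is exact, so it must agree with
# exhaustive maximization of the joint.
best = max(itertools.product(STATES, repeat=3), key=joint)
assert [map_at(j) for j in range(3)] == list(best)
```

Replacing `max(...)` in the message update with a sum (and weighting by <math>x_j</math> at the read-out) gives the MMSE variant described above.<br />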
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for networks without loops, Weiss (1998), Weiss and Freeman (1999), and Yedidia et al. (2000) demonstrate that applying the same propagation rules often yields good results even in networks with loops.<br />
<br />
== General view of the paper ==<br />
<br />
This paper aims to develop a new super-resolution scheme, based on Markov random fields, by which a given low-resolution image is resized to the required high-resolution size while reproducing high-frequency details. Typical interpolation techniques such as nearest-neighbor, bilinear, and bicubic interpolation expand the size of the low-resolution image, yet the result suffers from blurriness and, in some cases, blocking artifacts.<br />
<br />
According to Freeman, the authors collect many pairs of high-resolution images and their corresponding low-resolution versions as the training set. Given a low-resolution image <math>y</math>, a standard bicubic interpolation algorithm is employed to create a high-resolution image, which is interpreted as the "image" in the Markov network. Of course, this image does not look sharp, so they try to estimate the original high-resolution image, which we may call the "scene" based on the definitions provided above. To do so, the images in the training set as well as the low-resolution input are divided into patches, each patch representing a Markov network node (Fig. 1); the <math>y_i</math>'s in the figure are observed, and thus shaded. Subsequently, for each patch in <math>y</math>, the 10 or 20 nearest patches from the training database are selected using the Euclidean distance. The remaining job is to find the best patch from this candidate set for each patch in <math>y</math> using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The "image" and "scene" are arrays of pixel values, so the complete representation is cumbersome. In this work, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> must be determined. A natural idea is to define Gaussian mixtures over the joint spaces <math>x_i \times x_j </math> and <math> x_j \times x_k </math>; however, this is very difficult. The authors prefer a discrete representation, in which the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
<br />
For each patch in the "image" <math>y</math>, a set of 10 or 20 "scene" candidates from the training set is chosen. Figure .. illustrates an example of a patch in <math>y</math> and the associated "scene" candidates.<br />
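The candidate-selection step can be sketched as below; `scene_candidates` is a hypothetical helper, and patches are represented as flat tuples of (possibly PCA-compressed) values:<br />

```python
import math

def euclidean(p, q):
    """Euclidean distance between two flattened patches."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def scene_candidates(obs_patch, training_pairs, n=10):
    """Pick the n training pairs whose image patch is closest to obs_patch;
    their scene halves form the candidate set for this node.
    training_pairs is a list of (image_patch, scene_patch) tuples."""
    ranked = sorted(training_pairs,
                    key=lambda pair: euclidean(pair[0], obs_patch))
    return [scene for image, scene in ranked[:n]]
```

Belief propagation then only has to choose among these <math>n</math> candidates per node, rather than among all possible patches.<br />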
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions can be defined arbitrarily, but they have to be chosen wisely. In this paper, a simple method is used: neighboring "scene" patches are assumed to overlap, as shown in figure ….<br />
The scene patches themselves may therefore be used to define the potential functions <math>\psi</math> between the nodes <math>x_i</math>. Recall that a node <math>x_j</math> and its neighbor <math>x_k</math> each carry a set of candidate patches. Suppose the lth candidate at node <math>k</math> and the mth candidate at node <math>j</math> overlap. We may regard the pixels of <math>x^l_k</math> lying in the overlap region, <math>d^l_{jk}</math>, and the corresponding pixels of <math>x^m_j</math>, <math>d^m_{kj}</math>, as noisy versions of each other, and thus the potential function between node <math>x_j</math> and node <math>x_k</math> is given by:<br />
<br />
<center><math><br />
\psi(x^l_k,x^m_j) = \exp \left( \frac{-|d^l_{jk}-d^m_{kj}|^2}{2\sigma^2_s} \right)<br />
</math></center><br />
<br />
where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the "ideal" original high-resolution image by Gaussian noise with covariances <math>\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, the <math>\psi</math> function between nodes <math>j</math> and <math>k</math> can be represented by a matrix whose lth row and mth column is <math>\psi(x^l_k, x^m_j)</math>.<br />
The potential function <math>\phi</math>, defined between a "scene" node <math>x</math> and an "image" node <math>y</math>, is determined by another intuitive assumption. We say that a "scene" candidate <math>x^l_k</math> is compatible with an observed image patch <math>y_0</math> if the image patch <math>y^l_k</math>, associated with the scene candidate <math>x^l_k</math> in the training database, matches <math>y_0</math>. Of course, it will not match exactly, but we may suppose that the training data is a "noisy" version of the original image, resulting in:<br />
<br />
<center><math><br />
\phi(x^l_k, y_k) = \exp \left( \frac{-|y^l_k - y_0|^2}{2\sigma^2_i} \right)<br />
</math></center><br />
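Under the two Gaussian-noise assumptions above, building the <math>\psi</math> matrix and the <math>\phi</math> vector for a node can be sketched as follows; the overlap-extraction helpers and the noise scales are hypothetical placeholders, since the paper does not give explicit values:<br />

```python
import math

# Assumed noise scales; in the paper sigma_s and sigma_i are parameters of
# the Gaussian noise model, not values given explicitly.
SIGMA_S = 1.0
SIGMA_I = 1.0

def psi_matrix(cands_k, cands_j, overlap_k, overlap_j):
    """Compatibility matrix between two neighbouring scene nodes:
    entry [l][m] compares the overlap pixels d^l_jk of candidate l at
    node k with the corresponding pixels d^m_kj of candidate m at node j.
    overlap_k / overlap_j are caller-supplied functions (hypothetical
    helpers) that extract the shared-region pixels from a patch."""
    return [[math.exp(-sum((a - b) ** 2
                           for a, b in zip(overlap_k(xk), overlap_j(xj)))
                      / (2 * SIGMA_S ** 2))
             for xj in cands_j]
            for xk in cands_k]

def phi_vector(cand_image_patches, y0):
    """phi for one node: compare each candidate's training image patch
    y^l_k with the observed patch y_0."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(yl, y0))
                     / (2 * SIGMA_I ** 2))
            for yl in cand_image_patches]
```

A perfectly matching overlap (or image patch) yields a compatibility of 1, and the value decays smoothly as the squared pixel difference grows.<br />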
<br />
== Super-Resolution ==<br />
<br />
For the super-resolution problem, the input is a low-resolution image, and the "scene" to be estimated is its high-resolution version. At first glance the task may seem impossible, since the high-resolution data is missing. However, the human eye is able to identify edges and sharp details in a low-resolution image, and we know this structural information should persist at the higher resolution. The authors solve this problem using the aforementioned Markov model and name the method VISTA. <br />
There are some preprocessing steps that increase the efficiency of the training set. First, consider a three-scale Laplacian pyramid decomposition. The first sub-band, H, represents the high-frequency detail, while the second and third sub-bands contain the middle, M, and low, L, frequency components. The assumption is that the high-frequency band, H, is conditionally independent of the lower-frequency bands given the middle-frequency band, M, yielding:<br />
<center><math><br />
P(H|M,L) = P(H|M)<br />
</math></center><br />
<br />
Hence, to predict the high-frequency components, we only need the middle-frequency details, M, not the low-frequency band, L. This hypothesis greatly reduces the computational cost. Second, the researchers assume that the statistical relationships between image bands are independent of image contrast. They therefore take the absolute value of the mid-frequency band and pass it through a lowpass filter to obtain a local contrast measure, which is used to produce a contrast-normalized mid-frequency band; the same procedure is applied to the high-frequency band. Figures .. and … show the original "scene", the input "image", and the contrast-normalized versions of the high-frequency components. <br />
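The contrast-normalization idea can be sketched in one dimension as below; the moving-average lowpass filter and the small stabilizing constant are assumptions standing in for the paper's unspecified filter:<br />

```python
def lowpass(signal, radius=1):
    """Simple moving-average lowpass filter (a stand-in for the paper's
    unspecified blur)."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def contrast_normalize(band, eps=0.01):
    """Divide a band-pass signal by a blurred copy of its magnitude,
    making its statistics roughly independent of local image contrast."""
    envelope = lowpass([abs(v) for v in band])
    return [v / (e + eps) for v, e in zip(band, envelope)]
```

After this step, a high-contrast edge and a faint copy of the same edge map to nearly identical normalized patches, so a single training example can serve both.<br />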
The "image" and "scene" images are divided into local patches. Choosing the patch size is a difficult task: a small size gives very little information for estimating the underlying "scene" patches, while large patches make the learning process of the <math>\phi</math>'s very complex. The authors use a 7×7 patch size for the low-frequency band and 3×3 for the high-frequency components.</div>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer the unobserved image called “scene” from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate initial guess for “scene”, and the latter propagates the estimation. These problems are so-called low-level vision problems. In this paper the authors try to exploit training method using “image”/ “scene” pairs and apply the Bayesian inference of graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on super-resolution application where the problem is to estimate high resolution details from low resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is Markov networks. For a given "image", <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability, <math> P(x|y)= cP(x,y)</math> is calculated considering the fact that the parameter <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error, MMSE, or the maximum a posterior, MAP. Without any approximation the <math>\hat{x}</math> is difficult to compute. Therefore the “image” and “scene” are divided into patches and one node of the Markov network is assigned to each patch. Figure [[File:1]] depicts the undirected graphical model for mentioned problem where the nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as its neighbors. <br />
<br />
To make use of Markov networks the unknown parameters should be learned from training data in learning phase, and then in inference phase the “scene” estimation can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)<br />
</math></center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and they are leaned from training data. In this paper the authors prefer to call these functions compatibility functions. Then one can write the MAP and the MMSE estimates for <math\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expression is:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} \sum_{all x_i, i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) (2)<br />
</math></center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} (max_{all x_i != j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks the computation of Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for network wich are trees or chains.<br />
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loop the inference is the simple “message-passing” rule which enables us to compute MAP and MMSE estimate. For example for the network in Fig.3 the MAP estimation for node <math>j</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{MAP} & = & \arg\max_{x_1} ( max_{x_2} max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\<br />
& = & \arg\max_{x_1} ( max_{x_2} max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x2,x3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) (max_{x_2} \psi(x_1,x_2) \phi(x_2,y2)) (max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3)\\<br />
\end{matrix}</math></center><br />
<br />
The similar expressions for <math>x_{2MAP}</math> and <math>x_{3MAP}</math> can be used. Equations (3) and (2) can be computed by iterating the following steps. The MAP estimate at node j is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
Where k runs over all “scene” node neighbors of node j, and <math> M^k_j </math> is the message from node k to node j. The <math>M^k_j</math> message is calculated by:<br />
<br />
<center><math><br />
M^k_j = max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{i!=j} \hat{M}^l_k (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^k_l</math> from the previous iteration. The initial <math>\hat{M}^k_j</math>'s are set to column vector of 1’s, with the same dimension as <math>x_j</math>. Because the initial messages are 1’s at the first iteration, all the message in the network are:<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6) :<br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) * max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15)<br />
</math></center><br />
<br />
Eventually the MAP estimates for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)<br />
</math></center><br />
<br />
The MMSE estimate Eq(3) has analogous formulate with the <math>max_{x_k}</math> of Eq(8) replaced by <math>\sum_{x_k}</math> and <math>\arg\max_{x_j}</math> replaced by <math>\sum_{x_j} x_j</math>.<br />
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for the networks without loops, (Weiss, 1998, Weiss and Freeman 1999;Yedidia et al. 2000) demonstrate applying the propagation rules works even in the network with loops figure...<br />
<br />
<br />
== General view of the paper ==<br />
<br />
Basically this paper aims to develop a new super-resolution scheme utilizing Markov random fields by which given low resolution image is resized to required high resolution size reproducing high frequency details. Typical interpolation techniques such as bilinear, nearest neighbor and bicubic expand the size of the low-resolution image, yet the result suffers from blurriness and in some cases blocking artifact.<br />
<br />
According to Freeman, in this method many pairs of high-resolution images and their corresponding low-resolution images are collected as the training set. Given a low-resolution image <math>y</math>, a standard bicubic interpolation algorithm is employed to create a high-resolution image, which is interpreted as the “image” in the Markov network. Of course this image lacks the true high-frequency detail, and thus they try to estimate the original high-resolution image, which we may call the “scene” image based on the definitions provided above. In order to do that, the images in the training set and also the low-resolution image are divided into patches so that each patch represents a Markov network node (figure 1). Therefore, the <math>y_i</math>’s in the figure are observed, and thus should be shaded. Subsequently, for each patch in <math>y</math>, 10 or 20 nearest patches from the training database are selected using Euclidean distance. The ultimate job is to find the best patch in the candidate set for each patch in <math>y</math> using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
<br />
== Implementation of low-level vision problems using Markov network ==<br />
<br />
The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over the joint spaces <math>x_i \times x_j</math> and <math>x_j \times x_k</math>; however, this is very difficult. The authors prefer a discrete representation, where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
<br />
For each patch in "image" <math>y</math>, a set of 10 or 20 "scene" candidates from the training set is chosen. Figure .. illustrates an example of a patch in <math>y</math> and the associated "scene" candidates.<br />
<br />
=== Learning the Potential (Compatibility) Functions ===<br />
<br />
The potential functions can be defined arbitrarily, but they have to be chosen wisely. In this paper, a simple way is used to find the potential functions: the “scene” patches are assumed to overlap, as shown in the figure.<br />
Therefore, the scene patches themselves may be used to define the potential functions <math>\psi</math> between the nodes <math>x_i</math>. Recall that for a node and its neighbor there are two sets of candidate patches. Assume the lth candidate at node <math>x_k</math> and the mth candidate at node <math>x_j</math> overlap. The pixels of the lth candidate at <math>x_k</math> that lie in the overlap region (<math>d^l_{jk}</math>) and the corresponding pixels of the mth candidate at <math>x_j</math> (<math>d^m_{kj}</math>) can be viewed as noisy versions of each other, and thus the potential function between node <math>x_j</math> and node <math>x_k</math> is given by:<br />
<br />
<center><math><br />
\psi(x^l_k,x^m_j) = \exp\left( \frac{-|d^l_{jk}-d^m_{kj}|^2}{2\sigma^2_s} \right)<br />
</math></center><br />
<br />
where <math>\sigma_s</math> has to be determined. The authors assume that the image and scene training samples differ from the “ideal” or original high-resolution image by Gaussian noise of covariance <math>\sigma_i</math> and <math>\sigma_s</math>, respectively. Therefore, the <math>\psi</math> function between nodes j and k can be represented by a matrix whose entry in the lth row and mth column is <math>\psi(x^l_k, x^m_j)</math>.<br />
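As a minimal numerical sketch of this construction, the compatibility matrix between two neighboring scene nodes can be built from the candidates' overlap pixels as follows. The candidate counts, overlap size, and <math>\sigma_s</math> value below are made-up illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy setup: L candidates at node k, M candidates at node j, and the pixel
# values each candidate takes in the shared overlap region. All sizes and
# sigma_s are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(3)
L, M, n_overlap = 10, 10, 21
sigma_s = 0.1

d_jk = rng.random((L, n_overlap))   # overlap pixels of the lth candidate at node k
d_kj = rng.random((M, n_overlap))   # overlap pixels of the mth candidate at node j

# psi[l, m] = exp(-|d_jk[l] - d_kj[m]|^2 / (2 sigma_s^2))
diff = d_jk[:, None, :] - d_kj[None, :, :]
psi = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma_s ** 2))
print(psi.shape)  # (10, 10): one compatibility value per candidate pair
```

The lth row, mth column of this matrix then plays the role of <math>\psi(x^l_k, x^m_j)</math> in the message-passing equations above.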
The potential function <math>\phi</math>, which is defined between a “scene” node x and an “image” node y, is determined based on another intuitive assumption. We say that a “scene” candidate <math>x^l_k</math> is compatible with an observed image patch <math>y_0</math> if the image patch <math>y^l_k</math>, associated with the scene candidate <math>x^l_k</math> in the training database, matches <math>y_0</math>. Of course it will not match exactly, but we may suppose that the training data is a “noisy” version of the original image, resulting in:<br />
Eq(23)</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=markov_Random_Fields_for_Super-Resolution&diff=14724markov Random Fields for Super-Resolution2011-11-12T02:42:28Z<p>Hyeganeh: Created page with "<center> A Summary on <br /> '''Markov Networks for Super-Resolution''' <br /> by <br /> W. T. Freeman and E. C. Pasztor <br /> </center> == Introduction == There are some appl..."</p>
<hr />
<div><center><br />
A Summary on <br /> <br />
'''Markov Networks for Super-Resolution''' <br /> <br />
by <br /><br />
W. T. Freeman and E. C. Pasztor <br /><br />
</center><br />
== Introduction ==<br />
There are some applications in computer vision in which the task is to infer an unobserved image, called the “scene”, from the observed “image”. Typically, estimating the entire “scene” image at once is too complex and infeasible, and thus a common approach is to process the image regions locally and then generalize the interpretations across space. The interpretation of images can be done by modeling the relationship between local regions of “images” and “scenes”, and between neighboring local “scene” regions. The former allows us to estimate an initial guess for the “scene”, and the latter propagates the estimate. These problems are so-called low-level vision problems. In this paper the authors exploit a training method using “image”/“scene” pairs and apply Bayesian inference on graphical models. The method is called VISTA, Vision by Image/Scene TrAining. The authors have shown the advantages of the proposed model in different practical applications. Here we focus on the super-resolution application, where the problem is to estimate high-resolution details from low-resolution images.<br />
<br />
== Markov Networks for low-level vision problems ==<br />
A common graphical model for low-level vision problems is the Markov network. For a given "image" <math>y</math>, the underlying "scene" <math>x</math> should be estimated. The posterior probability is <math> P(x|y)= cP(x,y)</math>, where <math> c = 1/P(y) </math> is a constant over <math>x</math>. The best scene estimate <math>\hat{x}</math> is the minimum mean squared error (MMSE) or the maximum a posteriori (MAP) estimate. Without any approximation, <math>\hat{x}</math> is difficult to compute. Therefore the “image” and “scene” are divided into patches, and one node of the Markov network is assigned to each patch. Figure [[File:1]] depicts the undirected graphical model for this problem, where nodes connected by lines indicate statistical dependencies. Each “scene” node is connected to its corresponding “image” node as well as to its neighbors. <br />
<br />
To make use of Markov networks, the unknown parameters should be learned from training data in a learning phase; then, in the inference phase, the “scene” estimate can be made. For a Markov random field, the joint probability over the “scene” <math>x</math> and the “image” <math> y</math> is given by:<br />
<center><math><br />
P(x_1,x_2,...,x_N,y_1,y_2,...,y_N) = \prod_{(i,j)} \psi(x_i,x_j) \prod_{k} \phi(x_k,y_k) (1)<br />
</math></center><br />
<br />
where <math>\psi</math> and <math>\phi</math> are potential functions and are learned from training data. In this paper the authors prefer to call these functions compatibility functions. One can then write the MMSE and MAP estimates for <math>\hat{x}_j</math> by marginalizing or taking the maximum over all other variables in the posterior probability, respectively. For discrete variables the expressions are:<br />
<br />
<center><math><br />
\hat{x}_{jMMSE} = \sum_{x_j} x_j \sum_{x_i,\, i \neq j} P(x_1,x_2,...,x_N | y_1,y_2,...,y_N) (2)<br />
</math></center><br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} ( \max_{x_i,\, i \neq j} P(x_1,x_2,...,x_N,y_1,y_2,...,y_N)) (3)<br />
</math></center><br />
<br />
For large networks, Eq(2) and Eq(3) are infeasible to evaluate directly; however, the task is easier for networks which are trees or chains.<br />
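To make the cost of direct evaluation concrete, the following sketch computes the MAP and MMSE estimates of Eqs (2)–(3) for a tiny three-node chain by enumerating all joint states. The state count and the random compatibility tables are made-up toy values, not from the paper; for <math>N</math> nodes with <math>S</math> states each, the same loop would need <math>S^N</math> evaluations.

```python
import itertools

import numpy as np

# Tiny chain x1 - x2 - x3 with 3 states per node. psi and phi are made-up
# toy compatibility tables, standing in for the learned functions.
rng = np.random.default_rng(0)
psi = rng.random((3, 3))    # psi(x_i, x_j), shared by both edges of the chain
phi = rng.random((3, 3))    # phi[k - 1] holds phi(x_k, y_k) for node k

def joint(x):
    """Unnormalized joint of Eq (1) for the three-node chain."""
    x1, x2, x3 = x
    return phi[0][x1] * phi[1][x2] * phi[2][x3] * psi[x1, x2] * psi[x2, x3]

states = list(itertools.product(range(3), repeat=3))   # 3^3 = 27 joint states

# Eq (3): MAP estimate by brute-force maximization over the joint.
x_map = max(states, key=joint)

# Eq (2): MMSE estimate for x1 -- the posterior mean of x1.
Z = sum(joint(x) for x in states)
x1_mmse = sum(x[0] * joint(x) for x in states) / Z
print(x_map, x1_mmse)
```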
<br />
=== Inference in Networks without loops ===<br />
<br />
For networks with no loops, inference reduces to a simple “message-passing” rule which enables us to compute the MAP and MMSE estimates. For example, for the three-node network in Fig.3 the MAP estimate for node <math>x_1</math> is determined by:<br />
<br />
<center><math>\begin{matrix}<br />
\hat{x}_{1MAP} & = & \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1,x_2,x_3,y_1,y_2,y_3) (4) \\<br />
& = & \arg\max_{x_1} \max_{x_2} \max_{x_3} \phi(x_1,y_1) \phi(x_2,y_2) \phi(x_3,y_3) \psi(x_1,x_2) \psi(x_2,x_3)\\<br />
& = & \arg\max_{x_1} \phi(x_1,y_1) \max_{x_2} \psi(x_1,x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3)\\<br />
\end{matrix}</math></center><br />
<br />
Similar expressions hold for <math>\hat{x}_{2MAP}</math> and <math>\hat{x}_{3MAP}</math>. Equations (2) and (3) can be computed by iterating the following steps. The MAP estimate at node j is <br />
<br />
<center><math><br />
\hat{x}_{jMAP} = \arg\max_{x_j} \phi(x_j, y_j) \prod_{k} M^k_j (5)<br />
</math></center><br />
<br />
where k runs over all “scene” node neighbors of node j, and <math> M^k_j </math> is the message from node k to node j. The message <math>M^k_j</math> is calculated by:<br />
<br />
<center><math><br />
M^k_j = \max_{x_k} \psi(x_j,x_k) \phi(x_k,y_k) \prod_{l \neq j} \hat{M}^l_k (6)<br />
</math></center><br />
<br />
where <math>\hat{M}^l_k</math> is <math>M^l_k</math> from the previous iteration, and l runs over the neighbors of node k other than j. The initial <math>\hat{M}^l_k</math>'s are set to column vectors of 1’s with the same dimension as <math>x_k</math>. Because the initial messages are all 1’s, at the first iteration the messages in the network are:<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) (7)<br />
</math></center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (8)<br />
</math></center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (9)<br />
</math></center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2) (10)<br />
</math></center><br />
<br />
The second iteration uses the messages above as the <math>\hat{M}</math> variables in Eq(6):<br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2)\hat{M}^3_2 (11)<br />
</math></center><br />
<center><math><br />
M^3_2 = \max_{x_3} \psi(x_2, x_3) \phi(x_3,y_3) (12)<br />
</math></center><br />
<center><math><br />
M^2_3 = \max_{x_2} \psi(x_3, x_2) \phi(x_2,y_2)\hat{M}^1_2 (13)<br />
</math></center><br />
<center><math><br />
M^1_2 = \max_{x_1} \psi(x_2, x_1) \phi(x_1,y_1) (14)<br />
</math></center><br />
<br />
And thus <br />
<br />
<center><math><br />
M^2_1 = \max_{x_2} \psi(x_1, x_2) \phi(x_2,y_2) \max_{x_3} \psi(x_2,x_3) \phi(x_3,y_3) (15)<br />
</math></center><br />
<br />
Eventually the MAP estimate for <math>x_1</math> becomes:<br />
<br />
<center><math><br />
\hat{x}_{1MAP} = \arg\max_{x_1} \phi(x_1,y_1)M^2_1 (16)<br />
</math></center><br />
<br />
The MMSE estimate, Eq(2), has an analogous formula, with the <math>\max_{x_k}</math> of Eq(6) replaced by <math>\sum_{x_k}</math> and the <math>\arg\max_{x_j}</math> of Eq(5) replaced by <math>\sum_{x_j} x_j</math>.<br />
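The message updates above can be sketched in a few lines. The sketch below runs the max-product messages of Eqs (7)–(16) on a toy three-node chain (the state count and random compatibility tables are illustrative assumptions, not values from the paper) and checks that, on a loop-free network, the result matches brute-force maximization of the joint:

```python
import itertools

import numpy as np

# Toy three-node chain: 3 states per node, made-up compatibility tables.
rng = np.random.default_rng(0)
psi = rng.random((3, 3))                 # psi(x_i, x_j) for both edges
phi = rng.random((3, 3))                 # phi[k - 1] holds phi(x_k, y_k)

# Eq (8): M^3_2(x_2) = max_{x_3} psi(x_2, x_3) phi(x_3, y_3)
M_3_2 = np.max(psi * phi[2], axis=1)

# Eqs (11)/(15): M^2_1(x_1) = max_{x_2} psi(x_1, x_2) phi(x_2, y_2) M^3_2(x_2)
M_2_1 = np.max(psi * (phi[1] * M_3_2), axis=1)

# Eq (16): MAP estimate for x_1 from the local evidence times the message.
x1_map = int(np.argmax(phi[0] * M_2_1))

# On a tree, max-product message passing is exact: compare with brute force.
def joint(x):
    x1, x2, x3 = x
    return phi[0][x1] * phi[1][x2] * phi[2][x3] * psi[x1, x2] * psi[x2, x3]

x_best = max(itertools.product(range(3), repeat=3), key=joint)
assert x1_map == x_best[0]
```

The message pass costs only a few small matrix operations per edge, instead of the exponential enumeration of the full joint.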
<br />
=== Networks with loops ===<br />
<br />
Although the belief propagation algorithm is derived for networks without loops, (Weiss, 1998; Weiss and Freeman, 1999; Yedidia et al., 2000) demonstrate that applying the same propagation rules often works well even in networks with loops (figure ...).<br />
<br />
<br />
== General view of the paper ==<br />
<br />
This paper aims to develop a new super-resolution scheme utilizing Markov random fields, by which a given low-resolution image is resized to the required high-resolution size while reproducing high-frequency details. Typical interpolation techniques such as bilinear, nearest-neighbor and bicubic interpolation expand the size of the low-resolution image, yet the result suffers from blurriness and, in some cases, blocking artifacts.<br />
<br />
According to Freeman, in this method many pairs of high-resolution images and their corresponding low-resolution images are collected as the training set. Given a low-resolution image <math>y</math>, a standard bicubic interpolation algorithm is employed to create a high-resolution image, which is interpreted as the “image” in the Markov network. Of course this image lacks the true high-frequency detail, and thus they try to estimate the original high-resolution image, which we may call the “scene” image based on the definitions provided above. In order to do that, the images in the training set and also the low-resolution image are divided into patches so that each patch represents a Markov network node (figure 1). Therefore, the <math>y_i</math>’s in the figure are observed, and thus should be shaded. Subsequently, for each patch in <math>y</math>, 10 or 20 nearest patches from the training database are selected using Euclidean distance. The ultimate job is to find the best patch in the candidate set for each patch in <math>y</math> using the MAP estimate. In other words, the estimated scene at each patch is always some example from the training set.<br />
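The candidate-selection step can be sketched as follows. The patch sizes, training-set size, and the helper name `scene_candidates` are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Hypothetical flattened patches: train_lo[i] is a low-resolution training
# patch and train_hi[i] the corresponding high-resolution ("scene") patch.
# All sizes and names here are illustrative assumptions.
rng = np.random.default_rng(1)
train_lo = rng.random((500, 49))   # 500 training patches of 7x7 pixels
train_hi = rng.random((500, 49))
y_patch = rng.random(49)           # one patch of the interpolated "image" y

def scene_candidates(y_patch, train_lo, train_hi, k=10):
    """Return the k "scene" patches whose low-resolution counterparts are
    nearest to y_patch in Euclidean distance."""
    d = np.linalg.norm(train_lo - y_patch, axis=1)
    return train_hi[np.argsort(d)[:k]]

cands = scene_candidates(y_patch, train_lo, train_hi, k=10)
print(cands.shape)  # (10, 49): ten candidate scene patches for this node
```

These candidates become the discrete states of the corresponding scene node, over which the MAP estimate is later taken.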
<br />
== Implementation ==<br />
<br />
The “image” and “scene” are arrays of pixel values, and so the complete representation is cumbersome. In this research, principal component analysis (PCA) is applied to each patch to find a set of lower-dimensional basis functions. Moreover, the potential functions <math>\psi</math> and <math>\phi</math> should be determined. A nice idea is to define Gaussian mixtures over the joint spaces <math>x_i \times x_j</math> and <math>x_j \times x_k</math>; however, this is very difficult. The authors prefer a discrete representation, where the most straightforward approach is to evenly sample all possible states of each image and scene variable at each patch. <br />
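A minimal sketch of this dimensionality-reduction step, using an SVD-based PCA; the patch and component counts are illustrative assumptions, and Freeman's actual preprocessing details may differ:

```python
import numpy as np

# Illustrative PCA on flattened patches: keep the leading principal
# components as a lower-dimensional basis. All sizes are assumptions.
rng = np.random.default_rng(2)
patches = rng.random((1000, 49))          # 1000 flattened 7x7 patches

mean = patches.mean(axis=0)
centered = patches - mean

# Principal directions are the right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
basis = Vt[:9]                            # keep the 9 leading components

codes = centered @ basis.T                # low-dimensional patch codes
recon = codes @ basis + mean              # approximate reconstruction
print(codes.shape)  # (1000, 9)
```

Each patch is then represented by its low-dimensional code rather than by its raw pixel values, which shrinks the tables the compatibility functions must be defined over.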
<br />
For each patch in "image" <math>y</math>, a set of 10 or 20 “scene” candidates from the training set is chosen. Figure .. illustrates an example of a patch in <math>y</math> and the associated "scene" candidates.</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat946presentation&diff=14713f11Stat946presentation2011-11-12T00:44:18Z<p>Hyeganeh: </p>
<hr />
<div>Sign up for your presentation in the following table.<br />
Choose a date between Nov 15 and Dec 1 (inclusive).<br />
You just need to sign up your name at the moment. When you choose the paper that you would like to present, add its title and <br />
a link to the paper. <br />
<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="5"<br />
|-<br />
|width="200pt"|Date<br />
|width="200pt"|Speaker<br />
|width="700pt"|Title<br />
|width="50pt"|Link<br />
|width="50pt"|Summary<br />
|-<br />
|-<br />
|-<br />
|Nov 15 (Presentation 1)|| Azin Ashkan || A Dynamic Bayesian Network Click Model for Web Search Ranking || [http://olivier.chapelle.cc/pub/DBN_www2009.pdf]||[[A Dynamic Bayesian Network Click Model for Web Search Ranking|Summary]]<br />
|-<br />
|-<br />
|Nov 15 (Presentation 2)|| Keyvan Golestan || Decentralised Data Fusion: A Graphical Model Approach || [http://isif.org/fusion/proceedings/fusion09CD/data/papers/0280.pdf]||[[Decentralised Data Fusion: A Graphical Model Approach (Summary)|Summary]]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 1)|| Venkata Manem || Quantifying cancer progression with conjunctive Bayesian networks.|| [http://bioinformatics.oxfordjournals.org/content/25/21/2809.full.pdf] || [[Quantifying cancer progression with conjunctive Bayesian networks.|Summary]]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 2)|| Mohammad Rostami ||Compressed Sensing Reconstruction via Belief Propagation ||[http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/csbpTR07142006.pdf]|| [[Compressed Sensing Reconstruction via Belief Propagation|Summary]]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 1)|| Mazen A. Melibari ||An HDP-HMM for Systems with State Persistence|| [http://www.cs.brown.edu/~sudderth/papers/icml08.pdf]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 2)||Tameem Adel|| Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Sub-cellular Location Patterns || [http://jmlr.csail.mit.edu/papers/volume9/chen08a/chen08a.pdf] || [[Graphical models for structured classification, with an application to interpreting images of protein subcellular location patterns|Summary]]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 1)|| Pouria Fewzee || Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis || [http://mi.eng.cam.ac.uk/~ky219/papers/yu-is10.pdf]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 2)|| Ali-Akbar Samadani || ||<br />
|-<br />
|-<br />
|Nov 29 (Presentation 1)||Hojatollah Yeganeh ||Markov Random Fields for Super-Resolution ||[http://www.merl.com/reports/docs/TR2000-08.pdf]||[[Markov Random Fields for Super-Resolution|Summary]]<br />
|-<br />
|-<br />
|Nov 29 (Presentation 2)||Areej Alhothali || Video-based face recognition using adaptive hidden markov models||[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211373]<br />
|}<br />
|}</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat946presentation&diff=14709f11Stat946presentation2011-11-11T21:58:09Z<p>Hyeganeh: </p>
<hr />
<div>Sign up for your presentation in the following table.<br />
Choose a date between Nov 15 and Dec 1 (inclusive).<br />
You just need to sign up your name at the moment. When you choose the paper that you would like to present, add its title and <br />
a link to the paper. <br />
<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="5"<br />
|-<br />
|width="200pt"|Date<br />
|width="200pt"|Speaker<br />
|width="700pt"|Title<br />
|width="50pt"|Link<br />
|width="50pt"|Summary<br />
|-<br />
|-<br />
|-<br />
|Nov 15 (Presentation 1)|| Azin Ashkan || A Dynamic Bayesian Network Click Model for Web Search Ranking || [http://olivier.chapelle.cc/pub/DBN_www2009.pdf]||[[A Dynamic Bayesian Network Click Model for Web Search Ranking|Summary]]<br />
|-<br />
|-<br />
|Nov 15 (Presentation 2)|| Keyvan Golestan || Decentralised Data Fusion: A Graphical Model Approach || [http://isif.org/fusion/proceedings/fusion09CD/data/papers/0280.pdf]||[[Decentralised Data Fusion: A Graphical Model Approach (Summary)|Summary]]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 1)|| Venkata Manem || Quantifying cancer progression with conjunctive Bayesian networks.|| [http://bioinformatics.oxfordjournals.org/content/25/21/2809.full.pdf] || [[Quantifying cancer progression with conjunctive Bayesian networks.|Summary]]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 2)|| Mohammad Rostami ||Compressed Sensing Reconstruction via Belief Propagation ||[http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/csbpTR07142006.pdf]|| [[Compressed Sensing Reconstruction via Belief Propagation|Summary]]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 1)|| Mazen A. Melibari ||An HDP-HMM for Systems with State Persistence|| [http://www.cs.brown.edu/~sudderth/papers/icml08.pdf]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 2)||Tameem Adel|| Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Sub-cellular Location Patterns || [http://jmlr.csail.mit.edu/papers/volume9/chen08a/chen08a.pdf] || [[Graphical models for structured classification, with an application to interpreting images of protein subcellular location patterns|Summary]]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 1)|| Pouria Fewzee || Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis || [http://mi.eng.cam.ac.uk/~ky219/papers/yu-is10.pdf]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 2)|| Ali-Akbar Samadani || ||<br />
|-<br />
|-<br />
|Nov 29 (Presentation 1)||Hojatollah Yeganeh ||Markov Random Fields for Super-Resolution ||[http://www.merl.com/reports/docs/TR2000-08.pdf]<br />
|-<br />
|-<br />
|Nov 29 (Presentation 2)||Areej Alhothali || Video-based face recognition using adaptive hidden markov models||[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211373]<br />
|}<br />
|}</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat946presentation&diff=14530f11Stat946presentation2011-11-10T16:42:23Z<p>Hyeganeh: </p>
<hr />
<div>Sign up for your presentation in the following table.<br />
Choose a date between Nov 15 and Dec 1 (inclusive).<br />
You just need to sign up your name at the moment. When you choose the paper that you would like to present, add its title and <br />
a link to the paper. <br />
<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="5"<br />
|-<br />
|width="200pt"|Date<br />
|width="200pt"|Speaker<br />
|width="700pt"|Title<br />
|width="50pt"|Link<br />
|width="50pt"|Summary<br />
|-<br />
|-<br />
|-<br />
|Nov 15 (Presentation 1)|| Azin Ashkan || A Dynamic Bayesian Network Click Model for Web Search Ranking || [http://olivier.chapelle.cc/pub/DBN_www2009.pdf]||[[A Dynamic Bayesian Network Click Model for Web Search Ranking|Summary]]<br />
|-<br />
|-<br />
|Nov 15 (Presentation 2)|| Keyvan Golestan || Decentralised Data Fusion: A Graphical Model Approach || [http://isif.org/fusion/proceedings/fusion09CD/data/papers/0280.pdf]||[[Decentralised Data Fusion: A Graphical Model Approach (Summary)|Summary]]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 1)|| Venkata Manem || Quantifying cancer progression with conjunctive Bayesian networks.|| [http://bioinformatics.oxfordjournals.org/content/25/21/2809.full.pdf] || [[Quantifying cancer progression with conjunctive Bayesian networks.|Summary]]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 2)|| Mohammad Rostami ||Compressed Sensing Reconstruction via Belief Propagation ||[http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/csbpTR07142006.pdf]|| [http://www.wikicoursenote.com/wiki/Compressed_Sensing_Reconstruction_via_Belief_Propagation#Approximate_solution_to_CS_statistical_inference_via_message_passing]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 1)|| Mazen A. Melibari ||An HDP-HMM for Systems with State Persistence|| [http://www.cs.brown.edu/~sudderth/papers/icml08.pdf]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 2)||Tameem Adel|| Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Sub-cellular Location Patterns || [http://jmlr.csail.mit.edu/papers/volume9/chen08a/chen08a.pdf]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 1)|| Pouria Fewzee || Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis || [http://mi.eng.cam.ac.uk/~ky219/papers/yu-is10.pdf]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 2)|| Ali-Akbar Samadani || ||<br />
|-<br />
|-<br />
|Nov 29 (Presentation 1)||Hojatollah Yeganeh || Learning low-level vision ||[http://www.springerlink.com/content/k5h56415318kwqx5/]<br />
|-<br />
|-<br />
|Nov 29 (Presentation 2)||Areej Alhothali || Video-based face recognition using adaptive hidden markov models||[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211373]<br />
|}<br />
|}</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat946presentation&diff=14415f11Stat946presentation2011-11-09T16:39:26Z<p>Hyeganeh: </p>
<hr />
<div>Sign up for your presentation in the following table.<br />
Choose a date between Nov 15 and Dec 1 (inclusive).<br />
You just need to sign up your name at the moment. When you choose the paper that you would like to present, add its title and <br />
a link to the paper. <br />
<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="5"<br />
|-<br />
|width="200pt"|Date<br />
|width="200pt"|Speaker<br />
|width="700pt"|Title<br />
|width="50pt"|Link<br />
|width="50pt"|Summary<br />
|-<br />
|-<br />
|-<br />
|Nov 15 (Presentation 1)|| Azin Ashkan || A Dynamic Bayesian Network Click Model for Web Search Ranking || [http://olivier.chapelle.cc/pub/DBN_www2009.pdf]||[[A Dynamic Bayesian Network Click Model for Web Search Ranking|Summary]]<br />
|-<br />
|-<br />
|Nov 15 (Presentation 2)|| Keyvan Golestan || Decentralised Data Fusion: A Graphical Model Approach || [http://isif.org/fusion/proceedings/fusion09CD/data/papers/0280.pdf]||[[Decentralised Data Fusion: A Graphical Model Approach (Summary)|Summary]]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 1)|| Venkata Manem || Quantifying cancer progression with conjunctive Bayesian networks.|| [http://bioinformatics.oxfordjournals.org/content/25/21/2809.full.pdf]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 2)|| Mohammad Rostami ||Compressed Sensing Reconstruction via Belief Propagation ||[http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/csbpTR07142006.pdf]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 1)|| Mazen A. Melibari ||An HDP-HMM for Systems with State Persistence|| [http://www.cs.brown.edu/~sudderth/papers/icml08.pdf]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 2)||Tameem Adel|| Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Sub-cellular Location Patterns || [http://jmlr.csail.mit.edu/papers/volume9/chen08a/chen08a.pdf]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 1)|| Pouria Fewzee || Context Adaptive Training with Factorized Decision Trees for HMM-Based Speech Synthesis || [http://mi.eng.cam.ac.uk/~ky219/papers/yu-is10.pdf]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 2)|| Ali-Akbar Samadani || ||<br />
|-<br />
|-<br />
|Nov 29 (Presentation 1)||Hojatollah Yeganeh || Learning low-level vision ||[http://people.csail.mit.edu/billf/papers/TR2000-05.pdf]<br />
|-<br />
|-<br />
|Nov 29 (Presentation 2)||Areej Alhothali || Video-based face recognition using adaptive hidden markov models||[http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1211373]<br />
|}<br />
|}</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat946presentation&diff=13875f11Stat946presentation2011-11-04T21:19:34Z<p>Hyeganeh: </p>
<hr />
<div>Sign up for your presentation in the following table.<br />
Choose a date between Nov 15 and Dec 1 (inclusive).<br />
You just need to sign up your name at the moment. When you choose the paper that you would like to present, add its title and <br />
a link to the paper. <br />
<br />
<br />
{| class="wikitable"<br />
<br />
{| border="1" cellpadding="4"<br />
|-<br />
|width="200pt"|Date<br />
|width="200pt"|Speaker<br />
|width="700pt"|Title<br />
|width="50pt"|Link<br />
|-<br />
|-<br />
|-<br />
|Nov 15 (Presentation 1)|| <br />
|-<br />
|-<br />
|Nov 15 (Presentation 2)|| Keyvan Golestan || Decentralised Data Fusion: A Graphical Model Approach || [http://isif.org/fusion/proceedings/fusion09CD/data/papers/0280.pdf]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 1)|| Venkata Manem || Gene finding with a hidden Markov model of genome structure and evolution|| [http://bioinformatics.oxfordjournals.org/content/19/2/219.full.pdf]<br />
|-<br />
|-<br />
|Nov 17 (Presentation 2)|| Mohammad Rostami ||Compressed Sensing Reconstruction via Belief Propagation ||[http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/csbpTR07142006.pdf]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 1)|| Mazen A. Melibari ||Learning the Structure of Deep Sparse Graphical Models|| [http://www.cs.toronto.edu/~rpa/papers/adams-wallach-ghahramani-2010a.pdf]<br />
|-<br />
|-<br />
|Nov 22 (Presentation 2)||Tameem Adel|| Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Sub-cellular Location Patterns || [http://jmlr.csail.mit.edu/papers/volume9/chen08a/chen08a.pdf]<br />
|-<br />
|-<br />
|Nov 24 (Presentation 1)|| Pouria Fewzee || ||<br />
|-<br />
|-<br />
|Nov 24 (Presentation 2)|| Ali-Akbar Samadani || ||<br />
|-<br />
|-<br />
|Nov 29 (Presentation 1)||Hojatollah Yeganeh ||Single-image super-resolution based on Markov random field and contourlet transform ||[http://spiedigitallibrary.org/jei/resource/1/jeime5/v20/i2/p023005_s1]<br />
|-<br />
|-<br />
|Nov 29 (Presentation 2)||Areej Alhothali || ||<br />
|-<br />
|-<br />
|Dec 1 (Presentation 1)|| || ||<br />
|-<br />
|-<br />
|Dec 1 (Presentation 2)|| Azin Ashkan || ||<br />
|}<br />
|}</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946f11&diff=13487stat946f112011-10-28T02:21:19Z<p>Hyeganeh: /* Jensen's Inequality */</p>
<hr />
<div>==[[f11stat946EditorSignUp| Editor Sign Up]]==<br />
==[[f11Stat946presentation| Sign up for your presentation]]==<br />
==[[f11Stat946ass| Assignments]]==<br />
==Introduction==<br />
===Motivation===<br />
Graphical probabilistic models provide a concise representation of various probabilistic distributions that are found in many<br />
real world applications. Some interesting areas include medical diagnosis, computer vision, language, analyzing gene expression <br />
data, etc. A problem related to medical diagnosis is, "detecting and quantifying the causes of a disease". This question can<br />
be addressed through the graphical representation of relationships between various random variables (both observed and hidden).<br />
This is an efficient way of representing a joint probability distribution.<br />
<br />
Graphical models are excellent tools for reducing the computational burden of probabilistic models. Suppose we want to model a binary image. If we have a 256 by 256 image then our distribution function has <math>2^{256*256}=2^{65536}</math> outcomes. Even very simple tasks such as marginalization of such a probability distribution over some variables can be computationally intractable, and the load grows exponentially with the number of variables. In practice and in real-world applications we generally have some kind of dependency or relation between the variables. Using such information can help us to simplify the calculations. For example, for the same problem, if all the image pixels can be assumed to be independent, marginalization can be done easily. Graphs are a good tool to depict such relations. Using some rules we can represent a probability distribution uniquely by a graph, and then it is easier to study the graph instead of the probability distribution function (PDF). We can take advantage of graph-theoretic tools to design algorithms. Though it may seem simple, this approach simplifies the computations and, as mentioned, helps us to solve many problems in different research areas.<br />
<br />
===Notation===<br />
<br />
We will begin with short section about the notation used in these notes.<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
===Example===<br />
Let <math>A = \{1,4\}</math>, so <math>X_A = \{X_1, X_4\}</math>; <math>A</math> is the set of indices for<br />
the r.v. <math>X_A</math>.<br /><br />
Also let <math>B = \{2\},\ X_B = \{X_2\}</math> so we can write<br />
<center><math>P( X_A | X_B ) = P( X_1 = x_1, X_4 = x_4 | X_2 = x_2 ).\,\!</math></center><br />
<br />
===Graphical Models===<br />
Graphical models provide a compact representation of the joint distribution, where the vertices (nodes) V represent random variables and the edges E represent dependencies between the variables. There are two forms of graphical models: directed and undirected. Directed graphical models (Figure 1) consist of arcs and nodes, where an arc indicates that the parent is an explanatory variable for the child. Undirected graphical models (Figure 2) are based on the assumption that two nodes, or two sets of nodes, are conditionally independent given their neighbors[http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 1].<br />
<br />
Similar types of analysis predate the area of Probabilistic Graphical Models and its terminology. ''Bayesian network'' and ''belief network'' are earlier terms used to describe a directed acyclic graphical model. Similarly, ''Markov random field'' (MRF) and ''Markov network'' are earlier terms used to describe an undirected graphical model. Probabilistic Graphical Models unite some of the theory behind these older models and allow for more generalized distributions than were possible in the previous methods.<br />
<br />
[[File:directed.png|thumb|right|Fig.1 A directed graph.]]<br />
[[File:undirected.png|thumb|right|Fig.2 An undirected graph.]]<br />
<br />
We will use graphs in this course to represent the relationship between different random variables. <br />
<br />
====Directed graphical models (Bayesian networks)====<br />
<br />
In the case of directed graphs, the direction of the arrow indicates "causation". This assumption makes these networks useful for cases in which we want to model causality, such as in computational biology and bioinformatics, where we study the effect of some variables on another variable. For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>A\,\!</math> "causes" <math>B\,\!</math>.<br />
<br />
In this case we must assume that our directed graphs are ''acyclic''. An example of an acyclic graphical model from medicine is shown in Figure 2a.<br />
[[File:acyclicgraph.png|thumb|right|Fig.2a Sample acyclic directed graph.]]<br />
<br />
Exposure to ionizing radiation (from sources such as CT scans and X-rays) and to environmental factors might lead to gene mutations that eventually give rise to cancer. Figure 2a can therefore be called a causation graph.<br />
<br />
If our causation graph contains a cycle then it would mean that for example:<br />
<br />
* <math>A</math> causes <math>B</math><br />
* <math>B</math> causes <math>C</math><br />
* <math>C</math> causes <math>A</math>, again. <br />
<br />
Clearly, this would confuse the order of the events. An example of a graph with a cycle can be seen in Figure 3. Such a graph could not be used to represent causation. The graph in Figure 4 does not have a cycle, so we can say that the node <math>X_1</math> causes, or affects, <math>X_2</math> and <math>X_3</math> while they in turn cause <math>X_4</math>.<br />
<br />
[[File:cyclic.png|thumb|right|Fig.3 A cyclic graph.]]<br />
[[File:acyclic.png|thumb|right|Fig.4 An acyclic graph.]]<br />
<br />
In directed acyclic graphical models each vertex represents a random variable; a random variable associated with one vertex is distinct from the random variables associated with other vertices. Consider the following example that uses boolean random variables. It is important to note that the variables need not be boolean and can indeed be discrete over a range or even continuous.<br />
<br />
Speaking about random variables, we can now refer to the relationship between random variables in terms of dependence. Therefore, the direction of the arrow indicates "conditional dependence". For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>B\,\!</math> "is dependent on" <math>A\,\!</math>.<br />
<br />
Note that if we do not have any conditional independence, the corresponding graph will be complete, i.e., all possible edges will be present, whereas if we have full independence our graph will have no edges. Between these two extreme cases there exists a large class of models. Graphical models are most useful when the graph is sparse, i.e., when only a small number of edges exist. The topology of the graph is important, and later we will see examples in which graph-theoretic tools solve probabilistic problems. This representation also makes it easier to model causality between variables in real-world phenomena.<br />
<br />
====Example====<br />
<br />
In this example we will consider the possible causes for wet grass. <br />
<br />
The wet grass could be caused by rain or by a sprinkler. Rain can be caused by clouds. On the other hand, one cannot say that clouds cause the use of a sprinkler; nevertheless, a dependence exists, because the presence of clouds affects whether or not a sprinkler will be used: if there are more clouds, there is a smaller probability that one will rely on a sprinkler to water the grass. As this example shows, the relationship between two variables can also act like a negative correlation. The corresponding graphical model is shown in Figure 5.<br />
<br />
[[File:wetgrass.png|thumb|right|Fig.5 The wet grass example.]]<br />
<br />
This directed graph shows the relation between the 4 random variables. If we have<br />
the joint probability <math>P(C,R,S,W)</math>, then we can answer many queries about this<br />
system.<br />
<br />
This all seems very simple at first, but then we must consider the fact that in the discrete case the joint probability function grows exponentially with the number of variables. If we consider the wet grass example once more, we can see that we need to define <math>2^4 = 16</math> different probabilities for this simple example. The table below, which contains all of the probabilities and their corresponding boolean values for each random variable, is called an ''interaction table''.<br />
<br />
'''Example:'''<br />
<center><math>\begin{matrix}<br />
P(C,R,S,W):\\<br />
p_1\\<br />
p_2\\<br />
p_3\\<br />
.\\<br />
.\\<br />
.\\<br />
p_{16} \\ \\<br />
\end{matrix}</math></center><br />
<br /><br /><br />
<center><math>\begin{matrix}<br />
~~~ & C & R & S & W \\<br />
& 0 & 0 & 0 & 0 \\<br />
& 0 & 0 & 0 & 1 \\<br />
& 0 & 0 & 1 & 0 \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& 1 & 1 & 1 & 1 \\<br />
\end{matrix}</math></center><br />
<br />
Now consider an example where there are not 4 such random variables but 400. The interaction table would become too large to manage. In fact, it would require <math>2^{400}</math> rows! The purpose of the graph is to help avoid this intractability by considering only the variables that are directly related. In the wet grass example Sprinkler (S) and Rain (R) are not directly related. <br />
<br />
To solve the intractability problem we need to consider the way those relationships are represented in the graph. Let us define the following parameters. For each vertex <math>i \in V</math>,<br />
<br />
* <math>\pi_i</math>: is the set of parents of <math>i</math> <br />
** e.g. <math>\pi_R = \{C\}</math> (the parent of <math>R</math> is <math>C</math>) <br />
* <math>f_i(x_i, x_{\pi_i})</math>: is a function of <math>x_i</math> and <math>x_{\pi_i}</math> (behaving like a conditional p.m.f. of <math>x_i</math> given its parents) for which it is true that:<br />
** <math>f_i</math> is nonnegative for all <math>i</math><br />
** <math>\displaystyle\sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math><br />
<br />
'''Claim''': There is a family of probability functions <math> P(X_V) = \prod_{i=1}^n f_i(x_i, x_{\pi_i})</math> where this function is nonnegative, and<br />
<center><math><br />
\sum_{x_1}\sum_{x_2}\cdots\sum_{x_n} P(X_V) = 1<br />
</math></center><br />
<br />
To show the power of this claim we can verify it for our wet grass example:<br />
<center><math>\begin{matrix}<br />
P(X_V) &=& P(C,R,S,W) \\<br />
&=& f(C) f(R,C) f(S,C) f(W,S,R)<br />
\end{matrix}</math></center><br />
<br />
We want to show that<br />
<center><math>\begin{matrix}<br />
\sum_C\sum_R\sum_S\sum_W P(C,R,S,W) & = &\\<br />
\sum_C\sum_R\sum_S\sum_W f(C) f(R,C)<br />
f(S,C) f(W,S,R) <br />
& = & 1.<br />
\end{matrix}</math></center><br />
<br />
Consider factors <math>f(C)</math>, <math>f(R,C)</math>, <math>f(S,C)</math>: they do not depend on <math>W</math>, so we<br />
can write this all as<br />
<center><math>\begin{matrix}<br />
& & \sum_C\sum_R\sum_S f(C) f(R,C) f(S,C) \cancelto{1}{\sum_W f(W,S,R)} \\<br />
& = & \sum_C\sum_R f(C) f(R,C) \cancelto{1}{\sum_S f(S,C)} \\<br />
& = & \cancelto{1}{\sum_C f(C)} \cancelto{1}{\sum_R f(R,C)} \\<br />
& = & 1<br />
\end{matrix}</math></center><br />
<br />
since we had already set <math>\displaystyle \sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math>.<br />
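The telescoping sums above can also be checked numerically. The sketch below builds random tables <math>f_i</math> for the wet grass factorization (the probability values are arbitrary, generated at random, and not from the notes) and confirms that the product sums to 1:<br />

```python
import itertools
import random

random.seed(0)

def random_cpt(n_parents):
    """Random table f(x | parents) over binary variables: for each parent
    configuration, two nonnegative values summing to 1."""
    table = {}
    for pa in itertools.product([0, 1], repeat=n_parents):
        p = random.random()
        table[pa] = {0: p, 1: 1 - p}
    return table

fC = random_cpt(0)  # f(C)
fR = random_cpt(1)  # f(R, C): indexed by (c,)
fS = random_cpt(1)  # f(S, C): indexed by (c,)
fW = random_cpt(2)  # f(W, S, R): indexed by (s, r)

# Sum the product over all 2^4 joint configurations.
total = 0.0
for c, r, s, w in itertools.product([0, 1], repeat=4):
    total += fC[()][c] * fR[(c,)][r] * fS[(c,)][s] * fW[(s, r)][w]

print(round(total, 10))  # 1.0 (up to float rounding)
```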
<br />
Let us consider another example with a different directed graph. <br /><br />
'''Example:'''<br /><br />
Consider the simple directed graph in Figure 6.<br />
<br />
[[File:1234.png|thumb|right|Fig.6 Simple 4 node graph.]]<br />
<br />
Assume that we would like to calculate the following: <math> p(x_3|x_2) </math>. We know that we can write the joint probability as:<br />
<center><math> p(x_1,x_2,x_3,x_4) = f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \,\!</math></center><br />
<br />
We can also make use of Bayes' Rule here: <br />
<br />
<center><math>p(x_3|x_2) = \frac{p(x_2,x_3)}{ p(x_2)}</math></center><br />
<br />
<center><math>\begin{matrix}<br />
p(x_2,x_3) & = & \sum_{x_1} \sum_{x_4} p(x_1,x_2,x_3,x_4) ~~~~ \hbox{(marginalization)} \\<br />
& = & \sum_{x_1} \sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1) f(x_3,x_2) \cancelto{1}{\sum_{x_4}f(x_4,x_3)} \\<br />
& = & f(x_3,x_2) \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
We also need<br />
<center><math>\begin{matrix}<br />
p(x_2) & = & \sum_{x_1}\sum_{x_3}\sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2)<br />
f(x_4,x_3) \\<br />
& = & \sum_{x_1}\sum_{x_3} f(x_1) f(x_2,x_1) f(x_3,x_2) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
Thus,<br />
<center><math>\begin{matrix}<br />
p(x_3|x_2) & = & \frac{ f(x_3,x_2) \sum_{x_1} f(x_1)<br />
f(x_2,x_1)}{ \sum_{x_1} f(x_1) f(x_2,x_1)} \\<br />
& = & f(x_3,x_2).<br />
\end{matrix}</math></center><br />
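The same derivation can be mechanized by brute-force enumeration. The sketch below uses arbitrary random tables for the chain of Figure 6 and checks that marginalization reproduces the local factor <math>f(x_3,x_2)</math>, as derived above:<br />

```python
import itertools
import random

random.seed(1)

def cpt(n_parents):
    """Random conditional table over binary variables, as in the claim."""
    table = {}
    for pa in itertools.product([0, 1], repeat=n_parents):
        p = random.random()
        table[pa] = {0: p, 1: 1 - p}
    return table

f1, f2, f3, f4 = cpt(0), cpt(1), cpt(1), cpt(1)  # chain x1 -> x2 -> x3 -> x4

def joint(x1, x2, x3, x4):
    return f1[()][x1] * f2[(x1,)][x2] * f3[(x2,)][x3] * f4[(x3,)][x4]

def p_x3_given_x2(x3, x2):
    """p(x3 | x2) by brute-force marginalization and Bayes' rule."""
    num = sum(joint(a, x2, x3, d) for a in (0, 1) for d in (0, 1))
    den = sum(joint(a, x2, c, d)
              for a in (0, 1) for c in (0, 1) for d in (0, 1))
    return num / den

# Matches the local factor f(x3, x2) directly.
assert abs(p_x3_given_x2(1, 0) - f3[(0,)][1]) < 1e-12
```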
<br />
'''Theorem 1.'''<br />
<center><math>f_i(x_i,x_{\pi_i}) = p(x_i|x_{\pi_i}).\,\!</math></center><br />
<center><math> \therefore \ P(X_V) = \prod_{i=1}^n p(x_i|x_{\pi_i})\,\!</math></center><br />
<br />
In our simple graph, the joint probability can be written as <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1)p(x_2|x_1) p(x_3|x_2) p(x_4|x_3).\,\!</math></center><br />
<br />
Instead, had we used the chain rule we would have obtained a far more complex equation: <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1) p(x_2|x_1)p(x_3|x_2,x_1) p(x_4|x_3,x_2,x_1).\,\!</math></center><br />
<br />
The ''Markov Property'', or ''Memoryless Property'', holds when a variable <math>X_i</math> is affected only by <math>X_j</math>, so that the random variable <math>X_i</math> given <math>X_j</math> is independent of every other earlier random variable. In our example, the distribution of <math>x_4</math> is completely determined by <math>x_3</math>. <br /><br />
By simply applying the Markov Property to the chain-rule formula we would have obtained the same result.<br />
<br />
Now let us consider the joint probability of the following six-node example found in Figure 7.<br />
<br />
[[File:ClassicExample1.png|thumb|right|Fig.7 Six node example.]]<br />
<br />
If we use Theorem 1 it can be seen that the joint probability density function for Figure 7 can be written as follows: <br />
<center><math> P(X_1,X_2,X_3,X_4,X_5,X_6) = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) \,\!</math></center><br />
<br />
Once again, we can apply the Chain Rule and then the Markov Property and arrive at the same result.<br />
<br />
<center><math>\begin{matrix}<br />
&& P(X_1,X_2,X_3,X_4,X_5,X_6) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_2,X_1)P(X_4|X_3,X_2,X_1)P(X_5|X_4,X_3,X_2,X_1)P(X_6|X_5,X_4,X_3,X_2,X_1) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) <br />
\end{matrix}</math></center><br />
<br />
===Independence=== <br />
<br />
====Marginal independence====<br />
We can say that <math>X_A</math> is marginally independent of <math>X_B</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B : & & \\<br />
P(X_A,X_B) & = & P(X_A)P(X_B) \\<br />
P(X_A|X_B) & = & P(X_A) <br />
\end{matrix}</math></center><br />
<br />
====Conditional independence====<br />
We can say that <math>X_A</math> is conditionally independent of <math>X_B</math> given <math>X_C</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B | X_C : & & \\<br />
P(X_A,X_B | X_C) & = & P(X_A|X_C)P(X_B|X_C) \\<br />
P(X_A|X_B,X_C) & = & P(X_A|X_C) <br />
\end{matrix}</math></center><br />
Note: Both equations are equivalent.<br />
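Either characterization can be tested by exhaustive enumeration on a small joint distribution. The helper below is a sketch: the joint is represented as a dict over binary assignment tuples (a representation chosen here for illustration), and the product definition is checked for a pair of single variables:<br />

```python
import itertools

def is_cond_indep(p, i, j, given, tol=1e-9):
    """Check X_i ⊥ X_j | X_given for a joint distribution p, given as a
    dict mapping full binary assignment tuples to probabilities."""
    def marg(fix):
        # marginal probability of the partial assignment `fix`
        return sum(pr for x, pr in p.items()
                   if all(x[k] == v for k, v in fix.items()))

    for ctx in itertools.product([0, 1], repeat=len(given)):
        fix_c = dict(zip(given, ctx))
        pc = marg(fix_c)
        if pc == 0:
            continue
        for a in (0, 1):
            for b in (0, 1):
                pab = marg({**fix_c, i: a, j: b})
                pa = marg({**fix_c, i: a})
                pb = marg({**fix_c, j: b})
                # P(A,B|C) should equal P(A|C) P(B|C)
                if abs(pab / pc - (pa / pc) * (pb / pc)) > tol:
                    return False
    return True

# A three-variable chain X1 -> X2 -> X3 with "sticky" transitions:
flip = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
p = {}
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    p[(x1, x2, x3)] = 0.5 * flip[x1][x2] * flip[x2][x3]

assert is_cond_indep(p, 0, 2, (1,))    # X1 ⊥ X3 | X2 holds
assert not is_cond_indep(p, 0, 2, ())  # X1, X3 not marginally independent
```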
'''Aside:''' Before we move on further, we first define the following terms:<br />
# I is defined as an ordering for the nodes in graph C.<br />
# For each <math>i \in V</math>, <math>V_i</math> is defined as a set of all nodes that appear earlier than i excluding its parents <math>\pi_i</math>.<br />
<br />
Let us consider the example of the six node figure given above (Figure 7). We can define <math>I</math> as follows: <br />
<center><math>I = \{1,2,3,4,5,6\} \,\!</math></center><br />
We can then easily compute <math>V_i</math> for say <math>i=3,6</math>. <br /><br />
<center><math> V_3 = \{2\}, V_6 = \{1,3,4\}\,\!</math></center> <br />
while <math>\pi_i</math> for <math> i=3,6</math> will be <br /><br />
<center><math> \pi_3 = \{1\}, \pi_6 = \{2,5\}\,\!</math></center> <br />
<br />
We would be interested in finding the conditional independences between random variables in this graph. We know <math>X_i \perp X_{V_i} | X_{\pi_i}</math> for each <math>i</math>. In other words, given its parents, a node is independent of all earlier non-parent nodes. So:<br /><br />
<math>X_1 \perp \phi | \phi</math>, <br /><br />
<math>X_2 \perp \phi | X_1</math>, <br /><br />
<math>X_3 \perp X_2 | X_1</math>, <br /><br />
<math>X_4 \perp \{X_1,X_3\} | X_2</math>, <br /><br />
<math>X_5 \perp \{X_1,X_2,X_4\} | X_3</math>, <br /><br />
<math>X_6 \perp \{X_1,X_3,X_4\} | \{X_2,X_5\}</math> <br /><br />
To illustrate why this is true we can take a simple example. Show that:<br />
<center><math>P(X_4|X_1,X_2,X_3) = P(X_4|X_2)\,\!</math></center><br />
<br />
Proof: first, we know <br />
<math>P(X_1,X_2,X_3,X_4,X_5,X_6)<br />
= P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2)\,\!</math><br />
<br />
then<br />
<center><math>\begin{matrix}<br />
P(X_4|X_1,X_2,X_3) & = & \frac{P(X_1,X_2,X_3,X_4)}{P(X_1,X_2,X_3)}\\<br />
& = & \frac{ \sum_{X_5} \sum_{X_6} P(X_1,X_2,X_3,X_4,X_5,X_6)}{ \sum_{X_4} \sum_{X_5} \sum_{X_6}P(X_1,X_2,X_3,X_4,X_5,X_6)}\\<br />
& = & \frac{P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)}{P(X_1)P(X_2|X_1)P(X_3|X_1)}\\<br />
& = & P(X_4|X_2)<br />
\end{matrix}</math></center><br />
<br />
The other conditional independences can be proven through a similar process.<br />
<br />
====Sampling====<br />
Even though graphical models greatly facilitate working with the joint probability, exact inference is not always feasible. Exact inference is feasible only in small to medium-sized networks; in large networks it takes far too long. Therefore, we resort to approximate inference techniques, which are much faster and usually give quite good results.<br />
<br />
In sampling, random samples are generated, and the values of interest are computed from those samples rather than from the original distribution.<br />
<br />
As input you have a Bayesian network with a set of nodes <math>X\,\!</math>. The sample taken may include all variables (except the evidence E) or a subset. Sampling schemes dictate how to generate samples (tuples). Ideally, samples are distributed according to <math>P(X|E)\,\!</math><br />
<br />
Some sampling algorithms:<br />
* Forward Sampling<br />
* Likelihood weighting<br />
* Gibbs Sampling (MCMC)<br />
** Blocking<br />
** Rao-Blackwellised<br />
* Importance Sampling<br />
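As a sketch of the simplest of these schemes, forward (ancestral) sampling draws each node given its already-sampled parents in topological order; combined with rejection of samples that contradict the evidence, it approximates <math>P(X|E)</math>. The wet-grass network is reused here with illustrative conditional probabilities that are not from the notes:<br />

```python
import random

random.seed(0)

def sample_wet_grass():
    """Forward sampling: draw C, then R and S given C, then W given R, S.
    All conditional probabilities are made-up illustrative numbers."""
    c = random.random() < 0.5                           # P(C=1)
    r = random.random() < (0.8 if c else 0.1)           # P(R=1 | C)
    s = random.random() < (0.1 if c else 0.5)           # P(S=1 | C)
    w = random.random() < (0.95 if (r or s) else 0.05)  # P(W=1 | R, S)
    return c, r, s, w

# Estimate P(R=1 | W=1) by rejection: keep only samples consistent
# with the evidence W=1, then average the indicator of R=1.
samples = [sample_wet_grass() for _ in range(100_000)]
kept = [smp for smp in samples if smp[3]]
est = sum(smp[1] for smp in kept) / len(kept)
print(round(est, 2))
```

Rejection wastes the discarded samples; likelihood weighting and Gibbs sampling, listed above, avoid that waste.<br />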
<br />
==Bayes Ball== <br />
The Bayes Ball algorithm can be used to determine if two random variables represented in a graph are independent. The algorithm can show either that two nodes in a graph are independent or that they are not necessarily independent; it cannot show that two nodes are dependent. In other words, it provides rules that let us settle this question using the graph alone, without needing the probability distributions. The algorithm will be discussed further in later parts of this section. <br />
<br />
===Canonical Graphs===<br />
In order to understand the Bayes Ball algorithm we need to first introduce 3 canonical graphs. Since our graphs are acyclic, we can represent them using these 3 canonical graphs. <br />
<br />
====Markov Chain (also called serial connection)====<br />
In the following graph (Figure 8), X is independent of Z given Y. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Markov.png|thumb|right|Fig.8 Markov chain.]]<br />
<br />
We can prove this independence: <br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\ <br />
& = & \frac{P(X)P(Y|X)P(Z|Y)}{P(X)P(Y|X)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
Where<br />
<br />
<center><math>\begin{matrix}<br />
P(X,Y) & = & \displaystyle \sum_Z P(X,Y,Z) \\<br />
& = & \displaystyle \sum_Z P(X)P(Y|X)P(Z|Y) \\<br />
& = & P(X)P(Y | X) \displaystyle \sum_Z P(Z|Y) \\<br />
& = & P(X)P(Y | X)\\<br />
\end{matrix}</math></center><br />
<br />
Markov chains are an important class of distributions, with applications in communications, information theory and image processing. They are suitable for modelling memory in a phenomenon. For example, suppose we want to study the frequency of appearance of English letters in a text. Most likely when "q" appears, the next letter will be "u"; this shows a dependency between these letters. Markov chains are suitable for modelling this kind of relation. <br />
[[File:Markovexample.png|thumb|right|Fig.8a Example of a Markov chain.]]<br />
Markov chains play a significant role in biological applications. They are widely used in the study of carcinogenesis (the initiation of cancer formation): a gene has to undergo several mutations before it becomes cancerous, and this sequence of mutations can be modelled as a Markov chain. An example is given in Figure 8a, which shows only two gene mutations.<br />
<br />
====Hidden Cause (diverging connection)====<br />
In the Hidden Cause case we can say that X is independent of Z given Y. In this case Y is the hidden cause and if it is known then Z and X are considered independent. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Hidden.png|thumb|right|Fig.9 Hidden cause graph.]]<br />
<br />
The proof of the independence: <br />
<br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\<br />
& = & \frac{P(Y)P(X|Y)P(Z|Y)}{P(Y)P(X|Y)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
The Hidden Cause case is best illustrated with an example: <br /><br />
<br />
[[File:plot44.png|thumb|right|Fig.10 Hidden cause example.]]<br />
<br />
In Figure 10 it can be seen that both "Shoe size" and "Grey hair" depend on the age of a person. The variables "Shoe size" and "Grey hair" are dependent in some sense if "Age" is not in the picture: without the age information we must conclude that those with a large shoe size also have a greater chance of having grey hair. However, when "Age" is observed, there is no dependence between "Shoe size" and "Grey hair" because we can deduce both based only on the "Age" variable.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Finally, we look at the third type of canonical graph:<br />
''Explaining-Away Graphs''. This type of graph arises when a<br />
phenomenon has multiple explanations. Here, the conditional<br />
independence statement is actually a statement of marginal<br />
independence: <math>X \perp Z</math>. This type of graph is also called a "V-structure" or "V-shape" because of its illustration (Fig. 11). <br />
<br />
[[File:ExplainingAway.png|thumb|right|Fig.11 The missing edge between node X and node Z implies that<br />
there is a marginal independence between the two: <math>X \perp Z</math>.]]<br />
<br />
In these types of scenarios, variables X and Z are independent.<br />
However, once the third variable Y is observed, X and Z become<br />
dependent (Fig. 11).<br />
<br />
To clarify these concepts, suppose Bob and Mary are supposed to<br />
meet for a noontime lunch. Consider the following events:<br />
<br />
<center><math><br />
late =\begin{cases}<br />
1, & \hbox{if Mary is late}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
aliens =\begin{cases}<br />
1, & \hbox{if aliens kidnapped Mary}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
watch =\begin{cases}<br />
1, & \hbox{if Bob's watch is incorrect}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
If Mary is late, then she could have been kidnapped by aliens.<br />
Alternatively, Bob may have forgotten to adjust his watch for<br />
daylight savings time, making him early. Clearly, both of these<br />
events are independent. Now, consider the following<br />
probabilities:<br />
<br />
<center><math>\begin{matrix}<br />
P( late = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1, watch = 0 )<br />
\end{matrix}</math></center><br />
<br />
We expect <math>P( aliens = 1 ) < P( aliens = 1 ~|~ late = 1 )</math>, since learning that Mary is late raises the probability that she was kidnapped. Similarly, we expect <math>P( aliens = 1 ~|~ late = 1 ) < P( aliens = 1 ~|~ late = 1, watch = 0 )</math>, since learning that Bob's watch is correct rules out the alternative explanation for the lateness. Since<br />
<math>P( aliens = 1 ~|~ late = 1 ) \neq P( aliens = 1 ~|~ late = 1, watch = 0 )</math>, ''aliens'' and<br />
''watch'' are not independent given ''late''. To summarize,<br />
* If we do not observe ''late'', then ''aliens'' <math>~\perp~ watch</math> (<math>X~\perp~ Z</math>)<br />
* If we do observe ''late'', then ''aliens'' <math> ~\cancel{\perp}~ watch ~|~ late</math> (<math>X ~\cancel{\perp}~ Z ~|~ Y</math>)<br />
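These inequalities can be verified with made-up numbers. All probabilities below are hypothetical, chosen only so that either cause makes lateness likely:<br />

```python
# Hypothetical prior probabilities for the two independent causes.
p_aliens = {1: 0.01, 0: 0.99}
p_watch = {1: 0.1, 0: 0.9}

def p_late(a, w):
    """P(late=1 | aliens, watch): either cause makes lateness likely."""
    return 0.9 if (a or w) else 0.05

def joint(a, w, l):
    pl = p_late(a, w)
    return p_aliens[a] * p_watch[w] * (pl if l else 1 - pl)

def p(aliens=None, watch=None, late=None):
    """Marginal probability of a partial assignment by enumeration."""
    return sum(joint(a, w, l)
               for a in (0, 1) for w in (0, 1) for l in (0, 1)
               if aliens in (None, a) and watch in (None, w)
               and late in (None, l))

p_a = p(aliens=1)                           # prior: 0.01
p_a_late = p(aliens=1, late=1) / p(late=1)  # rises once Mary is late
p_a_late_w0 = (p(aliens=1, late=1, watch=0)
               / p(late=1, watch=0))        # rises further: watch ruled out
assert p_a < p_a_late < p_a_late_w0
```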
<br />
===Bayes Ball Algorithm===<br />
<br />
'''Goal:''' We wish to determine whether a given conditional<br />
statement such as <math>X_{A} ~\perp~ X_{B} ~|~ X_{C}</math> is true given a directed graph.<br />
<br />
The algorithm is as follows:<br />
<br />
# Shade nodes, <math>~X_{C}~</math>, that are conditioned on, i.e. they have been observed.<br />
# Assuming that the initial position of the ball is <math>~X_{A}~</math>: <br />
# If the ball cannot reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> must be conditionally independent.<br />
# If the ball can reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> are not necessarily independent.<br />
<br />
The biggest challenge in the ''Bayes Ball Algorithm'' is to<br />
determine what happens to a ball going from node X to node Z as it<br />
passes through node Y. The ball could continue its route to Z or<br />
it could be blocked. It is important to note that the balls are<br />
allowed to travel in any direction, independent of the direction<br />
of the edges in the graph.<br />
<br />
We use the canonical graphs previously studied to determine the<br />
route of a ball traveling through a graph. Using these three<br />
graphs, we establish the Bayes ball rules which can be extended for more<br />
graphical models.<br />
<br />
====Markov Chain (serial connection)====<br />
[[File:BB_Markov.png|thumb|right|Fig.12 (a) When the middle node is shaded, the ball is blocked. (b) When the middle ball is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling from X to Z or from Z to X will be blocked at<br />
node Y if this node is shaded. Alternatively, if Y is unshaded,<br />
the ball will pass through.<br />
<br />
In (Fig. 12(a)), X and Z are conditionally<br />
independent ( <math>X ~\perp~ Z ~|~ Y</math> ) while in<br />
(Fig.12(b)) X and Z are not necessarily<br />
independent.<br />
<br />
====Hidden Cause (diverging connection)====<br />
[[File:BB_Hidden.png|thumb|right|Fig.13 (a) When the middle node is shaded, the ball is blocked. (b) When the middle ball is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling through Y will be blocked at Y if it is shaded.<br />
If Y is unshaded, then the ball passes through.<br />
<br />
(Fig. 13(a)) demonstrates that X and Z are<br />
conditionally independent when Y is shaded.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Unlike the last two cases, in which the Bayes ball rule was intuitively understandable, in this case a ball traveling through Y is blocked when Y is ''unshaded''. If Y is<br />
shaded, then the ball passes through. Hence, X and Z are<br />
independent when Y is unshaded.<br />
<br />
[[File:BB_ExplainingAway.png|thumb|right|Fig.14 (a) When the middle node is shaded, the ball passes through Y. (b) When the middle ball is unshaded, the ball is blocked.]]<br />
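Taken together, the three travel rules amount to the d-separation criterion, and they can be coded directly. The sketch below brute-forces all simple undirected paths (fine for the small graphs in these notes) and applies the chain/fork/collider blocking rules, including the standard refinement that an observed descendant also unblocks a collider:<br />

```python
def d_separated(parents, x, y, observed):
    """Bayes-ball / d-separation check on a DAG given as {node: set of
    parents}, with `observed` a set of shaded nodes. A path is blocked at
    an observed chain or fork node, and at a collider that is unobserved
    and has no observed descendant."""
    children = {n: set() for n in parents}
    for n, ps in parents.items():
        for q in ps:
            children[q].add(n)

    def descendants(n):
        out, stack = set(), [n]
        while stack:
            for c in children[stack.pop()]:
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def paths(cur, path):
        # enumerate simple undirected paths from x to y
        if cur == y:
            yield path
            return
        for nxt in parents[cur] | children[cur]:
            if nxt not in path:
                yield from paths(nxt, path + [nxt])

    for path in paths(x, [x]):
        blocked = False
        for a, m, b in zip(path, path[1:], path[2:]):
            if a in parents[m] and b in parents[m]:  # collider (explaining-away)
                if m not in observed and not (descendants(m) & observed):
                    blocked = True
                    break
            elif m in observed:                      # chain or fork
                blocked = True
                break
        if not blocked:
            return False
    return True

# The six-node graph of Fig. 7: parents of each node.
g = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: {3}, 6: {2, 5}}
assert d_separated(g, 1, 6, {2, 3})      # X1 ⊥ X6 | {X2, X3}
assert not d_separated(g, 2, 3, {1, 6})  # shading the collider X6 opens a path
```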
<br />
===Bayes Ball Examples===<br />
====Example 1==== <br />
In this first example, we wish to identify the behavior of leaves in graphical models using two-node graphs. Let a ball be<br />
going from X to Y in a two-node graph. To employ the Bayes ball method mentioned above, we have to implicitly add one extra node to the two-node structure, since the Bayes ball rules were introduced for three-node configurations. We add the third node exactly symmetric to node X with respect to node Y. For example, in (Fig. 15)(a) we can think of a hidden node to the right of node Y, with a hidden arrow from the hidden node to Y. Then we are able to apply the Bayes ball method, and we find that a ball thrown from X is blocked at Y. On the contrary, following the same rule in (Fig. 15)(b), it turns out that if there were a hidden node to the right of Y, a ball could pass from X to that hidden node according to the explaining-away structure. Of course, there is no real node, and in this case we conventionally say that the ball is bounced back to node X. <br />
<br />
[[File:TwoNodesExample.png|thumb|right|Fig.15 (a)The ball is blocked at Y. (b)The ball passes through Y. (c)The ball passes through Y. (d) The ball is blocked at Y.]]<br />
<br />
Finally, for the last two graphs, we used the rules of the ''Hidden Cause Canonical Graph'' (Fig. 13). In (c), the ball passes through<br />
Y while in (d), the ball is blocked at Y.<br />
<br />
====Example 2====<br />
Suppose your home is equipped with an alarm system. There are two<br />
possible causes for the alarm to ring:<br />
* Your house is being burglarized<br />
* There is an earthquake<br />
<br />
Hence, we define the following events:<br />
<br />
<center><math><br />
burglary =\begin{cases}<br />
1, & \hbox{if your house is being burglarized}, \\<br />
0, & \hbox{if your house is not being burglarized}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
earthquake =\begin{cases}<br />
1, & \hbox{if there is an earthquake}, \\<br />
0, & \hbox{if there is no earthquake}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
alarm =\begin{cases}<br />
1, & \hbox{if your alarm is ringing}, \\<br />
0, & \hbox{if your alarm is off}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
report =\begin{cases}<br />
1, & \hbox{if a police report has been written}, \\<br />
0, & \hbox{if no police report has been written}.<br />
\end{cases}<br />
</math></center><br />
<br />
<br />
The ''burglary'' and ''earthquake'' events are independent<br />
if the alarm does not ring. However, if the alarm does ring, then<br />
the ''burglary'' and the ''earthquake'' events are not<br />
necessarily independent. Also, if the alarm rings then it is<br />
more probable that a police report will be issued.<br />
<br />
We can use the ''Bayes Ball Algorithm'' to deduce conditional<br />
independence properties from the graph. First, consider<br />
(Fig. 16(a)) and assume we are trying to determine<br />
whether there is conditional independence between the<br />
''burglary'' and ''earthquake'' events. In (Fig. 16(a)), a ball starting at the ''burglary''<br />
event is blocked at the ''alarm'' node.<br />
<br />
[[File:AlarmExample1.PNG|thumb|right|Fig.16 If we only consider the events ''burglary'', ''earthquake'', and ''alarm'', we find that a ball traveling from ''burglary'' to ''earthquake'' would be blocked at the ''alarm'' node. However, if we also consider the ''report''<br />
node, we can find a path between ''burglary'' and ''earthquake''.]]<br />
<br />
Nonetheless, this does not prove that the ''burglary'' and<br />
''earthquake'' events are independent. Indeed,<br />
(Fig. 16(b)) disproves this as we have found an<br />
alternate path from ''burglary'' to ''earthquake'' passing<br />
through ''report''. It follows that <math>burglary ~\cancel{\amalg}~ earthquake ~|~ report</math>.<br />
<br />
====Example 3====<br />
<br />
Referring to figure (Fig. 17), we wish to determine<br />
whether the following conditional independence statements are true:<br />
<br />
<center><math>\begin{matrix}<br />
X_{1} ~\amalg~ X_{3} ~|~ X_{2} \\<br />
X_{1} ~\amalg~ X_{5} ~|~ \{X_{3},X_{4}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:LineExample1.png|thumb|right|Fig.17 Simple Markov Chain graph.]]<br />
<br />
To determine if the first statement is true, we shade node <math>X_{2}</math>. This blocks balls traveling from<br />
<math>X_{1}</math> to <math>X_{3}</math> and proves that the first statement is valid.<br />
<br />
After shading nodes <math>X_{3}</math> and <math>X_{4}</math> and applying the ''Bayes Ball Algorithm'', we find that a ball travelling from <math>X_{1}</math> to <math>X_{5}</math> is blocked at <math>X_{3}</math>. Similarly, a ball going from <math>X_{5}</math> to <math>X_{1}</math> is blocked at <math>X_{4}</math>. This proves that the second statement also holds.<br />
<br />
====Example 4====<br />
[[File:ClassicExample1.png|thumb|right|Fig.18 Directed graph.]]<br />
<br />
Consider figure (Fig. 18). Using the ''Bayes Ball Algorithm'' we wish to determine if each of the following<br />
statements are valid:<br />
<br />
<center><math>\begin{matrix}<br />
X_{4} ~\amalg~ \{X_{1},X_{3}\} ~|~ X_{2} \\<br />
X_{1} ~\amalg~ X_{6} ~|~ \{X_{2},X_{3}\} \\<br />
X_{2} ~\amalg~ X_{3} ~|~ \{X_{1},X_{6}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:ClassicExample2.PNG|thumb|right|Fig.19 (a) A ball cannot pass through <math>X_{2}</math> or <math>X_{6}</math>. (b) A ball cannot pass through <math>X_{2}</math> or <math>X_{3}</math>. (c) A ball can pass from <math>X_{2}</math> to <math>X_{3}</math>.]]<br />
<br />
To disprove the first statement, we would have to find a path from <math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math> when <math>X_{2}</math> is shaded (refer to Fig. 19(a)). Since there is no route from<br />
<math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math>, we conclude that the first statement is<br />
true.<br />
<br />
Similarly, we can show that there does not exist a path between<br />
<math>X_{1}</math> and <math>X_{6}</math> when <math>X_{2}</math> and <math>X_{3}</math> are shaded (refer to<br />
Fig. 19(b)). Hence, the second statement is true.<br />
<br />
Finally, (Fig. 19(c)) shows that there is a<br />
route from <math>X_{2}</math> to <math>X_{3}</math> when <math>X_{1}</math> and <math>X_{6}</math> are shaded.<br />
This proves that the third statement is false.<br />
<br />
'''Theorem 2.'''<br /><br />
Let <math>D_{1} = \{ p(x_{v}) : p(x_{v}) = \prod_{i=1}^{n}{p(x_{i} ~|~ x_{\pi_{i}})}\}</math> be the family of distributions that factorize as a product of local conditional probabilities over a directed graph.<br /><br />
Let <math>D_{2} = \{ p(x_{v}) : p(x_{v})</math> satisfies all conditional independence statements associated with the graph<math>\}</math>.<br /><br />
Then <math>D_{1} = D_{2}</math>.<br />
<br />
====Example 5====<br />
<br />
Given the Bayesian network of Fig. 18, determine whether the following statements are true or false:<br />
<br />
a.) <math>x_4 \perp \{x_1,x_3\} ~|~ x_2</math> <br />
<br />
Ans. True<br />
<br />
b.) <math>x_1 \perp x_6 ~|~ \{x_2,x_3\}</math> <br />
<br />
Ans. True<br />
<br />
c.) <math>x_2 \perp x_3 ~|~ \{x_1,x_6\}</math> <br />
<br />
Ans. False<br />
<br />
== Undirected Graphical Model ==<br />
<br />
Generally, graphical models are divided into two major classes: directed graphs and undirected graphs. Directed graphs and their characteristics were described previously. In this section we discuss undirected graphical models, which are also known as Markov random fields. In some applications there are relations between variables, but these relations are bilateral and do not express causality. For example, consider a natural image. In natural images the value of a pixel is correlated with neighbouring pixel values, but this relation is bilateral, not causal. Markov random fields are suitable for modelling such processes and have found applications in fields such as vision and image processing. We can define an undirected graphical model with a graph <math> G = (V, E)</math> where <math> V </math> is a set of vertices corresponding to a set of random variables and <math> E </math> is a set of undirected edges as shown in (Fig.20).<br />
<br />
==== Conditional independence ====<br />
<br />
For directed graphs, the Bayes ball method was defined to determine the conditional independence properties of a given graph. We can also employ the Bayes ball algorithm to examine the conditional independence of undirected graphs. Here the Bayes ball rule is simpler and more intuitive. Considering (Fig.21), a ball can be thrown either from x to z or from z to x if y is not observed. In other words, if y is not observed a ball thrown from x can reach z and vice versa. On the contrary, given a shaded y, the node can block the ball and make x and z conditionally independent. With this definition one can declare that in an undirected graph, a node is conditionally independent of its non-neighbours given its neighbours. Technically speaking, <math>X_A</math> is independent of <math>X_C</math> given <math>X_B</math> if the set of nodes <math>X_B</math> separates the nodes <math>X_A</math> from the nodes <math>X_C</math>. Hence, if every path from a node in <math>X_A</math> to a node in <math>X_C</math> includes at least one node in <math>X_B</math>, then we claim that <math> X_A \perp X_C | X_B </math>.<br />
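The separation test above can be sketched as a simple graph search: a hypothetical <code>separated</code> helper (the name is illustrative) does a breadth-first search from <math>X_A</math> that is not allowed to pass through the observed set <math>X_B</math>, and reports independence if it never reaches <math>X_C</math>. A minimal sketch, assuming the graph is given as adjacency sets:

```python
from collections import deque

def separated(adj, A, C, B):
    """Check X_A independent of X_C given X_B in an undirected graph:
    every path from A to C must pass through the observed set B."""
    seen = set(B)                      # observed nodes block the ball
    queue = deque(a for a in A if a not in B)
    seen |= set(queue)
    while queue:
        node = queue.popleft()
        if node in C:
            return False               # found an unblocked path into C
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return True

# Chain x - y - z: x and z are separated once y is observed.
adj = {'x': {'y'}, 'y': {'x', 'z'}, 'z': {'y'}}
print(separated(adj, {'x'}, {'z'}, {'y'}))   # True
print(separated(adj, {'x'}, {'z'}, set()))   # False
```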
<br />
==== Question ====<br />
<br />
Is it possible to convert undirected models to directed models or vice versa?<br />
<br />
In order to answer this question, consider (Fig.22 ) which illustrates an undirected graph with four nodes - <math>X</math>, <math>Y</math>,<math>Z</math> and <math>W</math>. We can define two facts using Bayes ball method:<br />
<br />
<center><math>\begin{matrix}<br />
X \perp Y | \{W,Z\} & & \\<br />
W \perp Z | \{X,Y\} \\<br />
\end{matrix}</math></center><br />
<br />
[[File:UnDirGraphUnconvert.png|thumb|right|Fig.22 There is no directed equivalent to this graph.]]<br />
<br />
It is simple to see that there is no directed graph satisfying both conditional independence properties. Recalling that directed graphs are acyclic, converting this undirected graph to a directed graph results in at least one node whose arrows are both inward-pointing (a v-structure). Without loss of generality we can assume that node <math>Z</math> has two inward-pointing arrows. By the conditional independence semantics of directed graphs, we have <math> X \perp Y|W</math>, yet the <math>X \perp Y|\{W,Z\}</math> property does not hold. On the other hand, (Fig.23) depicts a directed graph which is characterized by the singleton independence statement <math>X \perp Y </math>. There is no undirected graph on three nodes which can be characterized by this singleton statement. Basically, if we consider the set of all distributions over <math>n</math> random variables, one subset can be represented by directed graphical models and another subset by undirected graphical models. There is a narrow intersection region between these two subsets in which a probabilistic graphical model may be represented by either a directed or an undirected graph.<br />
<br />
[[File:DirGraphUnconvert.png|thumb|right|Fig.23 There is no undirected equivalent to this graph.]]<br />
<br />
==== Parameterization ====<br />
<br />
Having undirected graphical models, we would like to obtain a "local" parameterization like the one we used for directed graphical models. For directed graphical models, "local" had the interpretation of a node and its parents, <math> \{i, \pi_i\} </math>. The joint probability and the marginals are defined as a product of such local probabilities, which was inspired by the chain rule of probability theory.<br />
In undirected graphical models, "local" functions cannot be represented using conditional probabilities, and we must abandon conditional probabilities altogether. Therefore, the factors no longer have a probabilistic interpretation, and we may choose the "local" functions arbitrarily. However, any "local" function for undirected graphical models should satisfy the following condition:<br />
- Consider <math> X_i </math> and <math> X_j </math> that are not linked; they are conditionally independent given all other nodes. As a result, the "local" functions should factorize the joint probability such that <math> X_i </math> and <math> X_j </math> are placed in different factors.<br />
<br />
It can be shown that defining local functions based only on a node and its corresponding edges (similar to directed graphical models) is not tractable, and we need to follow a different approach. Before defining the "local" functions, we have to introduce a new term from graph theory: the clique. A clique is <br />
a subset of fully connected nodes in a graph G: every node in the clique C is directly connected to every other node in C. In addition, a maximal clique is a clique such that if any other node from the graph G is added to it, the new set is no longer a clique. Considering the undirected graph shown in (Fig. 24), we can list all the cliques as follows:<br />
[[File:graph.png|thumb|right|Fig.24 Undirected graph]]<br />
<br />
- <math> \{X_1, X_3\} </math> <br />
- <math> \{X_1, X_2\} </math><br />
- <math> \{X_3, X_5\} </math><br />
- <math> \{X_2, X_4\} </math> <br />
- <math> \{X_5, X_6\} </math><br />
- <math> \{X_2, X_5\} </math><br />
- <math> \{X_2, X_5, X_6\} </math><br />
<br />
According to the definition, <math> \{X_2,X_5\} </math> is not a maximal clique since we can add one more node, <math> X_6 </math>, and still have a clique. Let C be the set of all maximal cliques in <math> G(V, E) </math>: <br />
<br />
<center><math><br />
C = \{c_1, c_2,..., c_n\}<br />
</math></center><br />
<br />
where in the aforementioned example <math> c_1 </math> would be <math> \{X_1, X_3\} </math>, and so on. We define the joint probability over all nodes as:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \psi_{c_i} (x_{c_i})<br />
</math></center><br />
<br />
where <math> \psi_{c_i} (x_{c_i})</math> is an arbitrary function with some restrictions. This function is not necessarily a probability and is defined over each clique. There are only two restrictions on this function: it must be non-negative and real-valued. Usually <math> \psi_{c_i} (x_{c_i})</math> is called a potential function. <math> Z </math> is a normalization factor determined by:<br />
<br />
<center><math><br />
Z = \sum_{X_V} { \prod_{c_i \in C} \psi_{c_i} (x_{c_i})}<br />
</math></center> <br />
<br />
As a matter of fact, the normalization factor <math> Z </math> is often unimportant since most of the time it cancels out during computation. For instance, to calculate the conditional probability <math> P(X_A | X_B) </math>, <math> Z </math> cancels between the numerator <math> P(X_A, X_B) </math> and the denominator <math> P(X_B) </math>.<br />
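As a sanity check of this parameterization, one can enumerate a small graph by brute force: with arbitrary non-negative clique potentials, dividing by <math>Z</math> always yields a valid distribution. The sketch below assumes binary variables and the maximal cliques of Fig.24; the names <code>psi</code> and <code>unnorm</code> are illustrative only:

```python
import itertools, random
random.seed(0)

# Maximal cliques of the graph in Fig.24 (nodes 1..6, binary states).
cliques = [(1, 3), (1, 2), (3, 5), (2, 4), (2, 5, 6)]
# Arbitrary non-negative potential psi_c, one table per clique.
psi = {c: {s: random.random() for s in itertools.product((0, 1), repeat=len(c))}
       for c in cliques}

def unnorm(x):
    """Product of clique potentials for a full configuration x[1..6]."""
    p = 1.0
    for c in cliques:
        p *= psi[c][tuple(x[i] for i in c)]
    return p

configs = [dict(zip(range(1, 7), s)) for s in itertools.product((0, 1), repeat=6)]
Z = sum(unnorm(x) for x in configs)        # normalization constant
probs = [unnorm(x) / Z for x in configs]
print(abs(sum(probs) - 1.0) < 1e-9)        # True: a valid distribution
```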
<br />
As was mentioned above, the product of the potential functions determines the joint probability over all nodes. Because the potential functions are arbitrary, assuming exponential functions for <math> \psi_{c_i} (x_{c_i})</math> simplifies and reduces the computations. Let the potential function be:<br />
<br />
<br />
<center><math><br />
\psi_{c_i} (x_{c_i}) = \exp (- H_{c_i}(x_{c_i}))<br />
</math></center> <br />
<br />
the joint probability is given by:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \exp(-H_{c_i}(x_{c_i})) = \frac{1}{Z} \exp (- \sum_{c_i \in C} {H_{c_i} (x_{c_i})})<br />
</math></center> <br />
<br />
There is a lot of information contained in the joint probability distribution <math> P(x_{V}) </math>. We define six tasks, listed below, that we would like to accomplish with various algorithms for a given distribution <math> P(x_{V}) </math>.<br />
<br />
===Tasks:===<br />
<br />
* Marginalization <br /><br />
Given <math> P(x_{V}) </math> find <math> P(x_{A}) </math> where A &sub; V<br /><br />
Given <math> P(x_1, x_2, ... , x_6) </math> find <math> P(x_2, x_6) </math> <br />
* Conditioning <br /><br />
Given <math> P(x_V) </math> find <math>P(x_A|x_B) = \frac{P(x_A, x_B)}{P(x_B)}</math> if A &sub; V and B &sub; V .<br />
* Evaluation <br /><br />
Evaluate the probability for a certain configuration. <br />
* Completion <br /><br />
Compute the most probable configuration. In other words, determine which <math> P(x_A|x_B) </math> is the largest for a specific combination of <math> A </math> and <math> B </math>.<br />
* Simulation <br /><br />
Generate a random configuration for <math> P(x_V) </math> .<br />
* Learning <br /><br />
We would like to find parameters for <math> P(x_V) </math> .<br />
<br />
===Exact Algorithms:===<br />
<br />
To compute the probabilistic inference or the conditional probability of a variable <math>X</math> we need to marginalize over all the other random variables <math>X_i</math> and their possible values, which may take a long running time. To reduce the computational complexity of performing such marginalization, the next sections present different exact algorithms that find exact solutions in polynomial time:<br />
* Elimination<br />
* Sum-Product<br />
* Max-Product<br />
* Junction Tree<br />
<br />
= Elimination Algorithm=<br />
In this section we will see how we could overcome the problem of probabilistic inference on graphical models. In other words, we discuss the problem of computing conditional and marginal probabilities in graphical models.<br />
<br />
== Elimination Algorithm on Directed Graphs==<br />
First we assume that E and F are disjoint subsets of the node indices of a graphical model, i.e. <math> X_E </math> and <math> X_F </math> are disjoint subsets of the random variables. Given a graph G =(V,''E''), we aim to calculate <math> p(x_F | x_E) </math> where <math> X_E </math> and <math> X_F </math> represent evidence and query nodes, respectively. In this section <math> X_F </math> is restricted to a single node; later on a more powerful inference method will be introduced which is able to make inference on multiple variables. In order to compute <math> p(x_F | x_E) </math> we have to first marginalize the joint probability over the nodes which are neither in <math> F </math> nor in <math> E </math>, denoted by <math> R = V \setminus ( E \cup F)</math>. <br />
<br />
<center><math><br />
p(x_E, x_F) = \sum_{x_R} {p(x_E, x_F, x_R)}<br />
</math></center><br />
<br />
which can be further marginalized to yield <math> p(E) </math>:<br />
<br />
<center><math><br />
p(x_E) = \sum_{x_F} {p(x_E, x_F)}<br />
</math></center><br />
<br />
and then the desired conditional probability is given by:<br />
<br />
<center><math><br />
p(x_F|x_E) = \frac{p(x_E, x_F)}{p(x_E)} <br />
</math></center><br />
<br />
== Example ==<br />
<br />
Let us assume that we are interested in <math> p(x_1 | \bar{x_6}) </math> in (Fig. 21), where <math> \bar{x_6} </math> is an observed value of <math> X_6 </math>, and thus we may treat it as a constant. According to the rule mentioned above we have to marginalize the joint probability over the non-evidence, non-query nodes:<br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &\sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} p(x_1)p(x_2|x_1)p(x_3|x_1)p(x_4|x_2)p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) \sum_{x_5} p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) m_5(x_2, x_3)<br />
\end{matrix}</math></center><br />
<br />
where to simplify the notation we define <math> m_5(x_2, x_3) </math>, the result of the last summation. The last summation is over <math> x_5 </math>, and thus the result depends only on <math> x_2 </math> and <math> x_3</math>. In general, let <math> m_i(x_{S_i}) </math> denote the expression that arises from performing <math> \sum_{x_i} </math>, where <math> x_{S_i} </math> are the variables, other than <math> x_i </math>, that appear in the summand. Continuing the derivation we have: <br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1)m_5(x_2,x_3)\sum_{x_4} p(x_4|x_2)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)\sum_{x_3}p(x_3|x_1)m_5(x_2,x_3)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)m_3(x_1,x_2)\\<br />
& = & p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
<br />
Therefore, the conditional probability is given by:<br />
<center><math><br />
p(x_1|\bar{x_6}) = \frac{p(x_1)m_2(x_1)}{\sum_{x_1} p(x_1)m_2(x_1)}<br />
</math></center><br />
<br />
At the beginning of our computation we assumed that <math> X_6 </math> is observed, and thus the notation <math> \bar{x_6} </math> was used to express this fact. Letting <math> X_i </math> be an evidence node whose observed value is <math> \bar{x_i} </math>, we define an evidence potential function, <math> \delta(x_i, \bar{x_i}) </math>, whose value is one if <math> x_i = \bar{x_i} </math> and zero otherwise. <br />
This function allows us to use summation over <math> x_6 </math> yielding:<br />
<br />
<center><math><br />
m_6(x_2, x_5) = \sum_{x_6} p(x_6|x_2, x_5) \delta(x_6, \bar{x_6})<br />
</math></center><br />
<br />
We can define an algorithm to make inference on directed graphs using elimination techniques. <br />
Let E and F be an evidence set and a query node, respectively. We first choose an elimination ordering I such that F appears last in this ordering. The following figure shows the steps required to perform the elimination algorithm for probabilistic inference on directed graphs:<br />
<br />
<br />
<code><br />
ELIMINATE (G,E,F)<br/><br />
INITIALIZE (G,F)<br/><br />
EVIDENCE(E)<br/><br />
UPDATE(G)<br/><br />
<br />
NORMALIZE(F)<br/><br />
<br />
INITIALIZE(G,F)<br/><br />
Choose an ordering <math>I</math> such that <math>F</math> appear last <br/><br />
:'''For''' each node <math>X_i</math> in <math>V</math> <br/><br />
::Place <math>p(x_i|x_{\pi_i})</math> on the active list <br/><br />
<br />
:'''End'''<br/><br />
<br />
EVIDENCE(E)<br/><br />
:'''For''' each <math>i</math> in <math>E</math> <br/><br />
::Place <math>\delta(x_i,\overline{x_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
UPDATE(G)<br/><br />
:''' For''' each <math>i</math> in <math>I</math> <br/><br />
::Find all potentials from the active list that reference <math>x_i</math> and remove them from the active list <br/><br />
::Let <math>\phi_i(x_{T_i})</math> denote the product of these potentials <br/><br />
::Let <math>m_i(x_{S_i})=\sum_{x_i}\phi_i(x_{T_i})</math> <br/><br />
::Place <math>m_i(x_{S_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
NORMALIZE(F) <br/><br />
:<math> p(x_F|\overline{x_E})</math> &larr; <math>\phi_F(x_F)/\sum_{x_F}\phi_F(x_F)</math><br/><br />
<br />
</code><br />
<br />
'''Example:''' <br /><br />
For the graph in figure 21, <math>G =(V,E)</math>. Consider once again that node <math>x_1</math> is the query node and <math>x_6</math> is the evidence node. <br /><br />
<math>I = \left\{6,5,4,3,2,1\right\}</math> (1 should be the last node; the ordering is crucial)<br /><br />
[[File:ClassicExample1.png|thumb|right|Fig.21 Six node example.]]<br />
We must now create an active list. There are two rules that must be followed in order to create this list. <br />
<br />
# For i<math>\in{V}</math> place <math>p(x_i|x_{\pi_i})</math> in active list. <br />
# For i<math>\in</math>{E} place <math>\delta(x_i,\overline{x_i})</math> in active list. <br />
<br />
Here, our active list is:<br />
<math> p(x_1), p(x_2|x_1), p(x_3|x_1), p(x_4|x_2), p(x_5|x_3),\underbrace{p(x_6|x_2, x_5)\delta{(x_6,\overline{x_6})}}_{\phi_6(x_2,x_5, x_6),\ \sum_{x_6}{\phi_6}=m_{6}(x_2,x_5) }</math><br />
<br />
We first eliminate node <math>X_6</math>. We place <math>m_{6}(x_2,x_5)</math> on the active list, having removed <math>X_6</math>. We now eliminate <math>X_5</math>. <br />
<br />
<center><math> \underbrace{p(x_5|x_3)*m_6(x_2,x_5)}_{m_5(x_2,x_3)} </math></center><br />
<br />
Likewise, we can eliminate <math>X_4</math>, <math>X_3</math>, and <math>X_2</math> (which yields the unnormalized conditional probability <math>p(x_1|\overline{x_6})</math>), and finally <math>X_1</math>. The last elimination yields <math>m_1 = \sum_{x_1}{\phi_1(x_1)}</math>, which is the normalization factor <math>p(\overline{x_6})</math>.<br />
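The worked elimination above can be reproduced numerically. The sketch below assumes binary variables and randomly generated conditional probability tables for the graph of figure 21, computes the messages in the elimination order <math>I = \{6,5,4,3,2\}</math>, and checks the answer against brute-force summation of the full joint:

```python
import numpy as np
rng = np.random.default_rng(0)

def cpt(*shape):
    """Random conditional table, normalized over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Binary CPTs; indices follow p(child | parents).
p1   = cpt(2)        # p(x1)
p21  = cpt(2, 2)     # p21[x1, x2] = p(x2 | x1)
p31  = cpt(2, 2)     # p(x3 | x1)
p42  = cpt(2, 2)     # p(x4 | x2)
p53  = cpt(2, 2)     # p(x5 | x3)
p625 = cpt(2, 2, 2)  # p625[x2, x5, x6] = p(x6 | x2, x5)

x6 = 1  # observed evidence

# Eliminate in the order I = {6, 5, 4, 3, 2}.
m6 = p625[:, :, x6]                        # m6(x2, x5): clamp the evidence
m5 = np.einsum('ce,be->bc', p53, m6)       # m5(x2, x3): sum over x5
m4 = p42.sum(axis=1)                       # m4(x2) (all ones: barren node)
m3 = np.einsum('ac,bc->ab', p31, m5)       # m3(x1, x2): sum over x3
m2 = np.einsum('ab,b,ab->a', p21, m4, m3)  # m2(x1): sum over x2
cond = p1 * m2 / (p1 * m2).sum()           # p(x1 | x6 = 1)

# Sanity check against brute-force summation of the full joint.
joint = np.einsum('a,ab,ac,bd,ce,bef->abcdef', p1, p21, p31, p42, p53, p625)
brute = joint[..., x6].sum(axis=(1, 2, 3, 4))
print(np.allclose(cond, brute / brute.sum()))  # True
```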
<br />
==Elimination Algorithm on Undirected Graphs==<br />
<br />
[[File:graph.png|thumb|right|Fig.22 Undirected graph G']]<br />
<br />
The first task is to find the maximal cliques and their associated potential functions. <br /><br />
maximal cliques: <math>\left\{x_1, x_2\right\}</math>, <math>\left\{x_1, x_3\right\}</math>, <math>\left\{x_2, x_4\right\}</math>, <math>\left\{x_3, x_5\right\}</math>, <math>\left\{x_2,x_5,x_6\right\}</math> <br /><br />
potential functions: <math>\varphi{(x_1,x_2)},\varphi{(x_1,x_3)},\varphi{(x_2,x_4)}, \varphi{(x_3,x_5)}</math> and <math>\varphi{(x_2,x_5,x_6)}</math> <br />
<br />
<math> p(x_1|\overline{x_6})=p(x_1,\overline{x_6})/p(\overline{x_6})\cdots\cdots\cdots\cdots\cdots(*) </math><br />
<br />
<math>p(x_1,\overline{x_6})=\frac{1}{Z}\sum_{x_2,x_3,x_4,x_5,x_6}\varphi{(x_1,x_2)}\varphi{(x_1,x_3)}\varphi{(x_2,x_4)}\varphi{(x_3,x_5)}\varphi{(x_2,x_5,x_6)}\delta{(x_6,\overline{x_6})}<br />
</math><br />
<br />
The <math>\frac{1}{Z}</math> looks crucial, but in fact it has no effect because in (*) both the numerator and the denominator contain the <math>\frac{1}{Z}</math> term, so it cancels. <br /><br />
The general rule for elimination in an undirected graph is that we can remove a node as long as we connect all of the neighbours of that node together. Effectively, we form a clique out of the neighbours of that node.<br />
The algorithm used to eliminate nodes in an undirected graph is:<br />
<br />
<br />
<code><br />
<br/><br />
<br />
UndirectedGraphElimination(G,I)<br />
:For each node <math>X_i</math> in <math>I</math><br />
::Connect all of the remaining neighbours of <math>X_i</math><br />
::Remove <math>X_i</math> from the graph <br />
:End <br />
<br />
<br/><br />
</code><br />
<br />
<br />
'''Example: ''' <br /><br />
For the graph G in figure 24 <br /><br />
when we remove x1, G becomes as in figure 25 <br /><br />
while if we remove x2, G becomes as in figure 26<br />
<br />
[[File:ex.png|thumb|right|Fig.24 ]]<br />
[[File:ex2.png|thumb|right|Fig.25 ]]<br />
[[File:ex3.png|thumb|right|Fig.26 ]]<br />
<br />
An interesting point is that the order of elimination matters a great deal. Consider the two results above: removing one node reduces the graph complexity only slightly, while removing another node increases it significantly. The reason we care about the complexity of the graph is that it determines the number of calculations required to answer questions about that graph. If we had a huge graph with thousands of nodes, the node removal order would be key to the complexity of the algorithm. Unfortunately, there is no efficient algorithm that can produce the optimal node removal order such that the elimination algorithm runs quickly. For example, if we remove one of the leaves first, the largest clique has size two and the computational complexity is of order <math>N^2</math>; removing the centre node first gives a largest clique of size five and complexity of order <math>N^5</math>. Finding an optimal ordering is NP-hard.<br />
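The effect of the ordering can be demonstrated on a small star graph (one centre connected to four leaves), mirroring the <math>N^2</math> versus <math>N^5</math> discussion above. The hypothetical <code>eliminate</code> helper (the name is illustrative) connects the remaining neighbours of each removed node and tracks the largest clique created:

```python
def eliminate(adj, order):
    """Eliminate nodes in the given order, connecting the remaining
    neighbours of each node; return the largest clique size created."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    largest = 0
    for v in order:
        nbrs = adj.pop(v)
        largest = max(largest, len(nbrs) + 1)    # clique = v plus its neighbours
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}                 # fill-in edges
    return largest

# A star: centre c connected to four leaves.
star = {'c': {'1', '2', '3', '4'},
        '1': {'c'}, '2': {'c'}, '3': {'c'}, '4': {'c'}}
print(eliminate(star, ['1', '2', '3', '4', 'c']))  # 2: leaves first keeps cliques small
print(eliminate(star, ['c', '1', '2', '3', '4']))  # 5: removing the centre first
```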
<br />
==Moralization==<br />
So far we have shown how to use elimination to successively remove nodes from an undirected graph. We know that this is useful in the process of marginalization. We can now turn to the question of what will happen when we have a directed graph. It would be nice if we could somehow reduce the directed graph to an undirected form and then apply the previous elimination algorithm. This reduction is called moralization and the graph that is produced is called a moral graph. <br />
<br />
To moralize a graph we first need to connect the parents of each node together. This makes sense intuitively because the parents of a node need to be considered together in the undirected graph and this is only done if they form a type of clique. By connecting them together we create this clique. <br />
<br />
After the parents are connected together we can just drop the orientation on the edges in the directed graph. By removing the directions we force the graph to become undirected. <br />
<br />
The previous elimination algorithm can now be applied to the new moral graph. We can do this by treating the conditional probabilities of the directed graph, <math> P(x_i|x_{\pi_i}) </math>, as the potential functions <math> \psi_{c_i}(x_{c_i}) </math> of the undirected graph.<br />
<br />
'''Example:'''<br /><br />
I = <math>\left\{x_6,x_5,x_4,x_3,x_2,x_1\right\}</math><br /><br />
When we moralize the directed graph in figure 27, we obtain the<br />
undirected graph in figure 28.<br />
<br />
[[File:moral.png|thumb|right|Fig.27 Original Directed Graph]]<br />
[[File:moral3.png|thumb|right|Fig.28 Moral Undirected Graph]]<br />
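Moralization itself is mechanical: marry the parents of every node, then drop the edge directions. A minimal sketch, assuming the DAG is given as a child-to-parents map (the <code>moralize</code> name is illustrative):

```python
from itertools import combinations

def moralize(parents):
    """Turn a DAG (child -> list of parents) into its moral graph:
    marry the parents of each node, then drop edge directions."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:                        # undirected child-parent edges
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(ps, 2):    # connect co-parents
            adj[a].add(b)
            adj[b].add(a)
    return adj

# The graph of figure 21: x6 has parents x2 and x5, which get married.
parents = {2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}
moral = moralize(parents)
print(5 in moral[2])   # True: edge (x2, x5) added by moralization
```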
<br />
=Elimination Algorithm on Trees=<br />
<br />
<br />
'''Definition of a tree:'''<br /><br />
A tree is an undirected graph in which any two vertices are connected by exactly one simple path. In other words, any connected graph without cycles is a tree.<br />
<br />
If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree.<br />
<br />
==Belief Propagation Algorithm (Sum Product Algorithm)==<br />
<br />
One of the main disadvantages of the elimination algorithm is that the ordering of the nodes determines the number of calculations required to produce a result. The optimal ordering is difficult to find, and without a decent ordering the algorithm may become very slow. In response to this we can introduce the sum-product algorithm. It has one major advantage over the elimination algorithm: it is faster. The sum-product algorithm has the same complexity when it computes the probability of one node as when it computes the probabilities of all the nodes in the graph. Unfortunately, the sum-product algorithm also has one disadvantage: unlike the elimination algorithm, it cannot be used on arbitrary graphs. The sum-product algorithm works only on trees. <br />
<br />
For undirected graphs if there is only one path between any two pair of nodes then that graph is a tree (Fig.29). If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree (Fig.30). <br />
<br />
[[File:UnDirTree.png|thumb|right|Fig.29 Undirected tree]]<br />
[[File:Dir_Tree.png|thumb|right|Fig.30 Directed tree]]<br />
<br />
For the undirected graph <math>G(V, E)</math> (Fig.29) we can write the joint probability distribution function in the following way.<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\prod_{i \in V}\psi(x_i)\prod_{(i,j) \in E}\psi(x_i, x_j)</math></center><br />
<br />
We know that in general we can not convert a directed graph into an undirected graph. There is however an exception to this rule when it comes to trees. In the case of a directed tree there is an algorithm that allows us to convert it to an undirected tree with the same properties. <br /><br />
Take the above example (Fig.30) of a directed tree. We can write the joint probability distribution function as: <br />
<center><math> P(x_v) = P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
If we want to convert this graph to the undirected form shown in (Fig.29) then we can use the following set of rules.<br />
* If <math>\gamma</math> is the root then: <math> \psi(x_\gamma) = P(x_\gamma) </math>.<br />
* If <math>\gamma</math> is NOT the root then: <math> \psi(x_\gamma) = 1 </math>.<br />
* If <math>\left\lbrace i \right\rbrace</math> = <math>\pi_j</math> then: <math> \psi(x_i, x_j) = P(x_j | x_i) </math>.<br />
So now we can rewrite the above equation for (Fig.30) as:<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\psi(x_1)...\psi(x_5)\psi(x_1, x_2)\psi(x_1, x_3)\psi(x_2, x_4)\psi(x_2, x_5) </math></center><br />
<center><math> = \frac{1}{Z(\psi)}P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
<br />
==Elimination Algorithm on a Tree==<br />
<br />
[[File:fig1.png|thumb|right|Fig.31 Message-passing in Elimination Algorithm]]<br />
<br />
We will derive the Sum-Product algorithm from the point of view<br />
of the Eliminate algorithm. To marginalize <math>x_1</math> in<br />
Fig.31,<br />
<center><math>\begin{matrix}<br />
p(x_1)&=&\sum_{x_2}\sum_{x_3}\sum_{x_4}\sum_{x_5}p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2)p(x_5|x_3) \\<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\sum_{x_3}p(x_3|x_2)\sum_{x_4}p(x_4|x_2)\underbrace{\sum_{x_5}p(x_5|x_3)} \\<br />
<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\underbrace{\sum_{x_3}p(x_3|x_2)m_5(x_3)}\underbrace{\sum_{x_4}p(x_4|x_2)} \\<br />
<br />
&=&p(x_1)\underbrace{\sum_{x_2}m_3(x_2)m_4(x_2)} \\<br />
<br />
&=&p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
where,<br />
<center><math>\begin{matrix}<br />
m_5(x_3)=\sum_{x_5}p(x_5|x_3)=\sum_{x_5}\psi(x_5)\psi(x_5,x_3)=\mathbf{m_{53}(x_3)} \\<br />
m_4(x_2)=\sum_{x_4}p(x_4|x_2)=\sum_{x_4}\psi(x_4)\psi(x_4,x_2)=\mathbf{m_{42}(x_2)} \\<br />
m_3(x_2)=\sum_{x_3}p(x_3|x_2)m_5(x_3)=\sum_{x_3}\psi(x_3)\psi(x_3,x_2)m_5(x_3)=\mathbf{m_{32}(x_2)}, \end{matrix}</math></center><br />
which is essentially (potential of the node)<math>\times</math>(potential of<br />
the edge)<math>\times</math>(message from the child).<br />
<br />
The term "<math>m_{ji}(x_i)</math>" represents the intermediate factor between the eliminated variable, ''j'', and the remaining neighbor of the variable, ''i''. Thus, in the above case, we will use <math>m_{53}(x_3)</math> to denote <math>m_5(x_3)</math>, <math>m_{42}(x_2)</math> to denote<br />
<math>m_4(x_2)</math>, and <math>m_{32}(x_2)</math> to denote <math>m_3(x_2)</math>. We refer to the<br />
intermediate factor <math>m_{ji}(x_i)</math> as a "message" that ''j''<br />
sends to ''i'' (Fig.31).<br />
<br />
In general,<center><math>\begin{matrix}<br />
m_{ji}(x_i)=\sum_{x_j}(<br />
\psi(x_j)\psi(x_j,x_i)\prod_{k\in{\mathcal{N}(j)/ i}}m_{kj}(x_j))<br />
\end{matrix}</math></center><br />
<br />
Note: It is important to know that the BP algorithm gives us the exact solution only if the graph is a tree; however, experiments have shown that BP gives acceptable approximate answers even when the graph has some loops.<br />
<br />
==Elimination To Sum Product Algorithm==<br />
<br />
[[File:fig2.png|thumb|right|Fig.32 All of the messages needed to compute all singleton<br />
marginals]]<br />
<br />
The Sum-Product algorithm allows us to compute all<br />
marginals in the tree by passing messages inward from the leaves of<br />
the tree to an (arbitrary) root, and then passing it outward from the<br />
root to the leaves, again using the above equation at each step. The net effect is<br />
that a single message will flow in both directions along each edge.<br />
(See Fig.32) Once all such messages have been computed using the above equation,<br />
we can compute desired marginals. One of the major advantages of this algorithm is that<br />
messages can be reused which reduces the computational cost heavily.<br />
<br />
As shown in Fig.32, to compute the marginal of <math>X_1</math> using<br />
elimination, we eliminate <math>X_5</math>, which involves computing a message<br />
<math>m_{53}(x_3)</math>, then eliminate <math>X_4</math> and <math>X_3</math> which involves<br />
messages <math>m_{32}(x_2)</math> and <math>m_{42}(x_2)</math>. We subsequently eliminate<br />
<math>X_2</math>, which creates a message <math>m_{21}(x_1)</math>.<br />
<br />
Suppose that we want to compute the marginal of <math>X_2</math>. As shown in<br />
Fig.33, we first eliminate <math>X_5</math>, which creates <math>m_{53}(x_3)</math>, and<br />
then eliminate <math>X_3</math>, <math>X_4</math>, and <math>X_1</math>, passing messages<br />
<math>m_{32}(x_2)</math>, <math>m_{42}(x_2)</math> and <math>m_{12}(x_2)</math> to <math>X_2</math>.<br />
<br />
[[File:fig3.png|thumb|right|Fig.33 The messages formed when computing the marginal of <math>X_2</math>]]<br />
<br />
Since the messages can be "reused", marginals over all possible<br />
elimination orderings can be computed by computing all possible<br />
messages which is small in numbers compared to the number of<br />
possible elimination orderings.<br />
<br />
The Sum-Product algorithm is not only based on the above equation, but also ''Message-Passing Protocol''.<br />
'''Message-Passing Protocol''' tells us that a node can<br />
send a message to a neighboring node when (and only when) it has<br />
received messages from all of its other neighbors.<br />
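The message equation together with the ''Message-Passing Protocol'' can be sketched compactly: memoizing the recursive message computation means every message is computed exactly once and then reused for all marginals. The sketch below assumes binary variables and random node and edge potentials on the tree of Fig.31; the potential tables are invented for illustration:

```python
import numpy as np
from functools import lru_cache
rng = np.random.default_rng(1)

# Tree of Fig.31: edges 1-2, 2-3, 2-4, 3-5; binary nodes.
edges = {(1, 2), (2, 3), (2, 4), (3, 5)}
nbrs = {i: {j for a, b in edges for j in (a, b) if i in (a, b) and j != i}
        for i in range(1, 6)}
node_pot = {i: rng.random(2) for i in range(1, 6)}
edge_pot = {e: rng.random((2, 2)) for e in edges}

def pair_pot(j, i):
    """psi(x_j, x_i) as a [x_j, x_i] table, whichever way the edge is stored."""
    return edge_pot[(j, i)] if (j, i) in edge_pot else edge_pot[(i, j)].T

@lru_cache(maxsize=None)
def msg(j, i):
    """m_{ji}(x_i): j has already received messages from all neighbours but i."""
    prod = node_pot[j].copy()
    for k in nbrs[j] - {i}:
        prod *= msg(k, j)
    return pair_pot(j, i).T @ prod          # sum over x_j

def marginal(i):
    p = node_pot[i].copy()
    for j in nbrs[i]:
        p *= msg(j, i)
    return p / p.sum()

print(marginal(1), marginal(2))   # every node's marginal from reused messages
```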
<br />
<br />
<br />
===For Directed Graph===<br />
Previously we stated that:<br />
<center><math><br />
p(x_F,\bar{x}_E)=\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E),<br />
</math></center><br />
<br />
Using the above equation, we find the marginal of <math>\bar{x}_E</math>.<br />
<center><math>\begin{matrix}<br />
p(\bar{x}_E)&=&\sum_{x_F}\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E) \\<br />
&=&\sum_{x_v}p(x_F,x_E)\delta (x_E,\bar{x}_E)<br />
\end{matrix}</math></center><br />
<br />
Now we denote:<br />
<center><math><br />
p^E(x_v) = p(x_v) \delta (x_E,\bar{x}_E)<br />
</math></center><br />
<br />
Since the sets ''F'' and ''E'' add up to <math>\mathcal{V}</math>,<br />
<math>p(x_v)</math> is equal to <math>p(x_F,x_E)</math>. Thus we can substitute this<br />
definition into the two equations above, which become:<br />
<center><math>\begin{matrix}<br />
p(x_F,\bar{x}_E) = \sum_{x_E} p^E(x_v), \\<br />
p(\bar{x}_E) = \sum_{x_v}p^E(x_v)<br />
\end{matrix}</math></center><br />
<br />
We are interested in finding the conditional probability. We<br />
substitute the previous two results into the conditional<br />
probability equation.<br />
<br />
<center><math>\begin{matrix}<br />
p(x_F|\bar{x}_E)&=&\frac{p(x_F,\bar{x}_E)}{p(\bar{x}_E)} \\<br />
&=&\frac{\sum_{x_E}p^E(x_v)}{\sum_{x_v}p^E(x_v)}<br />
\end{matrix}</math></center><br />
<math>p^E(x_v)</math> is an unnormalized version of conditional probability,<br />
<math>p(x_F|\bar{x}_E)</math>. <br />
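A tiny numeric sketch (the two-variable joint table is made up) shows how multiplying by the evidence potential <math>\delta(x_E,\bar{x}_E)</math> and normalizing recovers the conditional <math>p(x_F|\bar{x}_E)</math>:<br />

```python
import numpy as np

# Joint p(x_F, x_E) for two binary variables (illustrative numbers).
p = np.array([[0.3, 0.1],
              [0.2, 0.4]])        # rows: x_F, cols: x_E

x_E_bar = 1                       # the observed evidence value
delta = np.zeros(2)
delta[x_E_bar] = 1.0

pE = p * delta                    # p^E(x) = p(x) * delta(x_E, x_E_bar)
num = pE.sum(axis=1)              # sum over x_E: p(x_F, x_E_bar)
den = pE.sum()                    # sum over all x: p(x_E_bar)
cond = num / den                  # p(x_F | x_E_bar)
print(cond)                       # same as p[:, 1] / p[:, 1].sum()
```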
<br />
===For Undirected Graphs===<br />
<br />
We denote <math>\psi^E</math> to be:<br />
<center><math>\begin{matrix}<br />
\psi^E(x_i) = \psi(x_i)\delta(x_i,\bar{x}_i),& & \text{if } i\in E \\<br />
\psi^E(x_i) = \psi(x_i),& & \text{otherwise}<br />
\end{matrix}</math></center><br />
<br />
==Max-Product==<br />
Because multiplication distributes over max as well as over sum (for non-negative factors):<br />
<br />
<center><math><br />
\max(ab,ac) = a\max(b,c), \qquad a \geq 0<br />
</math></center><br />
<br />
Formally, both (sum, product) and (max, product) form commutative semirings, which is why the same message-passing machinery applies to both.<br />
<br />
We would like to find the maximum probability that can be achieved over the configurations of a set of random variables. The algorithm is similar to sum-product, except that we replace the sum with a max. <br /><br />
<br />
[[File:suks.png|thumb|right|Fig.33 Max Product Example]]<br />
<br />
<center><math>\begin{matrix}<br />
\max_{x}{P(x)} & = & \max_{x_1}\max_{x_2}\max_{x_3}\max_{x_4}\max_{x_5}{P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_2)P(x_5|x_3)} \\<br />
& = & \max_{x_1}{P(x_1)}\max_{x_2}{P(x_2|x_1)}\max_{x_3}{P(x_3|x_2)}\max_{x_4}{P(x_4|x_2)}\max_{x_5}{P(x_5|x_3)}<br />
\end{matrix}</math></center><br />
<br />
Analogously to the sum-product messages used to compute <math>p(x_F|\bar{x}_E)</math>, the max-product messages are defined as follows:<br />
<br />
<center><math>m_{ji}(x_i)=\sum_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}(x_j)}</math></center><br />
<center><math>m^{max}_{ji}(x_i)=\max_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}(x_j)}</math></center><br />
<br />
<br />
'''Example:''' <br />
Consider the graph in Figure.33. <br />
<center><math> m^{max}_{53}(x_3)=\max_{x_5}{\psi^{E}{(x_5)}\psi{(x_3,x_5)}} </math></center><br />
<center><math> m^{max}_{32}(x_2)=\max_{x_3}{\psi^{E}{(x_3)}\psi{(x_2,x_3)}m^{max}_{53}(x_3)} </math></center><br />
<br />
==Maximum configuration==<br />
We would also like to find the value of the <math>x_i</math>s which produces the largest value for the given expression. To do this we replace the max from the previous section with argmax. <br /><br />
<math>m^{arg}_{53}(x_3)= \operatorname{argmax}_{x_5}\psi^{E}{(x_5)}\psi{(x_5,x_3)}</math><br /><br />
<math>\log{m^{max}_{ji}(x_i)}=\max_{x_j}\left[\log{\psi^{E}{(x_j)}}+\log{\psi{(x_i,x_j)}}+\sum_{k\in{N(j)\backslash{i}}}\log{m^{max}_{kj}{(x_j)}}\right]</math><br /><br />
In many cases we want to work with the log of this expression, because the products of many probabilities become very small and would otherwise underflow numerically. Also, it is important to note that this works in the continuous case as well, where we replace the summation sign with an integral.<br />
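The max and argmax recursions above can be sketched for a small chain <math>x_1 \rightarrow x_2 \rightarrow x_3</math>. The CPT values below are made up for illustration; working in log space, the dynamic-programming answer is checked against brute force.<br />

```python
import itertools
import numpy as np

# Chain x1 -> x2 -> x3, all binary; illustrative CPTs.
p1 = np.array([0.6, 0.4])
p21 = np.array([[0.7, 0.3], [0.2, 0.8]])   # p(x2|x1): rows index x1
p32 = np.array([[0.9, 0.1], [0.4, 0.6]])   # p(x3|x2): rows index x2

logp1, logp21, logp32 = np.log(p1), np.log(p21), np.log(p32)

# Backward max-messages in log space, keeping argmax pointers.
m32 = logp32.max(axis=1)           # max over x3 of log p(x3|x2), per x2
b3  = logp32.argmax(axis=1)        # best x3 for each x2
m21 = (logp21 + m32).max(axis=1)   # max over x2, per x1
b2  = (logp21 + m32).argmax(axis=1)

# Decode the maximizing configuration by following the pointers.
x1 = int((logp1 + m21).argmax())
x2 = int(b2[x1])
x3 = int(b3[x2])
best = (x1, x2, x3)

# Brute-force check over all 8 configurations.
def joint(x):
    return p1[x[0]] * p21[x[0], x[1]] * p32[x[1], x[2]]

brute = max(itertools.product(range(2), repeat=3), key=joint)
print(best, brute)   # the two configurations should match
```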
<br />
=Parameter Learning=<br />
<br />
The goal of a graphical model is to build a useful representation of the input data that we can use to understand the data and to design learning algorithms. A graphical model represents a joint probability distribution over its nodes (random variables). One of the most important features of a graphical model is that it represents the conditional independencies between the nodes. This is achieved through local functions that are combined into a factorization; the factorization, in turn, represents the joint probability distribution and hence the conditional independencies contained in that distribution. However, that does not mean the graphical model represents all the necessary independence assumptions. <br />
<br />
==Basic Statistical Problems==<br />
In statistics there are a number of different 'standard' problems that always appear in one form or another. They are as follows: <br />
<br />
* Regression<br />
* Classification<br />
* Clustering<br />
* Density Estimation<br />
<br />
<br />
<br />
===Regression===<br />
In regression we have a set of data points <math> (x_i, y_i) </math> for <math> i = 1...n </math> and we would like to determine the way that the variables x and y are related. In certain cases such as (Fig.34) we try to fit a line (or other type of function) through the points in such a way that it describes the relationship between the two variables. <br />
<br />
[[File:regression.png|thumb|right|Fig.34 Regression]]<br />
<br />
Once the relationship has been determined, we can evaluate the following expression; in this way we can determine the value (or distribution) of y if we have the value for x. <br />
<math>P(y|x)=\frac{P(y,x)}{P(x)} = \frac{P(y,x)}{\int_{y}{P(y,x)dy}}</math><br />
<br />
===Classification===<br />
In classification we also have a set of data points, each containing a set of features <math> (x_1, x_2,.. ,x_i) </math> for <math> i = 1...n </math>, and we would like to assign each data point to one of a given number of classes y. Consider the example in (Fig.35), where two sets of features have been divided into the sets + and - by a line. The purpose of classification is to find this line and then place any new points into one group or the other. <br />
<br />
[[File:Classification.png|thumb|right|Fig.35 Classify Points into Two Sets]]<br />
<br />
We would like to obtain the probability distribution given by the following equation, where c is the class and x and y are the coordinates of a data point. In simple terms, we would like to find the probability that a given point belongs to class c when we know that its observed values are x and y. <br />
<center><math> P(c|x,y)=\frac{P(c,x,y)}{P(x,y)} = \frac{P(c,x,y)}{\sum_{c}{P(c,x,y)}} </math></center><br />
<br />
===Clustering===<br />
Clustering is an unsupervised learning method that assigns data points to groups, or clusters, based on the similarity between the data points. Clustering is like classification, except that we do not know the groups before we gather and examine the data. We would like to find the probability distribution of the following equation without knowing the value of c. <br />
<center><math> P(c|x)=\frac{P(c,x)}{P(x)}\ \ c\ unknown </math></center><br />
<br />
===Density Estimation===<br />
Density Estimation is the problem of modeling a probability density function p(x), given a finite number of data points<br />
drawn from that density function. <br />
<center><math> P(y|x)=\frac{P(y,x)}{P(x)} \ \ x\ unknown </math></center><br />
<br />
We can use graphs to represent the four types of statistical problems that have been introduced so far. The first graph (Fig.36(a)) can be used to represent either the Regression or the Classification problem, because both the X and the Y variables are known. In the second graph (Fig.36(b)) the value of the Y variable is unknown, so we can tell that this graph represents the Clustering or Density Estimation situation. <br />
<br />
[[File:RegClass.png|thumb|right|Fig.36(a) Regression or classification (b) Clustering or Density Estimation]]<br />
<br />
<br />
==Likelihood Function==<br />
Recall that the probability model <math>p(x|\theta)</math> has the intuitive interpretation of assigning probability to X for each fixed value of <math>\theta</math>. In the Bayesian approach this intuition is formalized by treating <math>p(x|\theta)</math> as a conditional probability distribution. In the Frequentist approach, however, we treat <math>p(x|\theta)</math> as a function of <math>\theta</math> for fixed x, and refer to <math>p(x|\theta)</math> as the likelihood function.<br />
<center><math><br />
L(\theta;x)= p(x|\theta)</math></center><br />
where <math>L(\theta; x)</math> denotes the likelihood function, and<br />
<center><math><br />
l(\theta;x)=\log p(x|\theta)<br />
</math></center><br />
where <math>l(\theta; x)</math> is the log likelihood.<br />
<br />
Since <math>p(x)</math> in the denominator of Bayes Rule is independent of <math>\theta</math> we can consider it as a constant and we can draw the conclusion that:<br />
<br />
<center><math><br />
p(\theta|x) \propto p(x|\theta)p(\theta)<br />
</math></center><br />
<br />
Symbolically, we can interpret this as follows:<br />
<center><math><br />
Posterior \propto likelihood \times prior<br />
</math></center><br />
<br />
where we see that in the Bayesian approach the likelihood can be<br />
viewed as a data-dependent operator that transforms between the<br />
prior probability and the posterior probability.<br />
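As a quick numeric illustration of posterior <math>\propto</math> likelihood <math>\times</math> prior, consider a biased coin with data <math>D=\{h,h,h,t\}</math> and a discretized parameter grid; the grid and the flat prior are assumptions of this sketch.<br />

```python
import numpy as np

# Coin with unknown bias theta; data D = {h,h,h,t}: 3 heads, 1 tail.
theta = np.linspace(0.01, 0.99, 99)      # discretized parameter grid
likelihood = theta**3 * (1 - theta)**1   # p(D | theta)
prior = np.ones_like(theta)              # uniform prior (an assumption)

posterior = likelihood * prior           # proportional to p(theta | D)
posterior /= posterior.sum()             # normalize: divide by p(D)

print(theta[posterior.argmax()])         # posterior mode: 0.75
```

Under the flat prior the posterior mode coincides with the maximum likelihood estimate <math>3/4</math>, which anticipates the point made below about MAP reducing to ML.<br />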
<br />
<br />
===Maximum likelihood===<br />
The idea of maximum likelihood estimation is to find the optimal values of the parameters by maximizing a likelihood function formed from the training data. Suppose in particular that we force the Bayesian to choose a<br />
particular value of <math>\theta</math>; that is, to reduce the posterior<br />
distribution <math>p(\theta|x)</math> to a point estimate. Various<br />
possibilities present themselves; in particular one could choose the<br />
mean of the posterior distribution or perhaps the mode.<br />
<br />
<br />
(i) the mean of the posterior (expectation):<br />
<center><math><br />
\hat{\theta}_{Bayes}=\int \theta p(\theta|x)\,d\theta<br />
</math></center><br />
<br />
is called ''Bayes estimate''.<br />
<br />
OR<br />
<br />
(ii) the mode of posterior:<br />
<center><math>\begin{matrix}<br />
\hat{\theta}_{MAP}&=&argmax_{\theta} p(\theta|x) \\<br />
&=&argmax_{\theta}p(x|\theta)p(\theta)<br />
\end{matrix}</math></center><br />
<br />
Note that MAP stands for '''maximum a posteriori'''.<br />
<br />
When the prior probability <math>p(\theta)</math> is taken to be uniform on <math>\theta</math>, the MAP estimate reduces to the maximum likelihood estimate, <math>\hat{\theta}_{MAP} \rightarrow \hat{\theta}_{ML}</math>.<br />
<br />
<center><math> \hat{\theta}_{MAP} = argmax_{\theta}\, p(x|\theta) p(\theta) </math></center><br />
<br />
When the prior is not taken to be uniform, the MAP estimate is still the maximizer of the posterior; the fact that the logarithm is a monotonic function implies that taking logs does not alter the optimizing value.<br />
<br />
Thus, one has:<br />
<center><math><br />
\hat{\theta}_{MAP}=argmax_{\theta} \{ log p(x|\theta) + log<br />
p(\theta) \}<br />
</math></center><br />
as an alternative expression for the MAP estimate.<br />
<br />
Here, <math>log (p(x|\theta))</math> is log likelihood and the "penalty" is the<br />
additive term <math>log(p(\theta))</math>. Penalized log likelihoods are widely<br />
used in Frequentist statistics to improve on maximum likelihood<br />
estimates in small sample settings.<br />
<br />
===Example : Bernoulli trials===<br />
<br />
Consider the simple experiment where a biased coin is tossed four times. Suppose now that we also have some data <math>D</math>: <br />e.g. <math>D = \left\lbrace h,h,h,t\right\rbrace </math>. We want to use this data to estimate <math>\theta</math>. The probability of observing head is <math> p(H)= \theta</math> and the probability of observing a tail is <math> p(T)= 1-\theta</math>.<br />
where the conditional probability of a single toss <math>x_i \in \{0,1\}</math> (1 for heads) is <center><math> P(x_i|\theta) = \theta^{x_i}(1-\theta)^{(1-x_i)} </math></center><br />
<br />
We would now like to use the ML technique. Since all of the variables are i.i.d., there are no dependencies between the variables, and so we have no edges from one node to another.<br />
<br />
How do we find the joint probability distribution function for these variables? Since they are all independent, we can just multiply the marginal probabilities to get the joint probability. <br />
<center><math>L(\theta;x) = \prod_{i=1}^n P(x_i|\theta)</math></center><br />
This is in fact the likelihood that we want to work with. Now let us try to maximise it: <br />
<center><math>\begin{matrix}<br />
l(\theta;x) & = & log(\prod_{i=1}^n P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(\theta^{x_i}(1-\theta)^{1-x_i}) \\<br />
& = & \sum_{i=1}^n x_ilog(\theta) + \sum_{i=1}^n (1-x_i)log(1-\theta) \\<br />
\end{matrix}</math></center><br />
Take the derivative and set it to zero: <br />
<br />
<center><math> \frac{\partial l}{\partial\theta} = 0 </math></center><br />
<center><math> \frac{\partial l}{\partial\theta} = \sum_{i=1}^{n}\frac{x_i}{\theta} - \sum_{i=1}^{n}\frac{1-x_i}{1-\theta} = 0 </math></center><br />
<center><math> \Rightarrow \frac{\sum_{i=1}^{n}x_i}{\theta} = \frac{\sum_{i=1}^{n}(1-x_i)}{1-\theta} </math></center><br />
<center><math> \frac{NH}{\theta} = \frac{NT}{1-\theta} </math></center> <br />
Where: <br />
NH = the number of observed heads <br /><br />
NT = the number of observed tails <br /><br />
Hence, <math>NT + NH = n</math> <br /><br />
<br />
And now we can solve for <math>\theta</math>: <br />
<br />
<center><math>\begin{matrix}<br />
\theta & = & \frac{(1-\theta)NH}{NT} \\<br />
\theta + \theta\frac{NH}{NT} & = & \frac{NH}{NT} \\<br />
\theta(\frac{NT+NH}{NT}) & = & \frac{NH}{NT} \\<br />
\theta & = & \frac{\frac{NH}{NT}}{\frac{n}{NT}} = \frac{NH}{n}<br />
\end{matrix}</math></center><br />
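A two-line numeric check of the result <math>\hat{\theta}=\frac{NH}{n}</math> on the data <math>D=\{h,h,h,t\}</math>:<br />

```python
# Bernoulli MLE: theta_hat = NH / n, matching the derivation above.
data = [1, 1, 1, 0]          # D = {h, h, h, t}, heads coded as 1
NH = sum(data)               # number of observed heads
n = len(data)
theta_hat = NH / n
print(theta_hat)             # 0.75
```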
<br />
===Example : Multinomial trials===<br />
Recall from the previous example that a Bernoulli trial has only two outcomes (e.g. Head/Tail, Failure/Success,…). A Multinomial trial is a multivariate generalization of the Bernoulli trial with K possible outcomes, where K > 2. Let <math> p(k) = \theta_k </math> be the probability of outcome k. All the <math>\theta_k</math> parameters must satisfy:<br />
<br />
<math> 0 \leq \theta_k \leq 1</math><br />
<br />
and<br />
<br />
<math> \sum_k \theta_k = 1</math><br />
<br />
Consider the example of rolling a die M times and recording the number of times each of the six die's faces observed. Let <math> N_k </math> be the number of times that face k was observed.<br />
<br />
Let <math>[x^m = k]</math> be a binary indicator that equals one if <math>x^m = k</math> and zero otherwise. The likelihood function for the Multinomial distribution is:<br />
<br />
<math>l(\theta; D) = log( p(D|\theta) )</math><br />
<br />
<math>= log(\prod_m \theta_{x^m})</math><br />
<br />
<math>= log(\prod_m \theta_{1}^{[x^m = 1]} ... \theta_{k}^{[x^m = k]})</math><br />
<br />
<math>= \sum_k log(\theta_k) \sum_m [x^m = k]</math><br />
<br />
<math>= \sum_k N_k log(\theta_k)</math><br />
<br />
Take the derivative and set it to zero, enforcing the constraint <math> \sum_k \theta_k = 1</math> with a Lagrange multiplier (which turns out to equal M):<br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = 0</math><br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = \frac{N_k}{\theta_k} - M = 0</math><br />
<br />
<math>\Rightarrow \theta_k = \frac{N_k}{M}</math><br />
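The same estimate, <math>\theta_k = \frac{N_k}{M}</math>, on a handful of made-up die rolls:<br />

```python
from collections import Counter

# Multinomial MLE: theta_k = N_k / M. Illustrative die rolls.
rolls = [1, 3, 3, 6, 2, 3, 1, 5, 6, 6, 4, 3]   # M = 12 made-up outcomes
M = len(rolls)
counts = Counter(rolls)                         # N_k for each face k
theta_hat = {k: counts[k] / M for k in range(1, 7)}
print(theta_hat)   # e.g. theta_3 = 4/12
```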
<br />
<br />
===Example: Univariate Normal===<br />
Now let us assume that the observed values come from a normal distribution. <br /><br />
Our new model looks like:<br />
<center><math>P(x_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}} </math></center><br />
Now to find the likelihood we once again multiply the independent marginal probabilities to obtain the joint probability and the likelihood function. <br />
<center><math> L(\theta;x) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}}</math></center> <br />
<center><math> \max_{\theta}l(\theta;x) = \max_{\theta}\sum_{i=1}^{n}\left(-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^{2}+\log\frac{1}{\sqrt{2\pi}\sigma}\right) </math></center><br />
Now, since our parameter theta is in fact a set of two parameters, <br />
<center><math>\theta = (\mu, \sigma)</math></center><br />
we must estimate each of the parameters separately. <br />
<center><math>\frac{\partial l}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i</math></center><br />
<center><math>\frac{\partial l}{\partial \sigma^{2}} = \frac{1}{2\sigma ^4} \sum _{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2\sigma^2} = 0</math></center><br />
<center><math> \Rightarrow \hat{\sigma} ^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 </math></center><br />
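A short numeric check of both estimators; the five observations are made up, and note the <math>\frac{1}{n}</math> factor in the variance estimate, as derived above.<br />

```python
import numpy as np

# Gaussian MLE: mu_hat = sample mean, sigma2_hat = average squared
# deviation (the 1/n factor, not 1/(n-1), per the derivation above).
x = np.array([2.1, 1.9, 3.0, 2.4, 2.6])   # illustrative observations
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()
print(mu_hat, sigma2_hat)                  # 2.4 and 0.148
```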
<br />
==Discriminative vs Generative Models==<br />
[[File:GenerativeModel.png|thumb|right|Fig.36i Generative Model represented in a graph.]]<br />
(beginning of Oct. 18)<br />
<br />
If we call the evidence/features variable <math>X\,\!</math> and the output variable <math>Y\,\!</math>, one way to model a classifier is to base the definition of the joint distribution on <math>p(X|Y)\,\!</math>, and another is to base it on <math>p(Y|X)\,\!</math>. The first of these two approaches is called generative, while the second is called discriminative. The philosophy behind this naming becomes clear by looking at the way each conditional probability function tries to present a model. In practice, using generative models (e.g. the Bayes classifier) often requires assumptions that may not be valid for the nature of the problem, which can make the model depart from the primary intentions of its design. This is less of an issue for discriminative models (e.g. logistic regression), as they do not depend on many assumptions besides the given data.<br />
<br />
[[File:DiscriminativeModel.png|thumb|right|Fig.36ii Discriminative Model represented in a graph.]]<br />
<br />
Given <math>N</math> variables, we have a full joint distribution in a generative model. In this model we can identify the conditional independencies between the various random variables. This joint distribution can be factorized into various conditional distributions. One can also define the prior distributions that affect the variables.<br />
Figure 36i shows an example of a generative model for classification in terms of a directed graphical model. The following have to be estimated to fit the model: the class-conditional probability, i.e. <math>P(X|Y)</math>, and the prior probability <math>P(Y)</math>. Examples that use generative approaches are Hidden Markov models, Markov random fields, etc. <br />
<br />
The discriminative approach to classification is displayed in terms of a graph in Figure 36ii. In discriminative models the dependencies between the various random variables are not explicitly defined; instead, we directly estimate the conditional probability, i.e. <math>P(Y|X)</math>. Examples that use the discriminative approach are neural networks, logistic regression, etc.<br />
<br />
Sometimes it becomes very hard to model <math>P(X|Y)</math> when <math>X</math> is high-dimensional (like data from images). Hence, we tend to omit the intermediate step and estimate <math>P(Y|X)</math> directly. In higher dimensions, independence assumptions are often made so that the model does not overfit.<br />
<br />
==Markov Models==<br />
Markov models, introduced by Andrey (Andrei) Andreyevich Markov as a way of modeling Russian poetry, are known as a good way of modeling those processes which progress over time or space. Basically, a Markov model can be formulated as follows:<br />
<br />
<center><math><br />
y_t=f(y_{t-1},y_{t-2},\ldots,y_{t-k})<br />
</math></center><br />
<br />
which can be interpreted as the dependence of the current state of a variable on its last <math>k</math> states. (Fig. XX)<br />
<br />
The Maximum Entropy Markov model is a type of Markov model which makes the current state of a variable dependent on some global variables, besides the local dependencies. As an example, we can define the sequence of words in a context as a local variable, as the appearance of each word depends mostly on the words that have come before it (n-grams). However, the role of POS (part-of-speech) tags cannot be denied, as they clearly affect the sequence of words. In this example, the POS tags are global dependencies, whereas the preceding words are local ones.<br />
===Markov Chain===<br />
The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time. In this context, the Markov property suggests that the distribution for this variable depends only on the distribution of the previous state.<br />
<br />
==Hidden Markov Models (HMM)==<br />
Markov models fail to address the scenario in which a series of states cannot be observed directly, but only through observations that are a probabilistic function of those hidden states. Markov models are extended to these scenarios, where each observation is a probabilistic function of the state. An example of an HMM is the formation of a DNA sequence: there is a hidden process that generates amino acids depending on some probabilities to determine an exact sequence. The main questions that can be answered with an HMM are the following:<br />
<br />
* How can one estimate the probability of occurrence of an observation sequence?<br />
* How can we choose the state sequence such that the joint probability of the observation sequence is maximized?<br />
* How can we describe an observation sequence through the model parameters?<br />
<br />
[[File:HMMorder1.png|thumb|right|Fig.37 Hidden Markov model of order 1.]]<br />
<br />
An example of an HMM of order 1 is displayed in Figure 37. The most common examples are in the study of gene analysis or gene sequencing, and the joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4,y_5) = P(y_1)P(y_2|y_1)P(y_3|y_2)P(y_4|y_3)P(y_5|y_4). </math></center><br />
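The factorization can be evaluated directly. The initial distribution and transition matrix below are made-up parameters for a binary first-order chain:<br />

```python
import numpy as np

# First-order chain joint: P(y1..y5) = P(y1) * prod_t P(y_t | y_{t-1}).
# Binary states with illustrative parameters.
p_init = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])      # A[i, j] = P(y_t = j | y_{t-1} = i)

def chain_prob(seq):
    """Evaluate the order-1 factorization for one state sequence."""
    p = p_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev, cur]
    return p

print(chain_prob([0, 0, 0, 1, 1]))   # 0.5 * 0.9 * 0.9 * 0.1 * 0.7
```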
<br />
[[File:HMMorder2.png|thumb|right|Fig.38 Hidden Markov model of order 2.]]<br />
<br />
An HMM of order 2 is displayed in Figure 38. The joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4) = P(y_1,y_2)P(y_3|y_1,y_2)P(y_4|y_2,y_3). </math></center><br />
<br />
In a Hidden Markov Model (HMM) we consider that we have two levels of random variables. The first level is called the hidden layer because the random variables in that level cannot be observed. The second layer is the observed or output layer. We can sample from the output layer but not the hidden layer. The only information we know about the hidden layer is that it affects the output layer. The HMM model can be graphed as shown in Figure 39. <br />
<br />
P.S. The latent variables in Figure 39 are discrete, since we are discussing HMMs; when such variables are continuous, factor analysis is the relevant model instead.<br />
[[File:HMM.png|thumb|right|Fig.39 Hidden Markov Model]]<br />
<br />
In the model the <math>q_i</math>s are the hidden layer and the <math>y_i</math>s are the output layer. The <math>y_i</math>s are shaded because they have been observed. The parameters that need to be estimated are <math> \theta = (\pi, A, \eta)</math>, where <math>\pi</math> is the distribution of the initial state <math>q_0</math>; in particular, <math>\pi_i</math> is the probability that <math>q_0</math> is in state <math>i</math>. The matrix <math>A</math> is the transition matrix for the states <math>q_t</math> and <math>q_{t+1}</math>, and gives the probability of changing states as we move from one step to the next. Finally, <math>\eta</math> is the parameter that determines the probability that <math>y_t</math> takes value <math>y^*</math> given that <math>q_t</math> is in state <math>q^*</math>. <br />
For the HMM our data comes from the output layer: <br />
<center><math> Data = (y_{0i}, y_{1i}, y_{2i}, ... , y_{Ti}) \text{ for } i = 1...n </math></center><br />
We can now write the joint pdf as:<br />
<center><math> P(q, y) = p(q_0)\prod_{t=0}^{T-1}P(q_{t+1}|q_t)\prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
We can use <math>a_{ij}</math> to represent the i,j entry in the matrix A. We can then define:<br />
<center><math> P(q_{t+1}|q_t) = \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} </math></center><br />
We can also define:<br />
<center><math> p(q_0) = \prod_{i=1}^M (\pi_i)^{q_0^i} </math></center><br />
Now, if we take Y to be multinomial we get:<br />
<center><math> P(y_t|q_t) = \prod_{i,j=1}^M (\eta_{ij})^{y_t^i q_t^j} </math></center><br />
The random variable Y does not have to be multinomial, this is just an example. We can combine the first two of these definitions back into the joint pdf to produce:<br />
<center><math> P(q, y) = \prod_{i=1}^M (\pi_i)^{q_0^i}\prod_{t=0}^{T-1} \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
We can go on to the E-Step with this new joint pdf. In the E-Step we need to find the expectation of the missing data given the observed data and the initial values of the parameters. Suppose that we only sample once so <math>n=1</math>. Take the log of our pdf and we get:<br />
<center><math> l_c(\theta; q, y) = \sum_{i=1}^M {q_0^i}\log(\pi_i) + \sum_{t=0}^{T-1} \sum_{i,j=1}^M {q_t^i q_{t+1}^j} \log(a_{ij}) + \sum_{t=0}^{T}\log(P(y_t|q_t)) </math></center><br />
Then we take the expectation for the E-Step:<br />
<center><math> E[l_c(\theta; q, y)] = \sum_{i=1}^M E[q_0^i]\log(\pi_i) + \sum_{t=0}^{T-1} \sum_{i,j=1}^M E[q_t^i q_{t+1}^j] \log(a_{ij}) + \sum_{t=0}^{T}E[\log(P(y_t|q_t))] </math></center><br />
If we continue with our multinomial example then we would get:<br />
<center><math> \sum_{t=0}^{T}E[log(P(y_t|q_t))] = \sum_{t=0}^{T}\sum_{i,j=1}^M E[q_t^j] y_t^i log(\eta_{ij}) </math></center><br />
So now we need to calculate <math>E[q_0^i]</math> and <math> E[q_t^i q_{t+1}^j] </math> in order to find the expectation of the log likelihood. Let's define some variables to represent each of these quantities. <br /><br />
Let <math> \gamma_0^i = E[q_0^i] = P(q_0^i=1|y, \theta^{(t)}) </math>. <br /><br />
Let <math> \xi_{t,t+1}^{ij} = E[q_t^i q_{t+1}^j] = P(q_t^i q_{t+1}^j|y, \theta^{(t)}) </math> .<br /><br />
We could use the sum product algorithm to calculate the above equations.<br />
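As a concrete sketch, <math>\gamma</math> and <math>\xi</math> can be computed with the standard forward-backward recursions, which amount to running sum-product on the chain. The parameters and the observation sequence below are illustrative assumptions:<br />

```python
import numpy as np

pi = np.array([0.6, 0.4])             # initial state distribution
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])            # A[i, j] = P(q_{t+1}=j | q_t=i)
eta = np.array([[0.9, 0.1],
                [0.2, 0.8]])          # eta[i, k] = P(y_t=k | q_t=i)
y = [0, 1, 0]                          # observed output sequence

T = len(y)
alpha = np.zeros((T, 2))
beta = np.ones((T, 2))
alpha[0] = pi * eta[:, y[0]]          # forward pass
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * eta[:, y[t]]
for t in range(T - 2, -1, -1):        # backward pass
    beta[t] = A @ (eta[:, y[t + 1]] * beta[t + 1])

gamma = alpha * beta                  # gamma[t, i] = P(q_t = i | y)
gamma /= gamma.sum(axis=1, keepdims=True)

# xi[t, i, j] = P(q_t = i, q_{t+1} = j | y)
xi = np.zeros((T - 1, 2, 2))
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * (eta[:, y[t + 1]] * beta[t + 1])[None, :]
    xi[t] /= xi[t].sum()

print(gamma[0])   # this is the vector of gamma_0^i from the notes
```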
<br />
==Graph Structure==<br />
Up to this point, we have covered many topics about graphical models, assuming that the graph structure is given. However, finding an optimal structure for a graphical model is a challenging problem in itself. In this section, we assume that the graphical model we are looking for is expressible in the form of a tree. To remind ourselves of the concept of a tree: an undirected graph is a tree if there is one and only one path between each pair of nodes. For directed graphs, on top of the mentioned condition, we also need to check that every node has at most one parent; in other words, there are no explaining-away structures.<br />
<br />
First, let us show that whether a graph is directed or undirected does not affect the joint distribution, as long as it is a tree. Here is how one can write down the joint distribution of the graph of Fig. XX.<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2).\,\!<br />
</math></center><br />
<br />
Now, if we change the direction of the connecting edge between <math>x_1</math> and <math>x_2</math>, we will have the graph of Fig. XX and the corresponding joint distribution function will change as follows:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_2)p(x_1|x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which can be simply re-written as:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1,x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which is the same as the first function. We rely on this very simple observation and leave the proof to the interested reader.<br />
<br />
===Maximum Likelihood Tree===<br />
We want to compute the tree that maximizes the likelihood of a given set of data. The optimality of a tree structure can be discussed in terms of the likelihood of the set of variables. To do so, we define a fully connected, weighted graph by setting each edge weight to a measure of the dependence between the connected nodes/random variables, and then run a maximum weight spanning tree algorithm. Here is how it works.<br />
<br />
We have defined the joint distribution as follows: <br />
<center><math><br />
p(x)=\prod_{i\in V}p(x_i)\prod_{i,j\in E}\frac{p(x_i,x_j)}{p(x_i)p(x_j)}<br />
</math></center><br />
where <math>V</math> and <math>E</math> are respectively the sets of vertices and edges of the corresponding graph. This holds as long as the graphical model has a tree structure, since the direction of the dependence of <math>x_i</math> on <math>x_j</math> can then be chosen arbitrarily; this is not the case for non-tree graphical models.<br />
<br />
Maximizing the joint probability distribution over the given set of data samples <math>X</math> for the purpose of parameter estimation (MLE), we have:<br />
<center><math><br />
L(\theta|X)=p(X|\theta)=\prod_{i\in V}p(x_i|\theta)\prod_{i,j\in E}\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
And by taking the logarithm of <math>L(\theta|X)</math> (log-likelihood), we will get:<br />
<br />
<center><math><br />
l=\sum_{i\in V}\log p(x_i|\theta)+\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
The first term in the above equation conveys nothing about the topology or the structure of the tree, as it is defined over single nodes. Since only the optimization of the tree structure concerns us here, the probabilities of the single nodes play no role in the optimization, so we can define the cost function for our optimization problem as:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
where the subscript r stands for "reduced". By replacing the probability functions with the frequencies of occurrence of each state, we obtain:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\sum_{s,t}N_{ijst}\log\frac{N\,N_{ijst}}{N_{is}N_{jt}}<br />
</math></center><br />
<br />
where we have assumed that <math>p(x_i,x_j)=\frac{N_{ijst}}{N}</math>, <math>p(x_i)=\frac{N_{is}}{N}</math>, and <math>p(x_j)=\frac{N_{jt}}{N}</math>. The resulting expression is, up to a factor of <math>N</math>, the mutual information of the two random variables <math>x_i</math> and <math>x_j</math>, where the former is in state <math>s</math> and the latter in state <math>t</math>.<br />
<br />
This is how the weights for the edges of a fully connected graph are defined. Now we run a maximum weight spanning tree algorithm on the resulting graph to find the optimal structure for the tree.<br />
It is important to note that this problem had been solved in graph theory before graphical models were developed. Our problem here was entirely probabilistic, but using graphical models we could find an equivalent graph theory problem. This shows how graphical models can help us use powerful graph theory tools to solve probabilistic problems.<br />
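The whole procedure (empirical mutual-information edge weights followed by a maximum weight spanning tree) can be sketched as follows. The synthetic binary data set, in which the second variable is a noisy copy of the first and the third a noisy copy of the second, is an assumption made purely for illustration:<br />

```python
import itertools
import numpy as np

# Synthetic binary data over 4 variables: x1 depends on x0, x2 on x1,
# x3 is independent noise. These dependencies are made up for the demo.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 4))
X[:, 1] = X[:, 0] ^ (rng.random(500) < 0.1)   # noisy copy of x0
X[:, 2] = X[:, 1] ^ (rng.random(500) < 0.1)   # noisy copy of x1

def mutual_info(a, b):
    """Empirical mutual information of two binary columns."""
    mi, n = 0.0, len(a)
    for s in (0, 1):
        for t in (0, 1):
            n_st = np.sum((a == s) & (b == t))
            if n_st == 0:
                continue
            p_st = n_st / n
            mi += p_st * np.log(p_st / (np.mean(a == s) * np.mean(b == t)))
    return mi

weights = {(i, j): mutual_info(X[:, i], X[:, j])
           for i, j in itertools.combinations(range(4), 2)}

# Maximum weight spanning tree: greedy Kruskal with union-find.
parent = list(range(4))
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

tree = []
for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
    ri, rj = find(i), find(j)
    if ri != rj:            # keep the edge only if it creates no cycle
        parent[ri] = rj
        tree.append((i, j))
print(tree)   # should recover the edges linking 0-1 and 1-2
```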
<br />
==Latent Variable Models==<br />
(beginning of Oct. 20) Assuming that we have thoroughly observed, or even identified, all of the random variables of a model can be a very naive assumption, as one can think of many contrary instances. To make a model as rich as possible (there is always a trade-off between richness and complexity, so we do not want to inject unnecessary complexity into our model either), the concept of latent variables has been introduced into graphical models.<br />
<br />
First let's define latent variables. Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.<br />
<br />
Depending on the position of an unobserved variable <math>z</math>, we take different actions. If no variable is conditioned on <math>z</math>, we can integrate/sum it out and it will never be noticed, as it is neither an evidence nor a query variable. However, we are required to model an unobserved variable like <math>z</math> if other variables are conditioned on it.<br />
<br />
The use of latent variables makes a model harder to analyze and to learn. Taking the log-likelihood normally makes the objective function easier to handle, as the log of a product becomes a sum of logs; but this is no longer the case once latent variables are introduced, since the resulting joint probability contains a sum inside the logarithm, which prevents the log from acting on the product.<br />
<br />
<center><math><br />
l(\theta,D) = \log\sum_{z}p(x,z|\theta).\,<br />
</math></center><br />
<br />
As an example of latent variables, one can think of a mixture density model. Several component models come together to build the final model, but an additional random variable is needed to indicate which of those components generated each new sample point. This affects both the learning and recall phases.<br />
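To make the obstruction concrete, here is a minimal numerical sketch (the two-component Gaussian mixture and all parameter values below are made up for illustration): the sum over the latent variable sits inside the log, so the log-likelihood of a mixture does not split into a sum of per-component logs.<br />

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(data, alpha, mu1, s1, mu2, s2):
    # l(theta; x) = sum_i log( sum_z p(x_i, z | theta) ): the sum over the
    # latent z stays inside the log, so it cannot be split into a sum of logs.
    return sum(math.log(alpha * normal_pdf(x, mu1, s1)
                        + (1 - alpha) * normal_pdf(x, mu2, s2))
               for x in data)

# Made-up data: two loose clusters around 0 and 4.
data = [0.1, -0.2, 3.9, 4.2]
print(mixture_log_likelihood(data, 0.5, 0.0, 1.0, 4.0, 1.0))
```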
<br />
== EM Algorithm ==<br />
Oct. 25th<br />
=== Introduction ===<br />
In the last section, graphical models with latent variables were discussed. It was mentioned that, for example, if fitting a standard distribution to a data set is too complex, one may model the data set using a mixture of well-known distributions such as the Gaussian. A hidden variable is then needed to determine the weight of each Gaussian component. Parameter learning in graphical models with latent variables is more complicated than in fully observed models.<br />
<br />
Consider Fig.40, which depicts a simple graphical model with two nodes. By convention, the unobserved variable <math> Z </math> is unshaded. To compare the complexity of fully observed models with that of models containing hidden variables, let's first suppose variables <math> Z </math> and <math> X </math> are both observed. We may interpret this problem as a classification problem where <math> Z </math> is the class label and <math> X </math> is the data. In addition, we assume the distribution over members of each group is Gaussian. Thus, the learning process is to determine the label <math> Z </math> from the training set by maximizing the posterior: <br />
<br />
[[File:GMwithLatent.png|thumb|right|Fig.40 A simple graphical model with a latent variable.]]<br />
<br />
<center><math><br />
P(z|x) = \frac{P(x|z)P(z)}{P(x)},<br />
</math></center> <br />
<br />
For simplicity, we assume there are two classes generating the data set <math> X</math>, <math> Z = 1 </math> and <math> Z = 0 </math>. The posterior <math> P(z=1|x) </math> can be easily computed using:<br />
<br />
<center><math><br />
P(z = 1|x) = \frac{N(x; \mu_1, \sigma_1)\pi_1}{N(x; \mu_1, \sigma_1)\pi_1 + N(x; \mu_0, \sigma_0)\pi_0},<br />
</math></center> <br />
<br />
On the contrary, if <math> Z </math> is unknown we are not able to easily write the posterior and consequently parameter estimation is more difficult. In the case of graphical models with latent variables, we first assume the latent variable is somehow known, and thus writing the posterior becomes easy. Then, we are going to make the estimation of <math> Z </math> more accurate. For instance, if the task is to fit a set of data derived from unknown sources with mixtures of Gaussian distribution, we may assume the data is derived from two sources whose distributions are Gaussian. The first estimation might not be accurate, yet we introduce an algorithm by which the estimation is becoming more accurate using an iterative approach. In this section we see how the parameter learning for these graphical models is performed using EM algorithm.<br />
<br />
=== EM Method ===<br />
<br />
The EM (Expectation-Maximization) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. Consider a probabilistic model in which we collectively denote all of the observed variables by X and all of the hidden variables by Z, resulting in a simple graphical model with two nodes (Fig. 40). The joint distribution<br />
<math> p(X,Z|\theta) </math> is governed by a set of parameters, <math>\theta</math>. The task is to maximize the likelihood function that is given by:<br />
<br />
<center><math><br />
l_c(\theta; x,z) = log P(x,z | \theta)<br />
</math></center> <br />
<br />
<br />
which is called the "complete log likelihood". In the above equation the <math>x</math> values represent the data as before and the <math>z</math> values represent the missing data (sometimes called latent data). The question now is how we can calculate the values of the parameters <math>\theta_i</math> if we do not have all the data we need. We can use the Expectation-Maximization (EM) algorithm to estimate the parameters of the model even though we do not have a complete data set. <br /><br />
To simplify the problem we define the following type of likelihood:<br />
<br />
<center><math><br />
l(\theta; x) = log(P(x | \theta))<br />
</math></center> <br />
<br />
which is called the "incomplete log likelihood". We can rewrite the incomplete likelihood in terms of the complete likelihood. This equation is written for the discrete case; to convert to the continuous case, all we have to do is turn the summation into an integral. <br />
<center><math> l(\theta; x) = log(P(x | \theta)) = log(\sum_zP(x, z|\theta)) </math></center><br />
Since <math>z</math> has not been observed, <math>l_c</math> is in fact a random quantity. In that case we can take the expectation of <math>l_c</math> with respect to some arbitrary density function <math>q(z|x)</math>. <br />
<br />
<center><math> l(\theta;x) = log P(x|\theta) = log \sum_z P(x,z|\theta) = log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} </math></center><br />
<br />
====Jensen's Inequality====<br />
In order to properly derive the formula for the EM algorithm we need to first introduce the following theorem. <br />
<br />
For any '''concave''' function f: <br />
<center><math> f(\alpha x_1 + (1-\alpha)x_2) \geqslant \alpha f(x_1) + (1-\alpha)f(x_2) </math></center> <br />
This can be shown intuitively through a graph. In Fig. 41, point A is the point on the function <math>f</math> and point B is the value represented by the right-hand side of the inequality. On the graph one can see why point A is not smaller than point B for a concave function (the inequality is reversed for a convex function). <br />
<br />
[[File:inequality.png|thumb|right|Fig.41 Jensen's Inequality]]<br />
<br />
For us it is important that the log function is '''concave''' , and thus:<br />
<br />
<center><math><br />
log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} \geqslant \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} = F(\theta, q) <br />
</math></center><br />
<br />
The function <math> F (\theta, q) </math> is called the auxiliary function and it is used in the EM algorithm. As seen in the above equation, <math> F(\theta, q) </math> is a lower bound on the incomplete log likelihood, and one way to maximize the incomplete likelihood is to increase this lower bound. The EM algorithm repeats two steps, one after the other, giving better and better estimates for <math>q(z|x)</math> and <math>\theta</math>. As the steps are repeated, the parameters converge to a local maximum of the likelihood function. <br />
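The lower-bound property can be checked numerically. The following sketch uses a made-up joint table <math>p(x,z|\theta)</math> for a single observation and two latent states; it verifies that <math>F(\theta, q) \le l(\theta;x)</math> for several arbitrary choices of <math>q</math>, with equality when <math>q</math> is the posterior.<br />

```python
import math

# Hypothetical joint p(x, z | theta) for one observed x and z in {0, 1};
# the two numbers are made up, only positivity matters.
p_xz = {0: 0.12, 1: 0.28}
incomplete_ll = math.log(sum(p_xz.values()))   # l(theta; x) = log p(x | theta)

def F(q):
    # Auxiliary function F(theta, q) = sum_z q(z) log( p(x, z | theta) / q(z) ).
    return sum(q[z] * math.log(p_xz[z] / q[z]) for z in p_xz)

# Any distribution q yields a lower bound on the incomplete log-likelihood ...
for q in ({0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}):
    assert F(q) <= incomplete_ll + 1e-12

# ... and the bound is tight at the posterior q(z) = p(z | x, theta).
posterior = {z: p / sum(p_xz.values()) for z, p in p_xz.items()}
assert abs(F(posterior) - incomplete_ll) < 1e-12
```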
<br />
In the first step we assume <math> \theta </math> is known, and the goal is to find the <math> q </math> that maximizes the lower bound. In the second step, we suppose <math> q </math> is known and find <math> \theta </math>. In other words:<br />
<br />
'''E-Step''' <br />
<center><math> q^{t+1} = argmax_{q} F(\theta^t, q) </math></center><br />
<br />
'''M-Step'''<br />
<center><math> \theta^{t+1} = argmax_{\theta} F(\theta, q^{t+1}) </math></center><br />
<br />
==== M-Step Explanation ====<br />
<br />
<center><math>\begin{matrix}<br />
F(q;\theta) & = & \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} \\<br />
& = & \sum_z q(z|x)log(P(x,z|\theta)) - \sum_z q(z|x)log(q(z|x))\\<br />
\end{matrix}</math></center><br />
<br />
Since the second part of the equation is only a constant with respect to <math>\theta</math>, in the M-step we only need to maximize the expectation of the COMPLETE likelihood. The complete likelihood is the only part that still depends on <math>\theta</math>.<br />
<br />
==== E-Step Explanation ====<br />
<br />
In this step we are trying to find an estimate for <math>q(z|x)</math>. To do this we have to maximize <math> F(q;\theta^{(t)})</math>.<br />
<center><math><br />
F(q;\theta^{t}) = \sum_z q(z|x) log(\frac{P(x,z|\theta)}{q(z|x)}) <br />
</math></center><br />
<br />
'''Claim:''' It can be shown that to maximize the auxiliary function one should set <math>q(z|x)</math> to <math> p(z|x,\theta^{(t)})</math>. Replacing <math>q(z|x)</math> with <math>P(z|x,\theta^{(t)})</math> results in:<br />
<center><math>\begin{matrix}<br />
F(q;\theta^{t}) & = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(x,z|\theta)}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(z|x,\theta^{(t)})P(x|\theta^{(t)})}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(P(x|\theta^{(t)})) \\<br />
& = & log(P(x|\theta^{(t)})) \\<br />
& = & l(\theta; x)<br />
\end{matrix}</math></center><br />
<br />
Recall that <math>F(q;\theta^{(t)})</math> is a lower bound on <math> l(\theta; x) </math>; since the choice <math>q(z|x) = P(z|x,\theta^{(t)})</math> makes the bound equal to <math> l(\theta; x) </math>, it is indeed the maximizer of <math>F(q;\theta^{(t)})</math>. Note that this optimal <math>q</math> depends on the current estimate <math>\theta^{(t)}</math>, so the E-step is recomputed at each iteration before the corresponding M-step. <br />
<br />
The EM algorithm is a two-stage iterative optimization technique for finding<br />
maximum likelihood solutions. Suppose that the current value of the parameter vector is <math> \theta^t </math>. In the E step, the<br />
lower bound <math> F(q, \theta^t) </math> is maximized with respect to <math> q(z|x) </math> while <math> \theta^t </math> is fixed.<br />
As was mentioned above, the solution to this maximization problem is to set <math> q(z|x) </math> to <math> p(z|x,\theta^t) </math>: since the value of the incomplete log likelihood, <math> log \, p(X|\theta^t) </math>, does not depend on <math> q(z|x) </math>, the largest value of <math> F(q, \theta^t) </math> is achieved with this choice. In this case the lower bound equals the incomplete log likelihood.<br />
<br />
=== Alternative steps for the EM algorithms ===<br />
From the above results we can find an alternative representation for the EM algorithm, reducing it to: <br />
<br />
'''E-Step''' <br /><br />
Find <math> E[l_c(\theta; x, z)]_{P(z|x, \theta^{(t)})} </math> using the current estimate <math>\theta^{(t)}</math>. <br /><br />
'''M-Step''' <br /><br />
Maximise <math> E[l_c(\theta; x, z)]_{P(z|x, \theta^{(t)})} </math> with respect to <math>\theta</math>. <br />
<br />
The EM Algorithm is probably best understood through examples.<br />
<br />
====EM Algorithm Example====<br />
<br />
Suppose we have the two independent and identically distributed random variables:<br />
<center><math> Y_1, Y_2 \sim P(y|\theta) = \theta e^{-\theta y} </math></center><br />
In our case <math>y_1 = 5</math> has been observed but <math>y_2</math> has not. Our task is to find an estimate for <math>\theta</math>. We will first try to solve the problem without the EM algorithm. Luckily this problem is simple enough to be solvable without the need for EM. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta} \\<br />
l(\theta; Data) & = & log(\theta)- 5\theta<br />
\end{matrix}</math></center><br />
We take our derivative:<br />
<center><math>\begin{matrix}<br />
& \frac{dl}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{1}{\theta}-5 & = 0 \\<br />
\Rightarrow & \theta & = 0.2<br />
\end{matrix}</math></center><br />
And now we can try the same problem with the EM Algorithm. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta}\theta e^{-y_2\theta} \\<br />
l(\theta; Data) & = & 2log(\theta) - 5\theta - y_2\theta<br />
\end{matrix}</math></center><br />
E-Step <br />
<center><math> E[l_c(\theta; Data)]_{P(y_2|y_1, \theta)} = 2log(\theta) - 5\theta - \frac{\theta}{\theta^{(t)}}</math></center><br />
M-Step<br />
<center><math>\begin{matrix}<br />
& \frac{dl_c}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{2}{\theta}-5 - \frac{1}{\theta^{(t)}} & = 0 \\<br />
\Rightarrow & \theta^{(t+1)} & = \frac{2\theta^{(t)}}{5\theta^{(t)}+1}<br />
\end{matrix}</math></center><br />
Now we pick an initial value for <math>\theta</math>. Usually we want to pick something reasonable. In this case it does not matter that much and we can pick <math>\theta = 10</math>. Now we repeat the M-Step until the value converges.<br />
<center><math>\begin{matrix}<br />
\theta^{(1)} & = & 10 \\<br />
\theta^{(2)} & = & 0.392 \\<br />
\theta^{(3)} & = & 0.2648 \\<br />
... & & \\<br />
\theta^{(k)} & \simeq & 0.2 <br />
\end{matrix}</math></center><br />
And as we can see after a number of steps the value converges to the correct answer of 0.2. In the next section we will discuss a more complex model where it would be difficult to solve the problem without the EM Algorithm.<br />
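The iteration above is easy to reproduce in a few lines of code. This is a sketch of this specific worked example, not a general EM implementation:<br />

```python
def em_update(theta_t):
    # M-step for this example: maximizing
    # E[l_c] = 2 log(theta) - 5 theta - theta / theta_t  over theta
    # gives theta^(t+1) = 2 theta^(t) / (5 theta^(t) + 1).
    return 2 * theta_t / (5 * theta_t + 1)

theta = 10.0                 # the deliberately poor starting value used above
for _ in range(100):
    theta = em_update(theta)
print(round(theta, 4))       # 0.2, the same answer as the direct ML estimate
```

The first few iterates match the table above (10, 0.392, 0.2648, ...), and the fixed point of the update is exactly <math>\theta = 0.2</math>.<br />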
<br />
===Mixture Models===<br />
In this section we discuss what will happen if the random variables are not identically distributed. The data will now sometimes be sampled from one distribution and sometimes from another. <br />
<br />
====Mixture of Gaussian ====<br />
<br />
Given <math>P(x|\theta) = \alpha N(x;\mu_1,\sigma_1) + (1-\alpha)N(x;\mu_2,\sigma_2)</math>. We sample the data, <math>Data = \{x_1,x_2...x_n\} </math> and we know that <math>x_1,x_2...x_n</math> are iid. from <math>P(x|\theta)</math>.<br /><br />
We would like to find:<br />
<center><math>\theta = \{\alpha,\mu_1,\sigma_1,\mu_2,\sigma_2\} </math></center><br />
<br />
We have no missing data here so we can try to find the parameter estimates using the ML method. <br />
<center><math> L(\theta; Data) = \prod_{i=1}^{n} \left( \alpha N(x_i; \mu_1, \sigma_1) + (1 - \alpha) N(x_i; \mu_2, \sigma_2) \right) </math></center><br />
We would then take the log to find <math>l(\theta; Data)</math>, take the derivative with respect to each parameter, and set each derivative equal to zero. That is a lot of work, because the log of a sum does not simplify and we have 5 parameters. <br /><br />
It is actually easier to apply the EM algorithm. The only thing is that the EM algorithm works with missing data and here we have all of our data. The solution is to introduce a latent variable z. We are basically introducing missing data to make the calculation easier to compute. <br />
<center><math> z_i = 1 \text{ with prob. } \alpha </math></center><br />
<center><math> z_i = 0 \text{ with prob. } (1-\alpha) </math></center><br />
Now we have a data set that includes our latent variable <math>z_i</math>:<br />
<center><math> Data = \{(x_1,z_1),(x_2,z_2)...(x_n,z_n)\} </math></center><br />
We can calculate the joint pdf by: <br />
<center><math> P(x_i,z_i|\theta)=P(x_i|z_i,\theta)P(z_i|\theta) </math></center><br />
Let<br />
<center><math> P(x_i|z_i,\theta)=
\begin{cases}
\phi_1(x_i)=N(x;\mu_1,\sigma_1) & \mbox{if } z_i = 1 \\
\phi_2(x_i)=N(x;\mu_2,\sigma_2) & \mbox{if } z_i = 0
\end{cases} </math></center><br />
Now we can write <br />
<center><math> P(x_i|z_i,\theta)=\phi_1(x_i)^{z_i} \phi_2(x_i)^{1-z_i} </math></center><br />
and <br />
<center><math> P(z_i)=\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
We can write the joint pdf as:<br />
<center><math> P(x_i,z_i|\theta)=\phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
From the joint pdf we can get the likelihood function as: <br />
<center><math> L(\theta;D)=\prod_{i=1}^n \phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
Then take the log and find the log likelihood:<br />
<center><math> l_c(\theta;D)=\sum_{i=1}^n z_i log\phi_1(x_i) + (1-z_i)log\phi_2(x_i) + z_ilog\alpha + (1-z_i)log(1-\alpha) </math></center><br />
In the E-step we need to find the expectation of <math>l_c</math><br />
<center><math> E[l_c(\theta;D)] = \sum_{i=1}^n E[z_i]log\phi_1(x_i)+(1-E[z_i])log\phi_2(x_i)+E[z_i]log\alpha+(1-E[z_i])log(1-\alpha) </math></center><br />
For now we can assume that <math><z_i></math> is known and assign it a value, let <math> <z_i>=w_i</math><br /><br />
In the M-step, we update the parameters, treating the expectation as fixed:<br />
<center><math> \theta^{(t+1)} \leftarrow argmax_{\theta} E[l_c(\theta;D)] </math></center><br />
Taking partial derivatives of the complete log likelihood with respect to the parameters and setting them equal to zero, we get our estimated parameters at step (t+1).<br />
<center><math>\begin{matrix}<br />
\frac{d}{d\alpha} = 0 \Rightarrow & \sum_{i=1}^n \frac{w_i}{\alpha}-\frac{1-w_i}{1-\alpha} = 0 & \Rightarrow \alpha=\frac{\sum_{i=1}^n w_i}{n} \\<br />
\frac{d}{d\mu_1} = 0 \Rightarrow & \sum_{i=1}^n w_i(x_i-\mu_1)=0 & \Rightarrow \mu_1=\frac{\sum_{i=1}^n w_ix_i}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\mu_2}=0 \Rightarrow & \sum_{i=1}^n (1-w_i)(x_i-\mu_2)=0 & \Rightarrow \mu_2=\frac{\sum_{i=1}^n (1-w_i)x_i}{\sum_{i=1}^n (1-w_i)} \\<br />
\frac{d}{d\sigma_1^2} = 0 \Rightarrow & \sum_{i=1}^n w_i(-\frac{1}{2\sigma_1^{2}}+\frac{(x_i-\mu_1)^2}{2\sigma_1^4})=0 & \Rightarrow \sigma_1^2=\frac{\sum_{i=1}^n w_i(x_i-\mu_1)^2}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\sigma_2^2} = 0 \Rightarrow & \sum_{i=1}^n (1-w_i)(-\frac{1}{2\sigma_2^{2}}+\frac{(x_i-\mu_2)^2}{2\sigma_2^4})=0 & \Rightarrow \sigma_2^2=\frac{\sum_{i=1}^n (1-w_i)(x_i-\mu_2)^2}{\sum_{i=1}^n (1-w_i)}<br />
\end{matrix}</math></center><br />
We can verify that the results of the estimated parameters all make sense by considering what we know about the ML estimates from the standard Gaussian. But we are not done yet. We still need to compute <math><z_i>=w_i</math> in the E-step. <br />
<center><math>\begin{matrix}<br />
<z_i> & = & E_{z_i|x_i,\theta^{(t)}}(z_i) \\<br />
& = & \sum_z z_i P(z_i|x_i,\theta^{(t)}) \\<br />
& = & 1\times P(z_i=1|x_i,\theta^{(t)}) + 0\times P(z_i=0|x_i,\theta^{(t)}) \\<br />
& = & P(z_i=1|x_i,\theta^{(t)}) \\<br />
P(z_i=1|x_i,\theta^{(t)}) & = & \frac{P(z_i=1,x_i|\theta^{(t)})}{P(x_i|\theta^{(t)})} \\<br />
& = & \frac {P(z_i=1,x_i|\theta^{(t)})}{P(z_i=1,x_i|\theta^{(t)}) + P(z_i=0,x_i|\theta^{(t)})} \\<br />
& = & \frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})}<br />
\end{matrix}</math></center><br />
We can now combine the two steps and we get the expectation <br />
<center><math>E[z_i] =\frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})} </math></center><br />
Using the above results for the estimated parameters in the M-step we can evaluate the parameters at (t+2),(t+3)...until they converge and we get our estimated value for each of the parameters.<br />
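The E-step responsibilities and the closed-form M-step updates above translate directly into code. The sketch below is a bare-bones illustration on made-up synthetic data with a crude deterministic initialization; a production implementation would add log-domain arithmetic and convergence checks.<br />

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_gaussian_mixture(data, n_iter=50):
    # Crude but deterministic initialization; a real implementation would be
    # more careful (e.g. k-means initialization).
    alpha, mu1, mu2 = 0.5, min(data), max(data)
    s1 = s2 = (max(data) - min(data)) / 4
    for _ in range(n_iter):
        # E-step: responsibilities w_i = P(z_i = 1 | x_i, theta^(t))
        w = [alpha * normal_pdf(x, mu1, s1)
             / (alpha * normal_pdf(x, mu1, s1) + (1 - alpha) * normal_pdf(x, mu2, s2))
             for x in data]
        # M-step: the closed-form updates from the zero-derivative conditions
        n, sw = len(data), sum(w)
        alpha = sw / n
        mu1 = sum(wi * x for wi, x in zip(w, data)) / sw
        mu2 = sum((1 - wi) * x for wi, x in zip(w, data)) / (n - sw)
        s1 = math.sqrt(sum(wi * (x - mu1) ** 2 for wi, x in zip(w, data)) / sw)
        s2 = math.sqrt(sum((1 - wi) * (x - mu2) ** 2 for wi, x in zip(w, data)) / (n - sw))
    return alpha, mu1, s1, mu2, s2

# Made-up synthetic data: an equal mixture of N(0, 1) and N(5, 1).
random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
alpha, mu1, s1, mu2, s2 = em_gaussian_mixture(data)
print(round(alpha, 2), round(mu1, 1), round(mu2, 1))
```

With the clusters this well separated, the estimates land close to the true values <math>\alpha = 0.5</math>, <math>\mu_1 = 0</math>, <math>\mu_2 = 5</math> within a few dozen iterations.<br />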
<br />
<br />
The mixture model can be summarized as:<br />
<br />
* In each step, a state will be selected according to <math>p(z)</math>. <br />
* Given a state, a data vector is drawn from <math>p(x|z)</math>.<br />
* The value of each state is independent of the previous state.<br />
<br />
A good example of a mixture model can be seen in this example with two coins. Assume that there are two different coins that are not fair. Suppose that the probabilities for each coin are as shown in the table. <br /><br />
{| class="wikitable"
! !! H !! T
|-
! coin 1
| 0.3 || 0.7
|-
! coin 2
| 0.1 || 0.9
|}
We can choose one coin at random and toss it in the air to see the outcome. Then we place the coin back in the pocket with the other one and once again select one coin at random to toss. The resulting sequence of outcomes, for example HHTH...HTTHT, follows a mixture model. In this model the probability depends on which coin was used to make the toss and on the probability with which we select each coin. For example, if we were to select coin 1 most of the time then we would see more heads than if we were to choose coin 2 most of the time.<br />
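This coin mixture is easy to simulate. The sketch below assumes the two coins are chosen uniformly at random, since the selection probability was not specified above:<br />

```python
import random

# Probabilities of heads for each coin, as in the table above; choosing
# between the coins uniformly at random is an assumption for this sketch.
p_heads = {"coin1": 0.3, "coin2": 0.1}

def toss():
    coin = random.choice(["coin1", "coin2"])                # sample z ~ p(z)
    return "H" if random.random() < p_heads[coin] else "T"  # sample x ~ p(x|z)

random.seed(1)
sequence = "".join(toss() for _ in range(10000))
# Marginal P(H) = 0.5 * 0.3 + 0.5 * 0.1 = 0.2, so about 20% heads.
print(sequence.count("H") / len(sequence))
```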
<br />
[[File:dired.png|thumb|right|Fig.1 A directed graph.]]<br />
<br />
=Appendix: Graph Drawing Tools=<br />
===Graphviz===<br />
[http://www.graphviz.org/ Website]<br />
<br />
"Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains."<br />
<ref>http://www.graphviz.org/</ref><br />
<br />
There is a wiki extension developed, called Wikitex, which makes it possible to make use of this package in wiki pages. [http://wikisophia.org/wiki/Wikitex#Graph Here] is an example.<br />
<br />
===AISee===<br />
[http://www.aisee.com/ Website]<br />
<br />
AISee is a commercial graph visualization software. The free trial version has almost all the features of the full version except that it should not be used for commercial purposes.<br />
<br />
===TikZ===<br />
[http://www.texample.net/tikz/ Website]<br />
<br />
"TikZ and PGF are TeX packages for creating graphics programmatically. TikZ is built on top of PGF and allows you to create sophisticated graphics in a rather intuitive and easy manner." <ref><br />
http://www.texample.net/tikz/<br />
</ref><br />
<br />
===Xfig===<br />
"Xfig" is an open source drawing software used to create objects of various geometry. It can be installed on both Windows and Unix-based machines. <br />
[http://www.xfig.org/ Website]</div>
<hr />
<div>==[[f11stat946EditorSignUp| Editor Sign Up]]==<br />
==[[f11Stat946presentation| Sign up for your presentation]]==<br />
==[[f11Stat946ass| Assignments]]==<br />
==Introduction==<br />
===Motivation===<br />
Graphical probabilistic models provide a concise representation of various probabilistic distributions that are found in many<br />
real world applications. Some interesting areas include medical diagnosis, computer vision, language, analyzing gene expression <br />
data, etc. A problem related to medical diagnosis is, "detecting and quantifying the causes of a disease". This question can<br />
be addressed through the graphical representation of relationships between various random variables (both observed and hidden).<br />
This is an efficient way of representing a joint probability distribution.<br />
<br />
Graphical models are excellent tools for reducing the computational load of probabilistic models. Suppose we want to model a binary image. If we have a 256 by 256 image then our distribution function has <math>2^{256*256}=2^{65536}</math> outcomes. Even very simple tasks, such as marginalizing such a probability distribution over some variables, can be computationally intractable, and the load grows exponentially with the number of variables. In practice, and in real world applications, we generally have some kind of dependency or relation between the variables, and using such information can help us simplify the calculations. For example, for the same problem, if all the image pixels can be assumed to be independent, marginalization can be done easily. Graphs are a good tool for depicting such relations. Using some rules we can represent a probability distribution uniquely by a graph, and then it is easier to study the graph instead of the probability distribution function (PDF). We can also take advantage of graph theory tools to design algorithms. Though it may seem simple, this approach simplifies the computations and, as mentioned, helps us solve many problems in different research areas.<br />
<br />
===Notation===<br />
<br />
We will begin with short section about the notation used in these notes.<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
===Example===<br />
Let <math>A = \{1,4\}</math>, so <math>X_A = \{X_1, X_4\}</math>; <math>A</math> is the set of indices for<br />
the r.v. <math>X_A</math>.<br /><br />
Also let <math>B = \{2\},\ X_B = \{X_2\}</math> so we can write<br />
<center><math>P( X_A | X_B ) = P( X_1 = x_1, X_4 = x_4 | X_2 = x_2 ).\,\!</math></center><br />
<br />
===Graphical Models===<br />
Graphical models provide a compact representation of the joint distribution, where the vertices (nodes) V represent random variables and the edges E represent dependencies between the variables. There are two forms of graphical models: directed and undirected. Directed graphical models (Figure 1) consist of arcs and nodes, where an arc indicates that the parent is an explanatory variable for the child. Undirected graphical models (Figure 2) are based on the assumption that two nodes, or two sets of nodes, are conditionally independent given their neighbours [http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 1].<br />
<br />
Similar types of analysis predate the area of probabilistic graphical models and its terminology. "Bayesian network" and "belief network" are earlier terms used to describe a directed acyclic graphical model. Similarly, "Markov random field" (MRF) and "Markov network" are earlier terms used to describe an undirected graphical model. Probabilistic graphical models unite some of the theory behind these older formulations and allow for more generalized distributions than were possible in the previous methods.<br />
<br />
[[File:directed.png|thumb|right|Fig.1 A directed graph.]]<br />
[[File:undirected.png|thumb|right|Fig.2 An undirected graph.]]<br />
<br />
We will use graphs in this course to represent the relationship between different random variables. <br />
<br />
====Directed graphical models (Bayesian networks)====<br />
<br />
In the case of directed graphs, the direction of an arrow indicates "causation". This assumption makes these networks useful for cases where we want to model causality, so they are well suited to applications such as computational biology and bioinformatics, where we study the effect of some variables on another variable. For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>A\,\!</math> "causes" <math>B\,\!</math>.<br />
<br />
In this case we must assume that our directed graphs are ''acyclic''. An example of an acyclic graphical model from medicine is shown in Figure 2a.<br />
[[File:acyclicgraph.png|thumb|right|Fig.2a Sample acyclic directed graph.]]<br />
<br />
Exposure to ionizing radiation (such as CT scans, X-rays, etc.) and to environmental factors might lead to gene mutations that eventually give rise to cancer. Figure 2a can be called a causation graph.<br />
<br />
If our causation graph contains a cycle then it would mean that for example:<br />
<br />
* <math>A</math> causes <math>B</math><br />
* <math>B</math> causes <math>C</math><br />
* <math>C</math> causes <math>A</math>, again. <br />
<br />
Clearly, this would confuse the order of the events. An example of a graph with a cycle can be seen in Figure 3. Such a graph could not be used to represent causation. The graph in Figure 4 does not have cycle and we can say that the node <math>X_1</math> causes, or affects, <math>X_2</math> and <math>X_3</math> while they in turn cause <math>X_4</math>.<br />
<br />
[[File:cyclic.png|thumb|right|Fig.3 A cyclic graph.]]<br />
[[File:acyclic.png|thumb|right|Fig.4 An acyclic graph.]]<br />
<br />
In directed acyclic graphical models each vertex represents a random variable; a random variable associated with one vertex is distinct from the random variables associated with other vertices. Consider the following example that uses boolean random variables. It is important to note that the variables need not be boolean and can indeed be discrete over a range or even continuous.<br />
<br />
Speaking about random variables, we can now refer to the relationship between random variables in terms of dependence. Therefore, the direction of the arrow indicates "conditional dependence". For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>B\,\!</math> "is dependent on" <math>A\,\!</math>.<br />
<br />
Note that if we do not have any conditional independence, the corresponding graph will be complete, i.e., all possible edges will be present; whereas if we have full independence, our graph will have no edges. Between these two extreme cases there exists a large class of graphs. Graphical models are most useful when the graph is sparse, i.e., when only a small number of edges are present. The topology of the graph is important, and later we will see examples where graph theory tools can be used to solve probabilistic problems. On the other hand, this representation makes it easier to model causality between variables in real world phenomena.<br />
<br />
====Example====<br />
<br />
In this example we will consider the possible causes for wet grass. <br />
<br />
The wet grass could be caused by rain, or a sprinkler. Rain can be caused by clouds. On the other hand one can not say that clouds cause the use of a sprinkler. However, the causation exists because the presence of clouds does affect whether or not a sprinkler will be used. If there are more clouds there is a smaller probability that one will rely on a sprinkler to water the grass. As we can see from this example the relationship between two variables can also act like a negative correlation. The corresponding graphical model is shown in Figure 5.<br />
<br />
[[File:wetgrass.png|thumb|right|Fig.5 The wet grass example.]]<br />
<br />
This directed graph shows the relation between the 4 random variables. If we have<br />
the joint probability <math>P(C,R,S,W)</math>, then we can answer many queries about this<br />
system.<br />
<br />
This all seems very simple at first, but then we must consider the fact that in the discrete case the joint probability function grows exponentially with the number of variables. If we consider the wet grass example once more, we can see that we need to define <math>2^4 = 16</math> different probabilities for this simple example. The table below, which contains all of the probabilities and their corresponding boolean values for each random variable, is called an ''interaction table''.<br />
<br />
'''Example:'''<br />
<center><math>\begin{matrix}<br />
P(C,R,S,W):\\<br />
p_1\\<br />
p_2\\<br />
p_3\\<br />
.\\<br />
.\\<br />
.\\<br />
p_{16} \\ \\<br />
\end{matrix}</math></center><br />
<br /><br /><br />
<center><math>\begin{matrix}<br />
~~~ & C & R & S & W \\<br />
& 0 & 0 & 0 & 0 \\<br />
& 0 & 0 & 0 & 1 \\<br />
& 0 & 0 & 1 & 0 \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& 1 & 1 & 1 & 1 \\<br />
\end{matrix}</math></center><br />
<br />
Now consider an example where there are not 4 such random variables but 400. The interaction table would become too large to manage. In fact, it would require <math>2^{400}</math> rows! The purpose of the graph is to help avoid this intractability by considering only the variables that are directly related. In the wet grass example Sprinkler (S) and Rain (R) are not directly related. <br />
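A quick back-of-the-envelope check of these sizes (the per-node parameter counts below are an illustration assuming binary variables, with one free parameter per parent configuration):<br />

```python
# Full joint table over n binary variables versus the factorized wet-grass
# model; parameter counts assume one free parameter per parent configuration.
n = 400
full_joint_rows = 2 ** n
print(len(str(full_joint_rows)))    # 121 -- a 121-digit number of rows

# Wet grass: f(C) needs 1 number, f(R|C) and f(S|C) need 2 each (one per
# value of C), and f(W|S,R) needs 4 (one per (S, R) combination).
factorized_params = 1 + 2 + 2 + 4
print(factorized_params)            # 9
```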
<br />
To solve the intractability problem we need to consider the way those relationships are represented in the graph. Let us define the following parameters. For each vertex <math>i \in V</math>,<br />
<br />
* <math>\pi_i</math>: the set of parents of <math>i</math> <br />
** e.g. <math>\pi_R = C</math> (the parent of <math>R</math> is <math>C</math>) <br />
* <math>f_i(x_i, x_{\pi_i})</math>: a local function of <math>x_i</math> and <math>x_{\pi_i}</math> (normalized over <math>x_i</math> only) for which it is true that:<br />
** <math>f_i</math> is nonnegative for all <math>i</math><br />
** <math>\displaystyle\sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math><br />
<br />
'''Claim''': The product <math> P(X_V) = \prod_{i=1}^n f_i(x_i, x_{\pi_i})</math> defines a valid family of probability functions: it is nonnegative, and<br />
<center><math><br />
\sum_{x_1}\sum_{x_2}\cdots\sum_{x_n} P(X_V) = 1<br />
</math></center><br />
<br />
To show the power of this claim, we can verify it for our wet grass example:<br />
<center><math>\begin{matrix}<br />
P(X_V) &=& P(C,R,S,W) \\<br />
&=& f(C) f(R,C) f(S,C) f(W,S,R)<br />
\end{matrix}</math></center><br />
<br />
We want to show that<br />
<center><math>\begin{matrix}<br />
\sum_C\sum_R\sum_S\sum_W P(C,R,S,W) & = &\\<br />
\sum_C\sum_R\sum_S\sum_W f(C) f(R,C)<br />
f(S,C) f(W,S,R) <br />
& = & 1.<br />
\end{matrix}</math></center><br />
<br />
Consider factors <math>f(C)</math>, <math>f(R,C)</math>, <math>f(S,C)</math>: they do not depend on <math>W</math>, so we<br />
can write this all as<br />
<center><math>\begin{matrix}<br />
& & \sum_C\sum_R\sum_S f(C) f(R,C) f(S,C) \cancelto{1}{\sum_W f(W,S,R)} \\<br />
& = & \sum_C\sum_R f(C) f(R,C) \cancelto{1}{\sum_S f(S,C)} \\<br />
& = & \cancelto{1}{\sum_C f(C)} \cancelto{1}{\sum_R f(R,C)} \\<br />
& = & 1<br />
\end{matrix}</math></center><br />
<br />
since we had already set <math>\displaystyle \sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math>.<br />
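The telescoping sum above is easy to check numerically. The sketch below builds random conditional tables for the wet grass network (the table entries are randomly generated, not taken from the text) and verifies that the product sums to 1:

```python
import itertools
import random

random.seed(0)

def cpt(n_parents):
    """Random conditional table: for each parent configuration,
    a distribution over {0, 1} for the child variable."""
    table = {}
    for cfg in itertools.product([0, 1], repeat=n_parents):
        p = random.random()
        table[cfg] = {0: p, 1: 1 - p}
    return table

fC = cpt(0)   # f(C)       = P(C)
fR = cpt(1)   # f(R, C)    = P(R | C)
fS = cpt(1)   # f(S, C)    = P(S | C)
fW = cpt(2)   # f(W, S, R) = P(W | S, R)

# Sum the product of local factors over all 16 joint assignments.
total = 0.0
for c, r, s, w in itertools.product([0, 1], repeat=4):
    total += fC[()][c] * fR[(c,)][r] * fS[(c,)][s] * fW[(s, r)][w]

print(total)  # -> 1.0 up to floating-point error
```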
<br />
Let us consider another example with a different directed graph. <br /><br />
'''Example:'''<br /><br />
Consider the simple directed graph in Figure 6.<br />
<br />
[[File:1234.png|thumb|right|Fig.6 Simple 4 node graph.]]<br />
<br />
Assume that we would like to calculate the following: <math> p(x_3|x_2) </math>. We know that we can write the joint probability as:<br />
<center><math> p(x_1,x_2,x_3,x_4) = f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \,\!</math></center><br />
<br />
We can also make use of Bayes' Rule here: <br />
<br />
<center><math>p(x_3|x_2) = \frac{p(x_2,x_3)}{ p(x_2)}</math></center><br />
<br />
<center><math>\begin{matrix}<br />
p(x_2,x_3) & = & \sum_{x_1} \sum_{x_4} p(x_1,x_2,x_3,x_4) ~~~~ \hbox{(marginalization)} \\<br />
& = & \sum_{x_1} \sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1) f(x_3,x_2) \cancelto{1}{\sum_{x_4}f(x_4,x_3)} \\<br />
& = & f(x_3,x_2) \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
We also need<br />
<center><math>\begin{matrix}<br />
p(x_2) & = & \sum_{x_1}\sum_{x_3}\sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2)<br />
f(x_4,x_3) \\<br />
& = & \sum_{x_1}\sum_{x_3} f(x_1) f(x_2,x_1) f(x_3,x_2) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
Thus,<br />
<center><math>\begin{matrix}<br />
p(x_3|x_2) & = & \frac{ f(x_3,x_2) \sum_{x_1} f(x_1)<br />
f(x_2,x_1)}{ \sum_{x_1} f(x_1) f(x_2,x_1)} \\<br />
& = & f(x_3,x_2).<br />
\end{matrix}</math></center><br />
<br />
'''Theorem 1.'''<br />
<center><math>f_i(x_i,x_{\pi_i}) = p(x_i|x_{\pi_i}).\,\!</math></center><br />
<center><math> \therefore \ P(X_V) = \prod_{i=1}^n p(x_i|x_{\pi_i})\,\!</math></center>.<br />
<br />
In our simple graph, the joint probability can be written as <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1)p(x_2|x_1) p(x_3|x_2) p(x_4|x_3).\,\!</math></center><br />
<br />
Instead, had we used the chain rule we would have obtained a far more complex equation: <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1) p(x_2|x_1)p(x_3|x_2,x_1) p(x_4|x_3,x_2,x_1).\,\!</math></center><br />
<br />
The ''Markov Property'', or ''Memoryless Property'', holds when a variable <math>X_i</math> is affected only by <math>X_j</math>, so that <math>X_i</math> given <math>X_j</math> is independent of every other earlier random variable. In our example, the history of <math>x_4</math> is completely determined by <math>x_3</math>. <br /><br />
By simply applying the Markov Property to the chain-rule formula we would also have obtained the same result.<br />
<br />
Now let us consider the joint probability of the following six-node example found in Figure 7.<br />
<br />
[[File:ClassicExample1.png|thumb|right|Fig.7 Six node example.]]<br />
<br />
If we use Theorem 1 it can be seen that the joint probability density function for Figure 7 can be written as follows: <br />
<center><math> P(X_1,X_2,X_3,X_4,X_5,X_6) = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) \,\!</math></center><br />
<br />
Once again, we can apply the Chain Rule and then the Markov Property and arrive at the same result.<br />
<br />
<center><math>\begin{matrix}<br />
&& P(X_1,X_2,X_3,X_4,X_5,X_6) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_2,X_1)P(X_4|X_3,X_2,X_1)P(X_5|X_4,X_3,X_2,X_1)P(X_6|X_5,X_4,X_3,X_2,X_1) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) <br />
\end{matrix}</math></center><br />
<br />
===Independence=== <br />
<br />
====Marginal independence====<br />
We can say that <math>X_A</math> is marginally independent of <math>X_B</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B : & & \\<br />
P(X_A,X_B) & = & P(X_A)P(X_B) \\<br />
P(X_A|X_B) & = & P(X_A) <br />
\end{matrix}</math></center><br />
<br />
====Conditional independence====<br />
We can say that <math>X_A</math> is conditionally independent of <math>X_B</math> given <math>X_C</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B | X_C : & & \\<br />
P(X_A,X_B | X_C) & = & P(X_A|X_C)P(X_B|X_C) \\<br />
P(X_A|X_B,X_C) & = & P(X_A|X_C) <br />
\end{matrix}</math></center><br />
Note: Both equations are equivalent.<br />
'''Aside:''' Before we move on further, we first define the following terms:<br />
# <math>I</math> is defined as an ordering of the nodes in the graph.<br />
# For each <math>i \in V</math>, <math>V_i</math> is defined as the set of all nodes that appear earlier than <math>i</math> in <math>I</math>, excluding its parents <math>\pi_i</math>.<br />
<br />
Let us consider the example of the six node figure given above (Figure 7). We can define <math>I</math> as follows: <br />
<center><math>I = \{1,2,3,4,5,6\} \,\!</math></center><br />
We can then easily compute <math>V_i</math> for say <math>i=3,6</math>. <br /><br />
<center><math> V_3 = \{2\}, V_6 = \{1,3,4\}\,\!</math></center> <br />
while <math>\pi_i</math> for <math> i=3,6</math> will be <br /><br />
<center><math> \pi_3 = \{1\}, \pi_6 = \{2,5\}\,\!</math></center> <br />
<br />
We are interested in finding the conditional independences between random variables in this graph. We know <math>X_i \perp X_{V_i} | X_{\pi_i}</math> for each <math>i</math>; in other words, given its parents, each node is independent of all earlier non-parent nodes. So:<br />
<math>X_1 \perp \phi | \phi</math>, <br /><br />
<math>X_2 \perp \phi | X_1</math>, <br /><br />
<math>X_3 \perp X_2 | X_1</math>, <br /><br />
<math>X_4 \perp \{X_1,X_3\} | X_2</math>, <br /><br />
<math>X_5 \perp \{X_1,X_2,X_4\} | X_3</math>, <br /><br />
<math>X_6 \perp \{X_1,X_3,X_4\} | \{X_2,X_5\}</math> <br /><br />
To illustrate why this is true, we can take a simple example: show that<br />
<center><math>P(X_4|X_1,X_2,X_3) = P(X_4|X_2)\,\!</math></center><br />
<br />
Proof: first, we know <br />
<math>P(X_1,X_2,X_3,X_4,X_5,X_6)<br />
= P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2)\,\!</math><br />
<br />
then<br />
<center><math>\begin{matrix}<br />
P(X_4|X_1,X_2,X_3) & = & \frac{P(X_1,X_2,X_3,X_4)}{P(X_1,X_2,X_3)}\\<br />
& = & \frac{ \sum_{X_5} \sum_{X_6} P(X_1,X_2,X_3,X_4,X_5,X_6)}{ \sum_{X_4} \sum_{X_5} \sum_{X_6}P(X_1,X_2,X_3,X_4,X_5,X_6)}\\<br />
& = & \frac{P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)}{P(X_1)P(X_2|X_1)P(X_3|X_1)}\\<br />
& = & P(X_4|X_2)<br />
\end{matrix}</math></center><br />
<br />
The other conditional independences can be proven through a similar process.<br />
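The cancellation in the proof can also be checked by brute force. The sketch below builds the six-node network with randomly generated conditional tables (the probabilities are arbitrary) and compares both sides numerically:

```python
import itertools
import random

random.seed(0)

def rand_dist():
    p = random.random()
    return {0: p, 1: 1 - p}

p1 = rand_dist()
p2 = {a: rand_dist() for a in (0, 1)}                       # P(X2 | X1)
p3 = {a: rand_dist() for a in (0, 1)}                       # P(X3 | X1)
p4 = {a: rand_dist() for a in (0, 1)}                       # P(X4 | X2)
p5 = {a: rand_dist() for a in (0, 1)}                       # P(X5 | X3)
p6 = {(a, b): rand_dist() for a in (0, 1) for b in (0, 1)}  # P(X6 | X5, X2)

def joint(x):
    x1, x2, x3, x4, x5, x6 = x
    return (p1[x1] * p2[x1][x2] * p3[x1][x3] *
            p4[x2][x4] * p5[x3][x5] * p6[(x5, x2)][x6])

def marg(fixed):
    """Sum the joint over all variables not pinned in `fixed`
    (a dict mapping 0-based variable index to its value)."""
    return sum(joint(x) for x in itertools.product([0, 1], repeat=6)
               if all(x[i] == v for i, v in fixed.items()))

x1, x2, x3, x4 = 1, 0, 1, 1
lhs = marg({0: x1, 1: x2, 2: x3, 3: x4}) / marg({0: x1, 1: x2, 2: x3})
rhs = marg({1: x2, 3: x4}) / marg({1: x2})
print(abs(lhs - rhs))  # P(X4|X1,X2,X3) == P(X4|X2), so -> ~0.0
```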
<br />
====Sampling====<br />
Even though graphical models greatly facilitate obtaining the joint probability, exact inference is not always feasible. Exact inference is feasible only in small to medium-sized networks; in large networks it takes far too long. Therefore, we resort to approximate inference techniques, which are much faster and usually give quite good results.<br />
<br />
In sampling, random samples are generated, and the values of interest are computed from those samples rather than from the original distribution.<br />
<br />
As input we have a Bayesian network with a set of nodes <math>X\,\!</math>. A sample may include all variables (except the evidence <math>E</math>) or a subset of them. Sampling schemas dictate how to generate the samples (tuples). Ideally, samples are distributed according to <math>P(X|E)\,\!</math>.<br />
<br />
Some sampling algorithms:<br />
* Forward Sampling<br />
* Likelihood weighting<br />
* Gibbs Sampling (MCMC)<br />
** Blocking<br />
** Rao-Blackwellised<br />
* Importance Sampling<br />
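As a rough sketch of the first of these, forward (ancestral) sampling for the wet grass network might look as follows. All conditional probabilities below are invented for illustration, and evidence is handled by simple rejection:

```python
import random

random.seed(0)

# Sample each variable in topological order, given its parents.
def sample_once():
    c = random.random() < 0.5                            # Cloudy
    r = random.random() < (0.8 if c else 0.1)            # Rain | Cloudy
    s = random.random() < (0.1 if c else 0.5)            # Sprinkler | Cloudy
    w = random.random() < (0.99 if (r or s) else 0.01)   # Wet | Rain, Sprinkler
    return c, r, s, w

samples = [sample_once() for _ in range(50_000)]

# Estimate P(Rain = 1 | WetGrass = 1) by rejecting samples with dry grass.
wet = [s for s in samples if s[3]]
est = sum(s[1] for s in wet) / len(wet)
print(round(est, 2))
```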
<br />
==Bayes Ball== <br />
The Bayes Ball algorithm can be used to determine whether two random variables represented in a graph are independent. The algorithm can show either that two nodes in a graph are independent, or that they are not necessarily independent; it cannot show that two nodes are dependent. In other words, it provides a set of rules that let us answer such questions using the graph alone, without needing the probability distributions. The algorithm is discussed further in later parts of this section. <br />
<br />
===Canonical Graphs===<br />
In order to understand the Bayes Ball algorithm we need to first introduce 3 canonical graphs. Since our graphs are acyclic, we can represent them using these 3 canonical graphs. <br />
<br />
====Markov Chain (also called serial connection)====<br />
In the following graph (Figure 8), X is independent of Z given Y. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Markov.png|thumb|right|Fig.8 Markov chain.]]<br />
<br />
We can prove this independence: <br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\ <br />
& = & \frac{P(X)P(Y|X)P(Z|Y)}{P(X)P(Y|X)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
Where<br />
<br />
<center><math>\begin{matrix}<br />
P(X,Y) & = & \displaystyle \sum_Z P(X,Y,Z) \\<br />
& = & \displaystyle \sum_Z P(X)P(Y|X)P(Z|Y) \\<br />
& = & P(X)P(Y | X) \displaystyle \sum_Z P(Z|Y) \\<br />
& = & P(X)P(Y | X)\\<br />
\end{matrix}</math></center><br />
<br />
Markov chains are an important class of distributions with applications in communications, information theory and image processing. They are suitable for modelling memory in a phenomenon. For example, suppose we want to study the frequency of appearance of English letters in a text. Most likely, when "q" appears, the next letter will be "u"; this shows a dependency between these letters. Markov chains are a suitable model for this kind of relation. <br />
[[File:Markovexample.png|thumb|right|Fig.8a Example of a Markov chain.]]<br />
Markov chains play a significant role in biological applications. They are widely used in the study of carcinogenesis (the initiation of cancer formation): a gene has to undergo several mutations before it becomes cancerous, and this process can be modelled with a Markov chain. An example is given in Figure 8a, which shows only two gene mutations.<br />
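A two-state version of this mutation picture can be simulated directly. The per-generation mutation probability below is invented for illustration, and the mutation is treated as irreversible:

```python
import random

random.seed(0)

# State 0 = normal gene, state 1 = mutated gene.
P_MUTATE = 0.05   # illustrative per-generation mutation probability

state, generations = 0, 0
while state == 0:
    generations += 1
    if random.random() < P_MUTATE:
        state = 1   # once mutated, the chain stays mutated

print("generations until mutation:", generations)
# The waiting time is geometric, with expected value 1 / 0.05 = 20.
```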
<br />
====Hidden Cause (diverging connection)====<br />
In the Hidden Cause case we can say that X is independent of Z given Y. In this case Y is the hidden cause and if it is known then Z and X are considered independent. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Hidden.png|thumb|right|Fig.9 Hidden cause graph.]]<br />
<br />
The proof of the independence: <br />
<br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\<br />
& = & \frac{P(Y)P(X|Y)P(Z|Y)}{P(Y)P(X|Y)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
The Hidden Cause case is best illustrated with an example: <br /><br />
<br />
[[File:plot44.png|thumb|right|Fig.10 Hidden cause example.]]<br />
<br />
In Figure 10 it can be seen that both "Shoe size" and "Grey hair" depend on the age of a person. Without "Age" in the picture, "Shoe size" and "Grey hair" are dependent in some sense: we would have to conclude that those with a large shoe size also have a greater chance of having grey hair. However, when "Age" is observed, there is no dependence between "Shoe size" and "Grey hair", because we can deduce both from the "Age" variable alone.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Finally, we look at the third type of canonical graph: ''Explaining-Away Graphs''. This type of graph arises when a phenomenon has multiple explanations. Here, the conditional independence statement is actually a statement of marginal independence: <math>X \perp Z</math>. This type of graph is also called a "V-structure" or "V-shape" because of its illustration (Fig. 11). <br />
<br />
[[File:ExplainingAway.png|thumb|right|Fig.11 The missing edge between node X and node Z implies that<br />
there is a marginal independence between the two: <math>X \perp Z</math>.]]<br />
<br />
In these types of scenarios, variables X and Z are independent.<br />
However, once the third variable Y is observed, X and Z become<br />
dependent (Fig. 11).<br />
<br />
To clarify these concepts, suppose Bob and Mary are supposed to<br />
meet for a noontime lunch. Consider the following events:<br />
<br />
<center><math><br />
late =\begin{cases}<br />
1, & \hbox{if Mary is late}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
aliens =\begin{cases}<br />
1, & \hbox{if aliens kidnapped Mary}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
watch =\begin{cases}<br />
1, & \hbox{if Bob's watch is incorrect}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
If Mary is late, then she could have been kidnapped by aliens.<br />
Alternatively, Bob may have forgotten to adjust his watch for<br />
daylight savings time, making him early. Clearly, both of these<br />
events are independent. Now, consider the following<br />
probabilities:<br />
<br />
<center><math>\begin{matrix}<br />
P( late = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1, watch = 0 )<br />
\end{matrix}</math></center><br />
<br />
We expect <math>P( aliens = 1 ) < P( aliens = 1 ~|~ late = 1 )</math>, since learning that Mary is late raises the probability that she was kidnapped. Similarly, we expect <math>P( aliens = 1 ~|~ late = 1 ) < P( aliens = 1 ~|~ late = 1, watch = 0 )</math>: if Bob's watch is correct, the wrong-watch explanation is ruled out, so the alien explanation becomes more likely. Since <math>P( aliens = 1 ~|~ late = 1 ) \neq P( aliens = 1 ~|~ late = 1, watch = 0 )</math>, ''aliens'' and ''watch'' are not independent given ''late''. To summarize,<br />
* If we do not observe ''late'', then ''aliens'' <math>~\perp~ watch</math> (<math>X~\perp~ Z</math>)<br />
* If we do observe ''late'', then ''aliens'' <math> ~\cancel{\perp}~ watch ~|~ late</math> (<math>X ~\cancel{\perp}~ Z ~|~ Y</math>)<br />
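The ordering of those posteriors can be checked with a tiny enumeration. All of the numbers below are made up; only the qualitative behaviour (explaining away) matters:

```python
import itertools

P_a = 0.001   # P(aliens = 1)
P_w = 0.05    # P(watch wrong = 1), independent of aliens a priori

def p_late(a, w):
    # Mary appears late if she was kidnapped or Bob's watch is wrong.
    return 0.9 if (a or w) else 0.02

def p_aliens_given(watch=None):
    """P(aliens = 1 | late = 1 [, watch]) by brute-force enumeration."""
    num = den = 0.0
    for a, w in itertools.product([0, 1], repeat=2):
        if watch is not None and w != watch:
            continue
        p = (P_a if a else 1 - P_a) * (P_w if w else 1 - P_w) * p_late(a, w)
        den += p
        if a:
            num += p
    return num / den

p1 = p_aliens_given()          # P(aliens | late)
p2 = p_aliens_given(watch=0)   # P(aliens | late, watch correct)
print(P_a, round(p1, 4), round(p2, 4))
```

Observing that the watch is correct removes one explanation for the lateness, so the alien posterior rises: `P_a < p1 < p2`.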
<br />
===Bayes Ball Algorithm===<br />
<br />
'''Goal:''' We wish to determine whether a given conditional<br />
statement such as <math>X_{A} ~\perp~ X_{B} ~|~ X_{C}</math> is true given a directed graph.<br />
<br />
The algorithm is as follows:<br />
<br />
# Shade nodes, <math>~X_{C}~</math>, that are conditioned on, i.e. they have been observed.<br />
# Assuming that the initial position of the ball is <math>~X_{A}~</math>: <br />
# If the ball cannot reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> must be conditionally independent.<br />
# If the ball can reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> are not necessarily independent.<br />
<br />
The biggest challenge in the ''Bayes Ball Algorithm'' is to<br />
determine what happens to a ball going from node X to node Z as it<br />
passes through node Y. The ball could continue its route to Z or<br />
it could be blocked. It is important to note that the balls are<br />
allowed to travel in any direction, independent of the direction<br />
of the edges in the graph.<br />
<br />
We use the canonical graphs previously studied to determine the<br />
route of a ball traveling through a graph. Using these three<br />
graphs, we establish the Bayes ball rules which can be extended for more<br />
graphical models.<br />
<br />
====Markov Chain (serial connection)====<br />
[[File:BB_Markov.png|thumb|right|Fig.12 (a) When the middle node is shaded, the ball is blocked. (b) When the middle ball is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling from X to Z or from Z to X will be blocked at<br />
node Y if this node is shaded. Alternatively, if Y is unshaded,<br />
the ball will pass through.<br />
<br />
In (Fig. 12(a)), X and Z are conditionally<br />
independent ( <math>X ~\perp~ Z ~|~ Y</math> ) while in<br />
(Fig.12(b)) X and Z are not necessarily<br />
independent.<br />
<br />
====Hidden Cause (diverging connection)====<br />
[[File:BB_Hidden.png|thumb|right|Fig.13 (a) When the middle node is shaded, the ball is blocked. (b) When the middle ball is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling through Y will be blocked at Y if it is shaded.<br />
If Y is unshaded, then the ball passes through.<br />
<br />
(Fig. 13(a)) demonstrates that X and Z are<br />
conditionally independent when Y is shaded.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Unlike the last two cases, in which the Bayes ball rule was intuitively understandable, in this case a ball traveling through Y is blocked when Y is ''unshaded''. If Y is shaded, then the ball passes through. Hence, X and Z are conditionally independent when Y is unshaded.<br />
<br />
[[File:BB_ExplainingAway.png|thumb|right|Fig.14 (a) When the middle node is shaded, the ball passes through Y. (b) When the middle ball is unshaded, the ball is blocked.]]<br />
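The three rules above can be coded directly as a reachability search. The sketch below is one possible implementation of the Bayes ball test (including the conventional "bounce" behaviour at shaded colliders and at unshaded nodes reached from a child); the parents-dictionary encoding is our own convention, and it is checked against the six-node graph of Fig. 7:

```python
from collections import deque

def bayes_ball_reachable(parents, source, observed):
    """Nodes reachable from `source` under the Bayes ball rules.

    `parents` maps each node to a list of its parents.  Any node not
    returned is conditionally independent of `source` given `observed`.
    """
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    # Track (node, came_from) pairs; came_from is "child" or "parent".
    queue = deque([(source, "child")])  # pretend the ball starts from below
    visited, reached = set(), set()
    while queue:
        node, came_from = queue.popleft()
        if (node, came_from) in visited:
            continue
        visited.add((node, came_from))
        if node not in observed:
            reached.add(node)
        if came_from == "child":
            if node not in observed:
                # unshaded: pass up to the parents and bounce to the children
                queue.extend((p, "child") for p in parents[node])
                queue.extend((c, "parent") for c in children[node])
            # shaded: the ball is blocked
        else:  # came from a parent
            if node in observed:
                # shaded collider: bounce back up to the parents
                queue.extend((p, "child") for p in parents[node])
            else:
                # unshaded: pass straight through to the children
                queue.extend((c, "parent") for c in children[node])

    reached.discard(source)
    return reached

# Six-node graph of Fig. 7: parents of each node.
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

# X4 _|_ {X1, X3} | X2: with X2 shaded, the ball from X4 reaches neither.
print(bayes_ball_reachable(parents, 4, {2}))

# X2 _|_ X3 | {X1, X6} fails: shading X6 opens the v-structure at X6.
print(bayes_ball_reachable(parents, 2, {1, 6}))
```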
<br />
===Bayes Ball Examples===<br />
====Example 1==== <br />
In this first example, we wish to identify the behaviour of leaves in graphical models using two-node graphs. Let a ball go from X to Y in a two-node graph. To employ the Bayes ball method described above, we have to implicitly add one extra node to the two-node structure, since the Bayes ball rules were introduced for three-node configurations. We add the third node exactly symmetric to node X with respect to node Y. For example, in Fig. 15(a) we can imagine a hidden node to the right of node Y, with a hidden arrow from the hidden node to Y. Then, utilizing the Bayes ball method, a ball thrown from X cannot reach Y, and thus it is blocked. On the contrary, following the same rule in Fig. 15(b), if there were a hidden node to the right of Y, a ball could pass from X to that hidden node according to the explaining-away structure. Of course, there is no real node, and in this case we conventionally say that the ball is bounced back to node X. <br />
<br />
[[File:TwoNodesExample.png|thumb|right|Fig.15 (a)The ball is blocked at Y. (b)The ball passes through Y. (c)The ball passes through Y. (d) The ball is blocked at Y.]]<br />
<br />
Finally, for the last two graphs, we used the rules of the ''Hidden Cause Canonical Graph'' (Fig. 13). In (c), the ball passes through<br />
Y while in (d), the ball is blocked at Y.<br />
<br />
====Example 2====<br />
Suppose your home is equipped with an alarm system. There are two<br />
possible causes for the alarm to ring:<br />
* Your house is being burglarized<br />
* There is an earthquake<br />
<br />
Hence, we define the following events:<br />
<br />
<center><math><br />
burglary =\begin{cases}<br />
1, & \hbox{if your house is being burglarized}, \\<br />
0, & \hbox{if your house is not being burglarized}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
earthquake =\begin{cases}<br />
1, & \hbox{if there is an earthquake}, \\<br />
0, & \hbox{if there is no earthquake}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
alarm =\begin{cases}<br />
1, & \hbox{if your alarm is ringing}, \\<br />
0, & \hbox{if your alarm is off}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
report =\begin{cases}<br />
1, & \hbox{if a police report has been written}, \\<br />
0, & \hbox{if no police report has been written}.<br />
\end{cases}<br />
</math></center><br />
<br />
<br />
The ''burglary'' and ''earthquake'' events are independent<br />
if the alarm does not ring. However, if the alarm does ring, then<br />
the ''burglary'' and the ''earthquake'' events are not<br />
necessarily independent. Also, if the alarm rings, then it is more likely that a police report will be issued.<br />
<br />
We can use the ''Bayes Ball Algorithm'' to deduce conditional independence properties from the graph. First, consider (Fig. 16(a)) and assume we are trying to determine whether there is conditional independence between the ''burglary'' and ''earthquake'' events. In (Fig. 16(a)), a ball starting at the ''burglary'' event is blocked at the ''alarm'' node.<br />
<br />
[[File:AlarmExample1.PNG|thumb|right|Fig.16 If we only consider the events ''burglary'', ''earthquake'', and ''alarm'', we find that a ball traveling from ''burglary'' to ''earthquake'' would be blocked at the ''alarm'' node. However, if we also consider the ''report'' node, we can find a path between ''burglary'' and ''earthquake''.]]<br />
<br />
Nonetheless, this does not prove that the ''burglary'' and ''earthquake'' events are independent. Indeed, (Fig. 16(b)) disproves this, as we have found an alternate path from ''burglary'' to ''earthquake'' passing through ''report''. It follows that <math>burglary ~\cancel{\amalg}~ earthquake ~|~ report</math>.<br />
<br />
====Example 3====<br />
<br />
Referring to (Fig. 17), we wish to determine whether the following conditional independence statements are true:<br />
<br />
<center><math>\begin{matrix}<br />
X_{1} ~\amalg~ X_{3} ~|~ X_{2} \\<br />
X_{1} ~\amalg~ X_{5} ~|~ \{X_{3},X_{4}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:LineExample1.png|thumb|right|Fig.17 Simple Markov Chain graph.]]<br />
<br />
To determine whether the first statement is true, we shade node <math>X_{2}</math>. This blocks balls traveling from <math>X_{1}</math> to <math>X_{3}</math> and proves that the first statement is valid.<br />
<br />
After shading nodes <math>X_{3}</math> and <math>X_{4}</math> and applying the ''Bayes Ball Algorithm'', we find that a ball travelling from <math>X_{1}</math> to <math>X_{5}</math> is blocked at <math>X_{3}</math>. Similarly, a ball going from <math>X_{5}</math> to <math>X_{1}</math> is blocked at <math>X_{4}</math>. This proves that the second statement also holds.<br />
<br />
====Example 4====<br />
[[File:ClassicExample1.png|thumb|right|Fig.18 Directed graph.]]<br />
<br />
Consider figure (Fig. 18). Using the ''Bayes Ball Algorithm'' we wish to determine if each of the following<br />
statements are valid:<br />
<br />
<center><math>\begin{matrix}<br />
X_{4} ~\amalg~ \{X_{1},X_{3}\} ~|~ X_{2} \\<br />
X_{1} ~\amalg~ X_{6} ~|~ \{X_{2},X_{3}\} \\<br />
X_{2} ~\amalg~ X_{3} ~|~ \{X_{1},X_{6}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:ClassicExample2.PNG|thumb|right|Fig.19 (a) A ball cannot pass through <math>X_{2}</math> or <math>X_{6}</math>. (b) A ball cannot pass through <math>X_{2}</math> or <math>X_{3}</math>. (c) A ball can pass from <math>X_{2}</math> to <math>X_{3}</math>.]]<br />
<br />
To check the first statement, we look for a path from <math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math> when <math>X_{2}</math> is shaded (refer to Fig. 19(a)). Since there is no route from <math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math>, we conclude that the first statement is true.<br />
<br />
Similarly, we can show that there does not exist a path between <math>X_{1}</math> and <math>X_{6}</math> when <math>X_{2}</math> and <math>X_{3}</math> are shaded (refer to Fig. 19(b)). Hence, the second statement is true.<br />
<br />
Finally, (Fig. 19(c)) shows that there is a route from <math>X_{2}</math> to <math>X_{3}</math> when <math>X_{1}</math> and <math>X_{6}</math> are shaded. This proves that the third statement is false.<br />
<br />
'''Theorem 2.'''<br /><br />
Let <math>D_{1}</math> be the family of distributions that factorize according to the graph as a product of local conditional probabilities: <math>D_{1} = \{ p(x_{v}) : p(x_{v}) = \prod_{i=1}^{n}{p(x_{i} ~|~ x_{\pi_{i}})}\}</math>. <br /><br />
Let <math>D_{2} = \{ p(x_{v}):</math> <math>p(x_{v})</math> satisfies all conditional independence statements associated with the graph <math>\}</math>.<br /><br />
Then <math>D_{1} = D_{2}</math>.<br />
<br />
====Example 5====<br />
<br />
Given the following Bayesian network (Fig. 19): determine whether the following statements are true or false.<br />
<br />
a.) <math>X_4 \perp \{X_1,X_3\} ~|~ X_2</math> <br />
<br />
Ans. True<br />
<br />
b.) <math>X_1 \perp X_6 ~|~ \{X_2,X_3\}</math> <br />
<br />
Ans. True<br />
<br />
c.) <math>X_2 \perp X_3 ~|~ \{X_1,X_6\}</math> <br />
<br />
Ans. False<br />
<br />
== Undirected Graphical Model ==<br />
<br />
Generally, graphical models are divided into two major classes: directed graphs and undirected graphs. Directed graphs and their characteristics were described previously. In this section we discuss undirected graphical models, which are also known as Markov random fields. In some applications there are relations between variables, but these relations are bilateral and involve no causality. For example, consider a natural image: the value of a pixel is correlated with neighbouring pixel values, but this relation is bilateral, not causal. Markov random fields are suitable for modelling such processes and have found applications in fields such as vision and image processing. We define an undirected graphical model with a graph <math> G = (V, E)</math>, where <math> V </math> is a set of vertices corresponding to a set of random variables and <math> E </math> is a set of undirected edges, as shown in (Fig. 20).<br />
<br />
==== Conditional independence ====<br />
<br />
For directed graphs, the Bayes ball method was defined to determine the conditional independence properties of a given graph. We can also employ the Bayes ball algorithm to examine the conditional independence of undirected graphs. Here the Bayes ball rule is simpler and more intuitive. Considering (Fig. 21), a ball can be thrown either from x to z or from z to x if y is not observed. In other words, if y is not observed, a ball thrown from x can reach z and vice versa. On the contrary, given a shaded y, the node blocks the ball and makes x and z conditionally independent. With this definition, one can state that in an undirected graph, a node is conditionally independent of its non-neighbours given its neighbours. Technically speaking, <math>X_A</math> is independent of <math>X_C</math> given <math>X_B</math> if the set of nodes <math>X_B</math> separates the nodes <math>X_A</math> from the nodes <math>X_C</math>. Hence, if every path from a node in <math>X_A</math> to a node in <math>X_C</math> includes at least one node in <math>X_B</math>, then we claim that <math> X_A \perp X_C | X_B </math>.<br />
<br />
==== Question ====<br />
<br />
Is it possible to convert undirected models to directed models or vice versa?<br />
<br />
In order to answer this question, consider (Fig. 22), which illustrates an undirected graph with four nodes: <math>X</math>, <math>Y</math>, <math>Z</math> and <math>W</math>. We can derive two facts using the Bayes ball method:<br />
<br />
<center><math>\begin{matrix}<br />
X \perp Y | \{W,Z\} & & \\<br />
W \perp Z | \{X,Y\} \\<br />
\end{matrix}</math></center><br />
<br />
[[File:UnDirGraphUnconvert.png|thumb|right|Fig.22 There is no directed equivalent to this graph.]]<br />
<br />
It is simple to see that there is no directed graph satisfying both conditional independence properties. Recalling that directed graphs are acyclic, converting this undirected graph to a directed graph results in at least one node with two inward-pointing arrows (a v-structure). Without loss of generality, we can assume that node <math>Z</math> has two inward-pointing arrows. By the conditional independence semantics of directed graphs, we would then have <math> X \perp Y|W</math>, yet the <math>X \perp Y|\{W,Z\}</math> property would not hold. On the other hand, (Fig. 23) depicts a directed graph characterized by the singleton independence statement <math>X \perp Y </math>, and there is no undirected graph on three nodes that can be characterized by this singleton statement. Basically, if we consider the set of all distributions over <math>n</math> random variables, one subset can be represented by directed graphical models and another subset by undirected graphical models. There is a narrow intersection region between these two subsets, in which probabilistic graphical models may be represented by either directed or undirected graphs.<br />
<br />
[[File:DirGraphUnconvert.png|thumb|right|Fig.23 There is no undirected equivalent to this graph.]]<br />
<br />
==== Parameterization ====<br />
<br />
Having undirected graphical models, we would like to obtain a "local" parameterization as we did for directed graphical models. For directed graphical models, "local" had the interpretation of a node and its parents, <math> \{i, \pi_i\} </math>, and the joint probability and marginals were defined as products of such local probabilities, inspired by the chain rule of probability theory. In undirected graphical models, "local" functions cannot be represented using conditional probabilities, and we must abandon conditional probabilities altogether. The factors therefore no longer have a probabilistic interpretation, and we may choose the "local" functions arbitrarily. However, any "local" function for an undirected graphical model should satisfy the following condition: if <math> X_i </math> and <math> X_j </math> are not linked, they are conditionally independent given all other nodes, so the "local" functions should factorize the joint probability such that <math> X_i </math> and <math> X_j </math> are placed in different factors.<br />
<br />
It can be shown that defining local functions based only on a node and its corresponding edges (similar to directed graphical models) is not tractable, so we need to follow a different approach. Before defining the "local" functions, we introduce a new term from graph theory: the clique. A clique is a subset of fully connected nodes in a graph G: every node in the clique C is directly connected to every other node in C. A maximal clique is a clique such that if any other node from the graph G is added to it, the new set is no longer a clique. Considering the undirected graph shown in (Fig. 24), we can list all the cliques as follows:<br />
[[File:graph.png|thumb|right|Fig.24 Undirected graph]]<br />
<br />
- <math> \{X_1, X_3\} </math> <br />
- <math> \{X_1, X_2\} </math><br />
- <math> \{X_3, X_5\} </math><br />
- <math> \{X_2, X_4\} </math> <br />
- <math> \{X_5, X_6\} </math><br />
- <math> \{X_2, X_5\} </math><br />
- <math> \{X_2, X_5, X_6\} </math><br />
<br />
According to the definition, <math> \{X_2,X_5\} </math> is not a maximal clique since we can add one more node, <math> X_6 </math>, and still have a clique. Let C be the set of all maximal cliques in <math> G(V, E) </math>: <br />
<br />
<center><math><br />
C = \{c_1, c_2,..., c_n\}<br />
</math></center><br />
<br />
where in the aforementioned example <math> c_1 </math> would be <math> \{X_1, X_3\} </math>, and so on. We define the joint probability over all nodes as:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \psi_{c_i} (x_{c_i})<br />
</math></center><br />
<br />
where <math> \psi_{c_i} (x_{c_i})</math> is a function defined over each clique; it is not necessarily a probability. There are only two restrictions on this function: it must be non-negative and real-valued. Usually <math> \psi_{c_i} (x_{c_i})</math> is called a potential function. <math> Z </math> is the normalization factor, determined by:<br />
<br />
<center><math><br />
Z = \sum_{x_V} { \prod_{c_i \in C} \psi_{c_i} (x_{c_i})}<br />
</math></center> <br />
<br />
In fact, the normalization factor <math> Z </math> is often unimportant, since most of the time it cancels out during computation. For instance, to calculate the conditional probability <math> P(X_A | X_B) </math>, <math> Z </math> cancels between the numerator <math> P(X_A, X_B) </math> and the denominator <math> P(X_B) </math>.<br />
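As a concrete illustration, here is a minimal Python sketch (with arbitrary, made-up potential tables) of this clique parameterization for the six-node graph of Fig. 24: it builds the joint distribution from the maximal-clique potentials, computes <math>Z</math> by brute force, and confirms that <math>Z</math> cancels in a conditional probability:<br />

```python
import itertools
import random

random.seed(0)

# Maximal cliques of the six-node graph of Fig. 24 (0-based: X1..X6 -> 0..5),
# each with an arbitrary non-negative potential table (made-up numbers).
cliques = [(0, 2), (0, 1), (1, 3), (2, 4), (1, 4, 5)]
potentials = [{cfg: random.uniform(0.1, 2.0)
               for cfg in itertools.product([0, 1], repeat=len(c))}
              for c in cliques]

def unnormalized(x):
    """Product of clique potentials for a full binary configuration x."""
    p = 1.0
    for clique, psi in zip(cliques, potentials):
        p *= psi[tuple(x[i] for i in clique)]
    return p

# Normalization factor Z: sum of the product over all 2^6 configurations.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=6))

def joint(x):
    return unnormalized(x) / Z

# Z cancels in conditionals: p(X1=1 | X6=1) from unnormalized values alone.
num = sum(unnormalized((1,) + r + (1,)) for r in itertools.product([0, 1], repeat=4))
den = sum(unnormalized((a,) + r + (1,))
          for a in [0, 1] for r in itertools.product([0, 1], repeat=4))
p_cond = num / den  # identical to the ratio of normalized probabilities
```

Nothing here depends on the particular numbers; only non-negativity of the potentials is required.<br />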
<br />
As was mentioned above, the normalized product of the potential functions determines the joint probability over all nodes. Because the potential functions may be chosen arbitrarily, assuming exponential functions for <math> \psi_{c_i} (x_{c_i})</math> simplifies the computations. Let the potential function be:<br />
<br />
<br />
<center><math><br />
\psi_{c_i} (x_{c_i}) = \exp (- H_{c_i}(x_{c_i}))<br />
</math></center> <br />
<br />
the joint probability is given by:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \exp(-H_{c_i}(x_{c_i})) = \frac{1}{Z} \exp \left(- \sum_{c_i \in C} {H_{c_i} (x_{c_i})}\right)<br />
</math></center> <br />
<br />
There is a lot of information contained in the joint probability distribution <math> P(x_{V}) </math>. We define six tasks, listed below, that we would like to accomplish with various algorithms for a given distribution <math> P(x_{V}) </math>.<br />
<br />
===Tasks:===<br />
<br />
* Marginalization <br /><br />
Given <math> P(x_{V}) </math> find <math> P(x_{A}) </math> where A &sub; V<br /><br />
Given <math> P(x_1, x_2, ... , x_6) </math> find <math> P(x_2, x_6) </math> <br />
* Conditioning <br /><br />
Given <math> P(x_V) </math> find <math>P(x_A|x_B) = \frac{P(x_A, x_B)}{P(x_B)}</math> if A &sub; V and B &sub; V .<br />
* Evaluation <br /><br />
Evaluate the probability for a certain configuration. <br />
* Completion <br /><br />
Compute the most probable configuration; in other words, find the configuration of <math> x_A </math> for which <math> P(x_A|x_B) </math> is largest, for specific <math> A </math> and <math> B </math>.<br />
* Simulation <br /><br />
Generate a random configuration for <math> P(x_V) </math> .<br />
* Learning <br /><br />
We would like to find parameters for <math> P(x_V) </math> .<br />
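On a small discrete distribution, most of these tasks can be carried out by brute force; the sketch below (toy numbers, three binary variables) is only meant to make the definitions concrete:<br />

```python
import itertools
import random

random.seed(1)

# A toy joint p(x1, x2, x3) over binary variables, stored as a table.
states = list(itertools.product([0, 1], repeat=3))
weights = [random.uniform(0.1, 1.0) for _ in states]
total = sum(weights)
p = {s: w / total for s, w in zip(states, weights)}

# Marginalization: p(x2) by summing out x1 and x3.
p_x2 = {v: sum(pr for s, pr in p.items() if s[1] == v) for v in [0, 1]}

# Conditioning: p(x1 | x3 = 1) = p(x1, x3 = 1) / p(x3 = 1).
p_x3_is_1 = sum(pr for s, pr in p.items() if s[2] == 1)
p_x1_given = {v: sum(pr for s, pr in p.items() if s[0] == v and s[2] == 1) / p_x3_is_1
              for v in [0, 1]}

# Evaluation: the probability of one particular configuration.
p_eval = p[(1, 0, 1)]

# Completion: the most probable configuration.
x_star = max(p, key=p.get)

# Simulation: draw one random configuration from p.
sample = random.choices(states, weights=[p[s] for s in states])[0]
```

(Learning, the sixth task, needs data and is covered in the Parameter Learning section; the point of the exact algorithms below is to avoid the exponential-size table used here.)<br />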
<br />
===Exact Algorithms:===<br />
<br />
To compute a probabilistic inference, i.e. the conditional probability of a variable <math>X</math>, we need to marginalize over all the other random variables <math>X_i</math> and their possible values, which may take a very long time. To reduce the computational complexity of performing such marginalization, the next sections present exact inference algorithms, which find the exact solution efficiently (in polynomial time for suitable graph structures):<br />
* Elimination<br />
* Sum-Product<br />
* Max-Product<br />
* Junction Tree<br />
<br />
= Elimination Algorithm=<br />
In this section we will see how we could overcome the problem of probabilistic inference on graphical models. In other words, we discuss the problem of computing conditional and marginal probabilities in graphical models.<br />
<br />
== Elimination Algorithm on Directed Graphs==<br />
First we assume that E and F are disjoint subsets of the node indices of a graphical model, i.e. <math> X_E </math> and <math> X_F </math> are disjoint subsets of the random variables. Given a graph G =(V,''E''), we aim to calculate <math> p(x_F | x_E) </math>, where <math> X_E </math> and <math> X_F </math> represent the evidence and query nodes, respectively. In this section <math> X_F </math> is restricted to a single node; later on a more powerful inference method will be introduced which can perform inference on multiple variables. In order to compute <math> p(x_F | x_E) </math> we first marginalize the joint probability over the nodes which are neither in <math> F </math> nor in <math> E </math>, denoted by <math> R = V \setminus (E \cup F)</math>. <br />
<br />
<center><math><br />
p(x_E, x_F) = \sum_{x_R} {p(x_E, x_F, x_R)}<br />
</math></center><br />
<br />
which can be further marginalized to yield <math> p(x_E) </math>:<br />
<br />
<center><math><br />
p(x_E) = \sum_{x_F} {p(x_E, x_F)}<br />
</math></center><br />
<br />
and then the desired conditional probability is given by:<br />
<br />
<center><math><br />
p(x_F|x_E) = \frac{p(x_E, x_F)}{p(x_E)} <br />
</math></center><br />
<br />
== Example ==<br />
<br />
Assume that we are interested in <math> p(x_1 | \bar{x_6}) </math> in (Fig. 21), where <math> \bar{x_6} </math> is an observation of <math> X_6 </math>, and thus we may treat it as a constant. According to the rule mentioned above we have to marginalize the joint probability over the non-evidence, non-query nodes:<br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &\sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} p(x_1)p(x_2|x_1)p(x_3|x_1)p(x_4|x_2)p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) \sum_{x_5} p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) m_5(x_2, x_3)<br />
\end{matrix}</math></center><br />
<br />
where to simplify the notation we define <math> m_5(x_2, x_3) </math> as the result of the last summation. The last summation is over <math> x_5 </math>, and thus the result depends only on <math> x_2 </math> and <math> x_3</math>. In general, let <math> m_i(x_{S_i}) </math> denote the expression that arises from performing the sum <math> \sum_{x_i} </math>, where <math> x_{S_i} </math> are the variables, other than <math> x_i </math>, that appear in the summand. Continuing the derivation we have: <br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1)m_5(x_2,x_3)\sum_{x_4} p(x_4|x_2)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)\sum_{x_3}p(x_3|x_1)m_5(x_2,x_3)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)m_3(x_1,x_2)\\<br />
& = & p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
<br />
Therefore, the conditional probability is given by:<br />
<center><math><br />
p(x_1|\bar{x_6}) = \frac{p(x_1)m_2(x_1)}{\sum_{x_1} p(x_1)m_2(x_1)}<br />
</math></center><br />
<br />
At the beginning of our computation we assumed that <math> X_6 </math> is observed, and the notation <math> \bar{x_6} </math> was used to express this fact. In general, let <math> X_i </math> be an evidence node whose observed value is <math> \bar{x_i} </math>; we define an evidence potential function <math> \delta(x_i, \bar{x_i}) </math> whose value is one if <math> x_i = \bar{x_i} </math> and zero otherwise. <br />
This function allows us to use summation over <math> x_6 </math> yielding:<br />
<br />
<center><math><br />
m_6(x_2, x_5) = \sum_{x_6} p(x_6|x_2, x_5) \delta(x_6, \bar{x_6})<br />
</math></center><br />
<br />
We can define an algorithm to make inference on directed graphs using elimination techniques. <br />
Let E and F be an evidence set and a query node, respectively. We first choose an elimination ordering I such that F appears last in this ordering. The following figure shows the steps required to perform the elimination algorithm for probabilistic inference on directed graphs:<br />
<br />
<br />
<code><br />
ELIMINATE (G,E,F)<br/><br />
INITIALIZE (G,F)<br/><br />
EVIDENCE(E)<br/><br />
UPDATE(G)<br/><br />
<br />
NORMALIZE(F)<br/><br />
<br />
INITIALIZE(G,F)<br/><br />
Choose an ordering <math>I</math> such that <math>F</math> appears last <br/><br />
:'''For''' each node <math>X_i</math> in <math>V</math> <br/><br />
::Place <math>p(x_i|x_{\pi_i})</math> on the active list <br/><br />
<br />
:'''End'''<br/><br />
<br />
EVIDENCE(E)<br/><br />
:'''For''' each <math>i</math> in <math>E</math> <br/><br />
::Place <math>\delta(x_i,\overline{x_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
Update(G)<br/><br />
:''' For''' each <math>i</math> in <math>I</math> <br/><br />
::Find all potentials from the active list that reference <math>x_i</math> and remove them from the active list <br/><br />
::Let <math>\phi_i(x_{T_i})</math> denote the product of these potentials <br/><br />
::Let <math>m_i(x_{S_i})=\sum_{x_i}\phi_i(x_{T_i})</math> <br/><br />
::Place <math>m_i(x_{S_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
Normalize(F) <br/><br />
:<math> p(x_F|\overline{x_E})</math> &larr; <math>\phi_F(x_F)/\sum_{x_F}\phi_F(x_F)</math><br/><br />
<br />
</code><br />
<br />
'''Example:''' <br /><br />
For the graph <math>G =(V, E)</math> in figure 21, consider once again that node <math>x_1</math> is the query node and <math>x_6</math> is the evidence node. <br /><br />
<math>I = \left\{6,5,4,3,2,1\right\}</math> (1 should be the last node, ordering is crucial)<br /><br />
[[File:ClassicExample1.png|thumb|right|Fig.21 Six node example.]]<br />
We must now create an active list. There are two rules that must be followed in order to create this list. <br />
<br />
# For i<math>\in{V}</math> place <math>p(x_i|x_{\pi_i})</math> in active list. <br />
# For i<math>\in</math>{E} place <math>\delta(x_i,\overline{x_i})</math> in active list. <br />
<br />
Here, our active list is:<br />
<math> p(x_1), p(x_2|x_1), p(x_3|x_1), p(x_4|x_2), p(x_5|x_3),\underbrace{p(x_6|x_2, x_5)\delta{(\overline{x_6},x_6)}}_{\phi_6(x_2,x_5, x_6), \sum_{x_6}{\phi_6}=m_{6}(x_2,x_5) }</math><br />
<br />
We first eliminate node <math>X_6</math>. We place <math>m_{6}(x_2,x_5)</math> on the active list, having removed <math>X_6</math>. We now eliminate <math>X_5</math>. <br />
<br />
<center><math> \underbrace{p(x_5|x_3)*m_6(x_2,x_5)}_{m_5(x_2,x_3)} </math></center><br />
<br />
Likewise, we can eliminate <math>X_4, X_3</math>, and <math>X_2</math> (which yields the unnormalized conditional probability <math>p(x_1,\overline{x_6})=p(x_1)m_2(x_1)</math>), and finally <math>X_1</math>, which yields <math>m_1 = \sum_{x_1}{\phi_1(x_1)}</math>, the normalization factor <math>p(\overline{x_6})</math>.<br />
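The same elimination run can be sketched in Python with randomly generated (purely illustrative) conditional probability tables for the six-node graph of figure 21; the messages follow the ordering <math>I</math> above, and the result is checked against brute-force marginalization:<br />

```python
import itertools
import random

random.seed(2)

def rand_cpt(n_parents):
    """A random conditional table p(x | parents) over binary variables."""
    table = {}
    for pa in itertools.product([0, 1], repeat=n_parents):
        a = random.random()
        table[pa] = {0: a, 1: 1 - a}
    return table

p1, p2_1, p3_1 = rand_cpt(0), rand_cpt(1), rand_cpt(1)     # p(x1), p(x2|x1), p(x3|x1)
p4_2, p5_3, p6_25 = rand_cpt(1), rand_cpt(1), rand_cpt(2)  # p(x4|x2), p(x5|x3), p(x6|x2,x5)

x6_bar = 1  # the observed value of X6

# Elimination in the ordering I = (6, 5, 4, 3, 2):
m6 = {(x2, x5): p6_25[(x2, x5)][x6_bar]                    # sum over x6 with delta
      for x2 in [0, 1] for x5 in [0, 1]}
m5 = {(x2, x3): sum(p5_3[(x3,)][x5] * m6[(x2, x5)] for x5 in [0, 1])
      for x2 in [0, 1] for x3 in [0, 1]}
m4 = {(x2,): sum(p4_2[(x2,)][x4] for x4 in [0, 1]) for x2 in [0, 1]}  # equals 1
m3 = {(x1, x2): sum(p3_1[(x1,)][x3] * m5[(x2, x3)] for x3 in [0, 1])
      for x1 in [0, 1] for x2 in [0, 1]}
m2 = {(x1,): sum(p2_1[(x1,)][x2] * m4[(x2,)] * m3[(x1, x2)] for x2 in [0, 1])
      for x1 in [0, 1]}

unnorm = {x1: p1[()][x1] * m2[(x1,)] for x1 in [0, 1]}  # p(x1, x6_bar)
Z = sum(unnorm.values())                                 # p(x6_bar)
p_x1_given_x6 = {x1: unnorm[x1] / Z for x1 in [0, 1]}

# Brute-force check against the full joint.
def joint(x1, x2, x3, x4, x5, x6):
    return (p1[()][x1] * p2_1[(x1,)][x2] * p3_1[(x1,)][x3]
            * p4_2[(x2,)][x4] * p5_3[(x3,)][x5] * p6_25[(x2, x5)][x6])

brute = {x1: sum(joint(x1, *rest, x6_bar)
                 for rest in itertools.product([0, 1], repeat=4))
         for x1 in [0, 1]}
tot = sum(brute.values())
brute = {x1: v / tot for x1, v in brute.items()}
```

The elimination version touches only small tables (the largest has two free variables), while the brute-force check sums over all <math>2^4</math> hidden configurations.<br />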
<br />
==Elimination Algorithm on Undirected Graphs==<br />
<br />
[[File:graph.png|thumb|right|Fig.22 Undirected graph G']]<br />
<br />
The first task is to find the maximal cliques and their associated potential functions. <br /><br />
maximal clique: <math>\left\{x_1, x_2\right\}</math>, <math>\left\{x_1, x_3\right\}</math>, <math>\left\{x_2, x_4\right\}</math>, <math>\left\{x_3, x_5\right\}</math>, <math>\left\{x_2,x_5,x_6\right\}</math> <br /><br />
potential functions: <math>\varphi{(x_1,x_2)},\varphi{(x_1,x_3)},\varphi{(x_2,x_4)}, \varphi{(x_3,x_5)}</math> and <math>\varphi{(x_2,x_5,x_6)}</math> <br />
<br />
<math> p(x_1|\overline{x_6})=p(x_1,\overline{x_6})/p(\overline{x_6}) \qquad (*) </math><br />
<br />
<math>p(x_1,\overline{x_6})=\frac{1}{Z}\sum_{x_2,x_3,x_4,x_5,x_6}\varphi{(x_1,x_2)}\varphi{(x_1,x_3)}\varphi{(x_2,x_4)}\varphi{(x_3,x_5)}\varphi{(x_2,x_5,x_6)}\delta{(x_6,\overline{x_6})}<br />
</math><br />
<br />
The <math>\frac{1}{Z}</math> factor looks crucial, but in fact it has no effect here, because in (*) both the numerator and the denominator carry the <math>\frac{1}{Z}</math> term, so it cancels. <br /><br />
The general rule for elimination in an undirected graph is that we can remove a node as long as we connect all of the remaining neighbours of that node together. Effectively, we form a clique out of the neighbours of that node.<br />
The algorithm used to eliminate nodes in an undirected graph is:<br />
<br />
<br />
<code><br />
<br/><br />
<br />
UndirectedGraphElimination(G,I)<br />
:For each node <math>X_i</math> in <math>I</math><br />
::Connect all of the remaining neighbours of <math>X_i</math><br />
::Remove <math>X_i</math> from the graph <br />
:End <br />
<br />
<br/><br />
</code><br />
<br />
<br />
'''Example: ''' <br /><br />
For the graph G in figure 24 <br /><br />
when we remove x1, G becomes as in figure 25 <br /><br />
while if we remove x2, G becomes as in figure 26<br />
<br />
[[File:ex.png|thumb|right|Fig.24 ]]<br />
[[File:ex2.png|thumb|right|Fig.25 ]]<br />
[[File:ex3.png|thumb|right|Fig.26 ]]<br />
<br />
An interesting point is that the order of elimination matters a great deal. Consider the two results above: if we remove one node first, the graph complexity is only slightly reduced, but if we remove the other node first, the complexity is significantly increased. We care about the complexity of the graph because it determines the number of calculations required to answer questions about that graph. If we had a huge graph with thousands of nodes, the order of node removal would be key to the complexity of the algorithm. Unfortunately, there is no efficient algorithm that can produce the optimal node-removal order. If we remove one of the leaves first, the largest clique created has size two and the computational complexity is of order <math>N^2</math>; removing the center node first makes the largest clique size five, with complexity of order <math>N^5</math> (where <math>N</math> is the number of states per variable). Hence it is very hard to find an optimal ordering; in fact, finding one is an NP-hard problem.<br />
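The UndirectedGraphElimination step above is easy to simulate; the sketch below (a hypothetical star-shaped graph) records the largest clique created during elimination, which governs the cost of inference:<br />

```python
def eliminate(adj, order):
    """Eliminate nodes in the given order, connecting the remaining neighbours
    of each removed node. Returns the size of the largest clique (a node plus
    its neighbours at removal time) formed along the way."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    largest = 1
    for v in order:
        nbrs = adj[v]
        largest = max(largest, len(nbrs) + 1)
        for a in nbrs:                 # connect all remaining neighbours pairwise
            for b in nbrs:
                if a != b:
                    adj[a].add(b)
        for a in nbrs:                 # then remove v from the graph
            adj[a].discard(v)
        del adj[v]
    return largest

# A star graph: centre 0 connected to leaves 1..5.
star = {0: {1, 2, 3, 4, 5}, **{i: {0} for i in range(1, 6)}}

leaves_first = eliminate(star, [1, 2, 3, 4, 5, 0])  # largest clique: 2 nodes
centre_first = eliminate(star, [0, 1, 2, 3, 4, 5])  # largest clique: 6 nodes
```

With <math>N</math> states per variable, the two orderings cost on the order of <math>N^2</math> versus <math>N^6</math> here, which is exactly why ordering matters.<br />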
<br />
==Moralization==<br />
So far we have shown how to use elimination to successively remove nodes from an undirected graph. We know that this is useful in the process of marginalization. We can now turn to the question of what will happen when we have a directed graph. It would be nice if we could somehow reduce the directed graph to an undirected form and then apply the previous elimination algorithm. This reduction is called moralization and the graph that is produced is called a moral graph. <br />
<br />
To moralize a graph we first need to connect the parents of each node together. This makes sense intuitively because the parents of a node need to be considered together in the undirected graph and this is only done if they form a type of clique. By connecting them together we create this clique. <br />
<br />
After the parents are connected together we can just drop the orientation on the edges in the directed graph. By removing the directions we force the graph to become undirected. <br />
<br />
The previous elimination algorithm can now be applied to the new moral graph. We can do this by taking the conditional probability functions of the directed graph, <math> P(x_i|x_{\pi_i}) </math>, as the potential functions <math> \psi_{c_i}(x_{c_i}) </math> of the undirected graph.<br />
<br />
'''Example:'''<br /><br />
I = <math>\left\{x_6,x_5,x_4,x_3,x_2,x_1\right\}</math><br /><br />
When we moralize the directed graph in figure 27, we obtain the<br />
undirected graph in figure 28.<br />
<br />
[[File:moral.png|thumb|right|Fig.27 Original Directed Graph]]<br />
[[File:moral3.png|thumb|right|Fig.28 Moral Undirected Graph]]<br />
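The two moralization steps (marry the parents, drop the arrows) can be sketched as follows, with the graph given as a child-to-parents map (the node numbering follows the six-node example used throughout):<br />

```python
def moralize(parents):
    """Moralize a DAG given as {node: set_of_parents}: connect ("marry")
    co-parents, then drop edge orientations. Returns an undirected
    adjacency dict."""
    adj = {v: set() for v in parents}
    for child, pas in parents.items():
        for p in pas:                  # drop the orientation of each edge
            adj[child].add(p)
            adj[p].add(child)
        for a in pas:                  # marry all co-parents pairwise
            for b in pas:
                if a != b:
                    adj[a].add(b)
    return adj

# The six-node directed graph used throughout the notes:
# x1 -> x2, x1 -> x3, x2 -> x4, x3 -> x5, (x2, x5) -> x6.
parents = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: {3}, 6: {2, 5}}
moral = moralize(parents)
```

Note the new undirected edge between <math>x_2</math> and <math>x_5</math>, the married parents of <math>x_6</math>; no other edges are added.<br />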
<br />
=Elimination Algorithm on Trees=<br />
<br />
<br />
'''Definition of a tree:'''<br /><br />
A tree is an undirected graph in which any two vertices are connected by exactly one simple path. In other words, any connected graph without cycles is a tree.<br />
<br />
If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree.<br />
<br />
==Belief Propagation Algorithm (Sum Product Algorithm)==<br />
<br />
One of the main disadvantages of the elimination algorithm is that the ordering of the nodes determines the number of calculations required to produce a result. The optimal ordering is difficult to find, and without a decent ordering the algorithm may become very slow. In response to this we introduce the sum-product algorithm. It has one major advantage over the elimination algorithm: it is faster. The sum-product algorithm has the same complexity to compute the marginals of all the nodes in the graph as it does to compute the marginal of one node. Unfortunately, the sum-product algorithm also has one disadvantage: unlike the elimination algorithm, it cannot be used on every graph. The sum-product algorithm works only on trees. <br />
<br />
For undirected graphs, if there is only one path between any pair of nodes then the graph is a tree (Fig.29); as stated above, a directed graph is considered a tree if its moral graph is a tree (Fig.30). <br />
<br />
[[File:UnDirTree.png|thumb|right|Fig.29 Undirected tree]]<br />
[[File:Dir_Tree.png|thumb|right|Fig.30 Directed tree]]<br />
<br />
For an undirected tree <math>G(V, E)</math> (Fig.29) we can write the joint probability distribution function in the following way.<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\prod_{i \in V}\psi(x_i)\prod_{(i,j) \in E}\psi(x_i, x_j)</math></center><br />
<br />
We know that in general we cannot convert a directed graph into an equivalent undirected graph. There is however an exception to this rule when it comes to trees. In the case of a directed tree there is an algorithm that allows us to convert it to an undirected tree with the same properties. <br /><br />
Take the above example (Fig.30) of a directed tree. We can write the joint probability distribution function as: <br />
<center><math> P(x_v) = P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
If we want to convert this graph to the undirected form shown in (Fig.29) then we can use the following set of rules.<br />
* If <math>\gamma</math> is the root then: <math> \psi(x_\gamma) = P(x_\gamma) </math>.<br />
* If <math>\gamma</math> is NOT the root then: <math> \psi(x_\gamma) = 1 </math>.<br />
* If <math>\left\lbrace i \right\rbrace</math> = <math>\pi_j</math> then: <math> \psi(x_i, x_j) = P(x_j | x_i) </math>.<br />
So now we can rewrite the above equation for (Fig.30) as:<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\psi(x_1)...\psi(x_5)\psi(x_1, x_2)\psi(x_1, x_3)\psi(x_2, x_4)\psi(x_2, x_5) </math></center><br />
<center><math> = \frac{1}{Z(\psi)}P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
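The three rules above can be written down directly; in this sketch the prior and the (shared, toy-numbered) conditional table are placeholders for the directed tree of Fig.30:<br />

```python
def tree_to_potentials(root, parent, prior, cpts):
    """Assign undirected potentials to a directed tree per the rules above.
    parent: {node: its parent} (root absent); prior: p(x_root) as a dict;
    cpts: {(i, j): table} giving p(x_j | x_i) for each directed edge i -> j,
    with table[xi][xj] a number. All variables are binary here."""
    node_pot = {}
    for v in {root} | set(parent):
        # psi(x_root) = p(x_root); psi(x_v) = 1 for every other node
        node_pot[v] = dict(prior) if v == root else {0: 1.0, 1: 1.0}
    # psi(x_i, x_j) = p(x_j | x_i) for each directed edge i -> j
    edge_pot = {(i, j): {(xi, xj): cpts[(i, j)][xi][xj]
                         for xi in [0, 1] for xj in [0, 1]}
                for (i, j) in cpts}
    return node_pot, edge_pot

# Directed tree of Fig.30: 1 -> 2, 1 -> 3, 2 -> 4, 2 -> 5 (toy numbers).
prior = {0: 0.6, 1: 0.4}
cpt = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
cpts = {(1, 2): cpt, (1, 3): cpt, (2, 4): cpt, (2, 5): cpt}
node_pot, edge_pot = tree_to_potentials(1, {2: 1, 3: 1, 4: 2, 5: 2}, prior, cpts)
```

With this assignment the product of all potentials equals the directed joint for every configuration, so <math>Z(\psi)=1</math>.<br />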
<br />
==Elimination Algorithm on a Tree==<br />
<br />
[[File:fig1.png|thumb|right|Fig.31 Message-passing in Elimination Algorithm]]<br />
<br />
We will derive the Sum-Product algorithm from the point of view<br />
of the Eliminate algorithm. To marginalize <math>x_1</math> in<br />
Fig.31,<br />
<center><math>\begin{matrix}<br />
p(x_1)&=&\sum_{x_2}\sum_{x_3}\sum_{x_4}\sum_{x_5}p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2)p(x_5|x_3) \\<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\sum_{x_3}p(x_3|x_2)\sum_{x_4}p(x_4|x_2)\underbrace{\sum_{x_5}p(x_5|x_3)} \\<br />
<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\underbrace{\sum_{x_3}p(x_3|x_2)m_5(x_3)}\underbrace{\sum_{x_4}p(x_4|x_2)} \\<br />
<br />
&=&p(x_1)\underbrace{\sum_{x_2}m_3(x_2)m_4(x_2)} \\<br />
<br />
&=&p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
where,<br />
<center><math>\begin{matrix}<br />
m_5(x_3)=\sum_{x_5}p(x_5|x_3)=\sum_{x_5}\psi(x_5)\psi(x_5,x_3)=\mathbf{m_{53}(x_3)} \\<br />
m_4(x_2)=\sum_{x_4}p(x_4|x_2)=\sum_{x_4}\psi(x_4)\psi(x_4,x_2)=\mathbf{m_{42}(x_2)} \\<br />
m_3(x_2)=\sum_{x_3}p(x_3|x_2)m_5(x_3)=\sum_{x_3}\psi(x_3)\psi(x_3,x_2)m_5(x_3)=\mathbf{m_{32}(x_2)}, \end{matrix}</math></center><br />
which is essentially (potential of the node)<math>\times</math>(potential of<br />
the edge)<math>\times</math>(message from the child).<br />
<br />
The term "<math>m_{ji}(x_i)</math>" represents the intermediate factor between the eliminated variable, ''j'', and the remaining neighbor of the variable, ''i''. Thus, in the above case, we will use <math>m_{53}(x_3)</math> to denote <math>m_5(x_3)</math>, <math>m_{42}(x_2)</math> to denote<br />
<math>m_4(x_2)</math>, and <math>m_{32}(x_2)</math> to denote <math>m_3(x_2)</math>. We refer to the<br />
intermediate factor <math>m_{ji}(x_i)</math> as a "message" that ''j''<br />
sends to ''i''. (Fig.31)<br />
<br />
In general,<center><math>\begin{matrix}<br />
m_{ji}(x_i)=\sum_{x_j}(<br />
\psi(x_j)\psi(x_j,x_i)\prod_{k\in{\mathcal{N}(j)\backslash i}}m_{kj}(x_j))<br />
\end{matrix}</math></center><br />
<br />
Note: it is important to know that the BP algorithm gives the exact solution only if the graph is a tree; however, experiments have shown that BP gives acceptable approximate answers even when the graph has some loops.<br />
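The message recursion above can be sketched directly in Python for the tree of Fig.31 (toy potentials); the marginal of <math>x_1</math> obtained from the messages is checked against brute-force summation over the other variables:<br />

```python
import itertools

# Tree of Fig.31: edges 1-2, 2-3, 2-4, 3-5, with toy binary potentials.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (2, 4), (3, 5)]
psi_node = {v: {0: 1.0, 1: 1.0} for v in nodes}
psi_node[1] = {0: 0.6, 1: 0.4}
psi_edge = {e: {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}
            for e in edges}

def nbrs(v):
    return [a if b == v else b for (a, b) in edges if v in (a, b)]

def edge_pot(i, j, xi, xj):
    return psi_edge[(i, j)][(xi, xj)] if (i, j) in psi_edge else psi_edge[(j, i)][(xj, xi)]

def message(j, i):
    """m_{ji}(x_i) = sum_{x_j} psi(x_j) psi(x_j, x_i) prod_{k in N(j), k != i} m_{kj}(x_j)."""
    return {xi: sum(psi_node[j][xj] * edge_pot(j, i, xj, xi) * incoming(j, i, xj)
                    for xj in [0, 1])
            for xi in [0, 1]}

def incoming(j, exclude, xj):
    """Product of messages flowing into j from all neighbours except `exclude`."""
    prod = 1.0
    for k in nbrs(j):
        if k != exclude:
            prod *= message(k, j)[xj]
    return prod

# Marginal of x1: its own potential times the message from node 2, normalized.
m21 = message(2, 1)
unnorm = {x1: psi_node[1][x1] * m21[x1] for x1 in [0, 1]}
p_x1 = {x1: unnorm[x1] / sum(unnorm.values()) for x1 in [0, 1]}

# Brute-force check over all 2^5 configurations.
def full_product(cfg):
    x = dict(zip(nodes, cfg))
    p = 1.0
    for v in nodes:
        p *= psi_node[v][x[v]]
    for (a, b) in edges:
        p *= psi_edge[(a, b)][(x[a], x[b])]
    return p

tot = sum(full_product(c) for c in itertools.product([0, 1], repeat=5))
brute_p_x1 = {v: sum(full_product(c) for c in itertools.product([0, 1], repeat=5)
                     if c[0] == v) / tot for v in [0, 1]}
```

Running the same recursion toward every node, with messages cached per the Message-Passing Protocol, yields all singleton marginals at roughly twice the cost of one.<br />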
<br />
==Elimination To Sum Product Algorithm==<br />
<br />
[[File:fig2.png|thumb|right|Fig.32 All of the messages needed to compute all singleton<br />
marginals]]<br />
<br />
The Sum-Product algorithm allows us to compute all<br />
marginals in the tree by passing messages inward from the leaves of<br />
the tree to an (arbitrary) root, and then passing it outward from the<br />
root to the leaves, again using the above equation at each step. The net effect is<br />
that a single message will flow in both directions along each edge.<br />
(See Fig.32) Once all such messages have been computed using the above equation,<br />
we can compute desired marginals. One of the major advantages of this algorithm is that<br />
messages can be reused which reduces the computational cost heavily.<br />
<br />
As shown in Fig.32, to compute the marginal of <math>X_1</math> using<br />
elimination, we eliminate <math>X_5</math>, which involves computing a message<br />
<math>m_{53}(x_3)</math>, then eliminate <math>X_4</math> and <math>X_3</math> which involves<br />
messages <math>m_{32}(x_2)</math> and <math>m_{42}(x_2)</math>. We subsequently eliminate<br />
<math>X_2</math>, which creates a message <math>m_{21}(x_1)</math>.<br />
<br />
Suppose that we want to compute the marginal of <math>X_2</math>. As shown in<br />
Fig.33, we first eliminate <math>X_5</math>, which creates <math>m_{53}(x_3)</math>, and<br />
then eliminate <math>X_3</math>, <math>X_4</math>, and <math>X_1</math>, passing messages<br />
<math>m_{32}(x_2)</math>, <math>m_{42}(x_2)</math> and <math>m_{12}(x_2)</math> to <math>X_2</math>.<br />
<br />
[[File:fig3.png|thumb|right|Fig.33 The messages formed when computing the marginal of <math>X_2</math>]]<br />
<br />
Since the messages can be "reused", the marginals for all possible<br />
elimination orderings can be obtained by computing all possible<br />
messages, of which there are far fewer than there are possible<br />
elimination orderings.<br />
<br />
The Sum-Product algorithm is based not only on the above equation, but also on the ''Message-Passing Protocol''.<br />
'''Message-Passing Protocol''' tells us that a node can<br />
send a message to a neighboring node when (and only when) it has<br />
received messages from all of its other neighbors.<br />
<br />
<br />
<br />
===For Directed Graph===<br />
Previously we stated that:<br />
<center><math><br />
p(x_F,\bar{x}_E)=\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E),<br />
</math></center><br />
<br />
Using the above equation, we find the marginal of <math>\bar{x}_E</math>.<br />
<center><math>\begin{matrix}<br />
p(\bar{x}_E)&=&\sum_{x_F}\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E) \\<br />
&=&\sum_{x_v}p(x_F,x_E)\delta (x_E,\bar{x}_E)<br />
\end{matrix}</math></center><br />
<br />
Now we denote:<br />
<center><math><br />
p^E(x_v) = p(x_v) \delta (x_E,\bar{x}_E)<br />
</math></center><br />
<br />
Since the sets ''F'' and ''E'' together make up <math>\mathcal{V}</math>,<br />
<math>p(x_v)</math> is equal to <math>p(x_F,x_E)</math>. Thus we can substitute the<br />
definition of <math>p^E(x_v)</math> into the two equations above, and they become:<br />
<center><math>\begin{matrix}<br />
p(x_F,\bar{x}_E) = \sum_{x_E} p^E(x_v), \\<br />
p(\bar{x}_E) = \sum_{x_v}p^E(x_v)<br />
\end{matrix}</math></center><br />
<br />
We are interested in finding the conditional probability. We<br />
substitute the previous results into the conditional<br />
probability equation.<br />
<br />
<center><math>\begin{matrix}<br />
p(x_F|\bar{x}_E)&=&\frac{p(x_F,\bar{x}_E)}{p(\bar{x}_E)} \\<br />
&=&\frac{\sum_{x_E}p^E(x_v)}{\sum_{x_v}p^E(x_v)}<br />
\end{matrix}</math></center><br />
<math>p^E(x_v)</math> is an unnormalized version of conditional probability,<br />
<math>p(x_F|\bar{x}_E)</math>. <br />
<br />
===For Undirected Graphs===<br />
<br />
We denote <math>\psi^E</math> to be:<br />
<center><math>\begin{matrix}<br />
\psi^E(x_i) = \psi(x_i)\delta(x_i,\bar{x}_i),& & \mbox{if } i\in{E} \\<br />
\psi^E(x_i) = \psi(x_i),& & \mbox{otherwise}<br />
\end{matrix}</math></center><br />
<br />
==Max-Product==<br />
Because multiplication distributes over max as well as sum (for non-negative factors):<br />
<br />
<center><math>\begin{matrix}<br />
\max(ab,ac) = a\max(b,c), & a \geq 0<br />
\end{matrix}</math></center><br />
<br />
Formally, both the (sum, product) and (max, product) pairs form commutative semirings.<br />
<br />
We would like to find the maximum probability that can be achieved by some set of random variables given a set of configurations. The algorithm is similar to the sum-product algorithm, except that we replace the sum with a max. <br /><br />
<br />
[[File:suks.png|thumb|right|Fig.33 Max Product Example]]<br />
<br />
<center><math>\begin{matrix}<br />
\max_{x}{P(x)} & = & \max_{x_1}\max_{x_2}\max_{x_3}\max_{x_4}\max_{x_5}{P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_2)P(x_5|x_3)} \\<br />
& = & \max_{x_1}{P(x_1)}\max_{x_2}{P(x_2|x_1)}\max_{x_3}{P(x_3|x_2)}\max_{x_4}{P(x_4|x_2)}\max_{x_5}{P(x_5|x_3)}<br />
\end{matrix}</math></center><br />
<br />
More generally, to maximize the conditional probability <math>p(x_F|\bar{x}_E)</math> we take the sum-product message and replace the sum with a max:<br />
<br />
<center><math>m_{ji}(x_i)=\sum_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}}</math></center><br />
<center><math>m^{max}_{ji}(x_i)=\max_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}}</math></center><br />
<br />
<br />
'''Example:''' <br />
Consider the graph in Figure.33. <br />
<center><math> m^{max}_{53}(x_3)=\max_{x_5}{\psi^{E}{(x_5)}\psi{(x_3,x_5)}} </math></center><br />
<center><math> m^{max}_{32}(x_2)=\max_{x_3}{\psi^{E}{(x_3)}\psi{(x_2,x_3)}m^{max}_{53}(x_3)} </math></center><br />
<br />
==Maximum configuration==<br />
We would also like to find the values of the <math>x_i</math>s which produce that largest value. To do this we replace the max from the previous section with argmax. <br /><br />
<math>m^{arg}_{53}(x_3)= \arg\max_{x_5}\psi{(x_5)}\psi{(x_5,x_3)}</math><br /><br />
<math>\log{m^{max}_{ji}(x_i)}=\max_{x_j}\left[\log{\psi^{E}{(x_j)}}+\log{\psi{(x_i,x_j)}}+\sum_{k\in{N(j)\backslash{i}}}\log{m^{max}_{kj}{(x_j)}}\right]</math><br /><br />
In many cases we want to work in the log domain, because products of many probabilities tend to underflow. Also, it is important to note that all of this carries over to the continuous case, where we replace the summation sign with an integral.<br />
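A sketch of max-product with argmax bookkeeping on a hypothetical three-node chain x1 - x2 - x3 (toy potentials): messages are passed inward to <math>x_1</math>, argmax pointers are stored, the most probable configuration is read off by backtracking, and the log-domain score agrees with the product form:<br />

```python
import math

# Toy potentials on the chain x1 - x2 - x3 (illustrative numbers only).
psi_node = {1: {0: 0.6, 1: 0.4}, 2: {0: 1.0, 1: 1.0}, 3: {0: 1.0, 1: 1.0}}
psi_edge = {(1, 2): {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7},
            (2, 3): {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.5, (1, 1): 0.5}}

# Max-product messages toward x1, keeping argmax pointers for decoding.
m32, back3 = {}, {}
for x2 in [0, 1]:
    vals = {x3: psi_node[3][x3] * psi_edge[(2, 3)][(x2, x3)] for x3 in [0, 1]}
    back3[x2] = max(vals, key=vals.get)
    m32[x2] = vals[back3[x2]]

m21, back2 = {}, {}
for x1 in [0, 1]:
    vals = {x2: psi_node[2][x2] * psi_edge[(1, 2)][(x1, x2)] * m32[x2]
            for x2 in [0, 1]}
    back2[x1] = max(vals, key=vals.get)
    m21[x1] = vals[back2[x1]]

scores = {x1: psi_node[1][x1] * m21[x1] for x1 in [0, 1]}
x1_star = max(scores, key=scores.get)
x2_star = back2[x1_star]                 # trace the argmax pointers back
x3_star = back3[x2_star]

# Log-domain version: products become sums, which avoids underflow on long chains.
log_score = (math.log(psi_node[1][x1_star]) + math.log(psi_node[2][x2_star])
             + math.log(psi_node[3][x3_star])
             + math.log(psi_edge[(1, 2)][(x1_star, x2_star)])
             + math.log(psi_edge[(2, 3)][(x2_star, x3_star)]))
```

The backtracking step is what "replace max with argmax" amounts to in practice: each max-message remembers which value of the eliminated variable achieved it.<br />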
<br />
=Parameter Learning=<br />
<br />
The goal of graphical models is to build a useful representation of the input data in order to understand the data and design learning algorithms. A graphical model represents a joint probability distribution over its nodes (random variables). One of the most important features of a graphical model is its representation of the conditional independences between the graph nodes, achieved using local functions which multiply together to form a factorization. Such a factorization, in turn, represents a joint probability distribution and hence the conditional independences lying in that distribution. This does not mean, however, that the graphical model represents all the independence assumptions that hold in the distribution. <br />
<br />
==Basic Statistical Problems==<br />
In statistics there are a number of different 'standard' problems that always appear in one form or another. They are as follows: <br />
<br />
* Regression<br />
* Classification<br />
* Clustering<br />
* Density Estimation<br />
<br />
<br />
<br />
===Regression===<br />
In regression we have a set of data points <math> (x_i, y_i) </math> for <math> i = 1...n </math> and we would like to determine the way that the variables x and y are related. In certain cases such as (Fig.34) we try to fit a line (or other type of function) through the points in such a way that it describes the relationship between the two variables. <br />
<br />
[[File:regression.png|thumb|right|Fig.34 Regression]]<br />
<br />
Once the relationship has been determined we can give a functional value to the following expression. In this way we can determine the value (or distribution) of y if we have the value for x. <br />
<math>P(y|x)=\frac{P(y,x)}{P(x)} = \frac{P(y,x)}{\int_{y}{P(y,x)dy}}</math><br />
<br />
===Classification===<br />
In classification we also have a set of data points, each containing a set of features <math> (x_1, x_2,.. ,x_i) </math> for <math> i = 1...n </math>, and we would like to assign each data point to one of a given number of classes y. Consider the example in (Fig.35) where two sets of features have been divided into the sets + and - by a line. The purpose of classification is to find this line and then to place any new points into one group or the other. <br />
<br />
[[File:Classification.png|thumb|right|Fig.35 Classify Points into Two Sets]]<br />
<br />
We would like to obtain the probability distribution given by the following equation, where c is the class and x and y are the coordinates of a data point. In simple terms, we would like to find the probability that a point belongs to class c given the observed values x and y. <br />
<center><math> P(c|x,y)=\frac{P(c,x,y)}{P(x,y)} = \frac{P(c,x,y)}{\sum_{c}{P(c,x,y)}} </math></center><br />
<br />
===Clustering===<br />
Clustering is an unsupervised learning method that assigns data points to groups, or clusters, based on the similarity between the data points. Clustering is similar to classification, except that we do not know the groups before we gather and examine the data. We would like to find the probability distribution in the following equation without knowing the value of c. <br />
<center><math> P(c|x)=\frac{P(c,x)}{P(x)}\ \ c\ unknown </math></center><br />
<br />
===Density Estimation===<br />
Density Estimation is the problem of modeling a probability density function p(x), given a finite number of data points<br />
drawn from that density function. <br />
<center><math> P(y|x)=\frac{P(y,x)}{P(x)} \ \ x\ unknown </math></center><br />
<br />
We can use graphs to represent the four types of statistical problems that have been introduced so far. The first graph (Fig.36(a)) can be used to represent either the Regression or the Classification problem because both the X and the Y variables are known. In the second graph (Fig.36(b)) the value of the Y variable is unknown, so this graph represents the Clustering or Density Estimation situation. <br />
<br />
[[File:RegClass.png|thumb|right|Fig.36(a) Regression or classification (b) Clustering or Density Estimation]]<br />
<br />
<br />
==Likelihood Function==<br />
Recall that the probability model <math>p(x|\theta)</math> has the intuitive interpretation of assigning probability to X for each fixed value of <math>\theta</math>. In the Bayesian approach this intuition is formalized by treating <math>p(x|\theta)</math> as a conditional probability distribution. In the Frequentist approach, however, we treat <math>p(x|\theta)</math> as a function of <math>\theta</math> for fixed x, and refer to <math>p(x|\theta)</math> as the likelihood function.<br />
<center><math><br />
L(\theta;x)= p(x|\theta)</math></center><br />
where <math>p(x|\theta)</math> is the likelihood L(<math>\theta, x</math>)<br />
<center><math><br />
l(\theta,x)=log(p(x|\theta))<br />
</math></center><br />
where <math>log(p(x|\theta))</math> is the log likelihood <math>l(\theta, x)</math><br />
<br />
Since <math>p(x)</math> in the denominator of Bayes Rule is independent of <math>\theta</math> we can consider it as a constant and we can draw the conclusion that:<br />
<br />
<center><math><br />
p(\theta|x) \propto p(x|\theta)p(\theta)<br />
</math></center><br />
<br />
Symbolically, we can interpret this as follows:<br />
<center><math><br />
Posterior \propto likelihood \times prior<br />
</math></center><br />
<br />
where we see that in the Bayesian approach the likelihood can be<br />
viewed as a data-dependent operator that transforms between the<br />
prior probability and the posterior probability.<br />
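The "posterior ∝ likelihood × prior" relation can be made concrete with a small numeric sketch. The three candidate values of <math>\theta</math> and the data below are illustrative choices, not from the notes: we place a discrete uniform prior over three possible coin biases and update it after observing 3 heads in 4 tosses.<br />

```python
# Discrete sketch of: posterior ∝ likelihood × prior
thetas = [0.2, 0.5, 0.8]          # candidate values of theta (illustrative)
prior  = [1/3, 1/3, 1/3]          # uniform prior p(theta)
heads, tails = 3, 1               # observed data

# likelihood p(x | theta) of the observed sequence for each candidate
likelihood = [t**heads * (1 - t)**tails for t in thetas]

# unnormalized posterior, then divide by the evidence p(x) = sum over theta
unnorm = [l * p for l, p in zip(likelihood, prior)]
posterior = [u / sum(unnorm) for u in unnorm]

print(posterior)   # mass shifts toward theta = 0.8 after seeing mostly heads
```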
<br />
<br />
===Maximum likelihood===<br />
The idea of maximum likelihood estimation is to find the optimal values of the parameters by maximizing a likelihood function formed from the training data. Suppose in particular that we force the Bayesian to choose a<br />
particular value of <math>\theta</math>; that is, to reduce the posterior<br />
distribution <math>p(\theta|x)</math> to a point estimate. Various<br />
possibilities present themselves; in particular one could choose the<br />
mean of the posterior distribution or perhaps the mode.<br />
<br />
<br />
(i) the mean of the posterior (expectation):<br />
<center><math><br />
\hat{\theta}_{Bayes}=\int \theta p(\theta|x)\,d\theta<br />
</math></center><br />
<br />
is called ''Bayes estimate''.<br />
<br />
OR<br />
<br />
(ii) the mode of posterior:<br />
<center><math>\begin{matrix}<br />
\hat{\theta}_{MAP}&=&argmax_{\theta} p(\theta|x) \\<br />
&=&argmax_{\theta}p(x|\theta)p(\theta)<br />
\end{matrix}</math></center><br />
<br />
Note that MAP is '''Maximum a posterior'''.<br />
<br />
<center><math> \hat{\theta}_{MAP} \longrightarrow \hat{\theta}_{ML}</math></center><br />
When the prior probability, <math>p(\theta)</math>, is taken to be uniform on <math>\theta</math>, the MAP estimate reduces to the maximum likelihood estimate, <math>\hat{\theta}_{ML}</math>.<br />
<br />
<center><math> MAP = argmax_{\theta} p(x|\theta) p(\theta) </math></center><br />
<br />
When the prior is not taken to be uniform, the MAP estimate is obtained by maximizing the posterior itself; since the logarithm is a monotonic function, taking logs does not alter the optimizing value.<br />
<br />
Thus, one has:<br />
<center><math><br />
\hat{\theta}_{MAP}=argmax_{\theta} \{ log p(x|\theta) + log<br />
p(\theta) \}<br />
</math></center><br />
as an alternative expression for the MAP estimate.<br />
<br />
Here, <math>log (p(x|\theta))</math> is log likelihood and the "penalty" is the<br />
additive term <math>log(p(\theta))</math>. Penalized log likelihoods are widely<br />
used in Frequentist statistics to improve on maximum likelihood<br />
estimates in small sample settings.<br />
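The penalized form above can be sketched for the Bernoulli case. The Beta(a, b) prior is an illustrative choice (not from the notes); its log density supplies the additive penalty <math>log(p(\theta))</math>, and the resulting MAP estimate has a well-known closed form.<br />

```python
# Sketch comparing the ML and MAP estimates for a Bernoulli parameter.
def bernoulli_ml(heads, n):
    # argmax of log p(x|theta): the raw frequency
    return heads / n

def bernoulli_map(heads, n, a=2.0, b=2.0):
    # argmax of log p(x|theta) + log p(theta) with an assumed Beta(a, b) prior;
    # the closed form is (heads + a - 1) / (n + a + b - 2)
    return (heads + a - 1) / (n + a + b - 2)

print(bernoulli_ml(3, 4))    # 0.75
print(bernoulli_map(3, 4))   # 4/6 ≈ 0.667, pulled toward the prior mean 0.5
```

With a uniform prior (a = b = 1) the penalty is constant and the MAP estimate coincides with the ML estimate, matching the statement above.<br />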
<br />
===Example : Bernoulli trials===<br />
<br />
Consider the simple experiment where a biased coin is tossed four times. Suppose now that we also have some data <math>D</math>: <br />e.g. <math>D = \left\lbrace h,h,h,t\right\rbrace </math>. We want to use this data to estimate <math>\theta</math>. The probability of observing head is <math> p(H)= \theta</math> and the probability of observing a tail is <math> p(T)= 1-\theta</math>.<br />
where the conditional probability of a single toss <math>x_i</math> is <center><math> P(x_i|\theta) = \theta^{x_i}(1-\theta)^{(1-x_i)} </math></center><br />
<br />
We would now like to use the ML technique. Since all of the variables are iid, there are no dependencies between the variables, and so we have no edges from one node to another.<br />
<br />
How do we find the joint probability distribution function for these variables? Well since they are all independent we can just multiply the marginal probabilities and we get the joint probability. <br />
<center><math>L(\theta;x) = \prod_{i=1}^n P(x_i|\theta)</math></center><br />
This is in fact the likelihood that we want to work with. Now let us try to maximise it: <br />
<center><math>\begin{matrix}<br />
l(\theta;x) & = & log(\prod_{i=1}^n P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(\theta^{x_i}(1-\theta)^{1-x_i}) \\<br />
& = & \sum_{i=1}^n x_ilog(\theta) + \sum_{i=1}^n (1-x_i)log(1-\theta) \\<br />
\end{matrix}</math></center><br />
Take the derivative and set it to zero: <br />
<br />
<center><math> \frac{\partial l}{\partial\theta} = 0 </math></center><br />
<center><math> \frac{\partial l}{\partial\theta} = \sum_{i=1}^{n}\frac{x_i}{\theta} - \sum_{i=1}^{n}\frac{1-x_i}{1-\theta} = 0 </math></center><br />
<center><math> \Rightarrow \frac{\sum_{i=1}^{n}x_i}{\theta} = \frac{\sum_{i=1}^{n}(1-x_i)}{1-\theta} </math></center><br />
<center><math> \frac{NH}{\theta} = \frac{NT}{1-\theta} </math></center> <br />
Where: <br />
NH = the number of observed heads <br /><br />
NT = the number of observed tails <br /><br />
Hence, <math>NT + NH = n</math> <br /><br />
<br />
And now we can solve for <math>\theta</math>: <br />
<br />
<center><math>\begin{matrix}<br />
\theta & = & \frac{(1-\theta)NH}{NT} \\<br />
\theta + \theta\frac{NH}{NT} & = & \frac{NH}{NT} \\<br />
\theta(\frac{NT+NH}{NT}) & = & \frac{NH}{NT} \\<br />
\theta & = & \frac{\frac{NH}{NT}}{\frac{n}{NT}} = \frac{NH}{n}<br />
\end{matrix}</math></center><br />
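The closed form just derived, <math>\hat{\theta} = NH/n</math>, can be checked numerically by maximizing the log likelihood over a grid; the data set <math>D = \{h,h,h,t\}</math> is the one from the example.<br />

```python
import math

data = [1, 1, 1, 0]               # D = {h, h, h, t}; 1 = head, 0 = tail
nh, n = sum(data), len(data)
theta_hat = nh / n                # the MLE from the derivation: NH / n

def loglik(theta):
    # Bernoulli log likelihood: sum of x_i log(theta) + (1 - x_i) log(1 - theta)
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in data)

# brute-force check: the grid maximizer agrees with the closed form
grid = [i / 1000 for i in range(1, 1000)]
grid_argmax = max(grid, key=loglik)

print(theta_hat)      # 0.75
print(grid_argmax)    # 0.75
```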
<br />
===Example : Multinomial trials===<br />
Recall from the previous example that a Bernoulli trial has only two outcomes (e.g. Head/Tail, Failure/Success, …). A Multinomial trial is a generalization of the Bernoulli trial to K possible outcomes, where K > 2. Let <math> p(k) = \theta_k </math> be the probability of outcome k. The <math>\theta_k</math> parameters must satisfy:<br />
<br />
<math> 0 \leq \theta_k \leq 1</math><br />
<br />
and<br />
<br />
<math> \sum_k \theta_k = 1</math><br />
<br />
Consider the example of rolling a die M times and recording the number of times each of the die's six faces is observed. Let <math> N_k </math> be the number of times that face k was observed.<br />
<br />
Let <math>[x^m = k]</math> be a binary indicator that equals one if <math>x^m = k</math>, and zero otherwise. The log likelihood function for the Multinomial distribution is:<br />
<br />
<math>l(\theta; D) = log( p(D|\theta) )</math><br />
<br />
<math>= log(\prod_m \theta_{x^m})</math><br />
<br />
<math>= log(\prod_m \theta_{1}^{[x^m = 1]} ... \theta_{k}^{[x^m = k]})</math><br />
<br />
<math>= \sum_k log(\theta_k) \sum_m [x^m = k]</math><br />
<br />
<math>= \sum_k N_k log(\theta_k)</math><br />
<br />
Take the derivative with respect to each <math>\theta_k</math> and set it to zero, enforcing the constraint <math>\sum_k \theta_k = 1</math> with a Lagrange multiplier (whose value works out to be M):<br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = 0</math><br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = \frac{N_k}{\theta_k} - M = 0</math><br />
<br />
<math>\Rightarrow \theta_k = \frac{N_k}{M}</math><br />
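The result <math>\theta_k = N_k/M</math> says that each face's estimate is simply its observed frequency; a short sketch with made-up rolls:<br />

```python
from collections import Counter

# M = 10 illustrative die rolls (made up)
rolls = [1, 3, 3, 6, 2, 3, 1, 6, 5, 4]
M = len(rolls)
counts = Counter(rolls)                     # N_k for each face k

# multinomial MLE: theta_k = N_k / M
theta = {k: counts[k] / M for k in range(1, 7)}
print(theta)                    # e.g. face 3 gets 3/10
print(sum(theta.values()))      # the estimates sum to 1, as required
```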
<br />
<br />
===Example: Univariate Normal===<br />
Now let us assume that the observed values come from a normal distribution. <br />
Our new model looks like:<br />
<center><math>P(x_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}} </math></center><br />
Now to find the likelihood we once again multiply the independent marginal probabilities to obtain the joint probability and the likelihood function. <br />
<center><math> L(\theta;x) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}}</math></center> <br />
<center><math> \max_{\theta}l(\theta;x) = \max_{\theta}\sum_{i=1}^{n}\left(-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^{2}+log\frac{1}{\sqrt{2\pi}\sigma}\right) </math></center><br />
Now, since our parameter theta is in fact a set of two parameters, <br />
<center><math>\theta = (\mu, \sigma)</math></center><br />
we must estimate each of the parameters separately. <br />
<center><math>\frac{\partial l}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i</math></center><br />
<center><math>\frac{\partial l}{\partial \sigma^{2}} = \frac{1}{2\sigma ^4} \sum _{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2\sigma ^2} = 0</math></center><br />
<center><math> \Rightarrow \hat{\sigma} ^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 </math></center><br />
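These estimators are the sample mean and the (biased, 1/n) sample variance; a numerical sanity check on data drawn from an assumed N(2, 0.5²):<br />

```python
import math
import random

random.seed(0)
# illustrative data from an assumed N(mu=2.0, sigma=0.5)
data = [random.gauss(2.0, 0.5) for _ in range(10000)]

n = len(data)
mu_hat = sum(data) / n                                   # MLE of mu
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n    # note 1/n, not 1/(n-1)

print(round(mu_hat, 2))                  # close to 2.0
print(round(math.sqrt(sigma2_hat), 2))   # close to 0.5
```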
<br />
==Discriminative vs Generative Models==<br />
[[File:GenerativeModel.png|thumb|right|Fig.36i Generative Model represented in a graph.]]<br />
(beginning of Oct. 18)<br />
<br />
If we call the evidence/features variable <math>X\,\!</math> and the output variable <math>Y\,\!</math>, one way to model a classifier is to base the definition of the joint distribution on <math>p(X|Y)\,\!</math>, and another is to base it on <math>p(Y|X)\,\!</math>. The first of these two approaches is called generative, while the second is called discriminative. The philosophy behind this naming becomes clear by looking at the way each conditional probability function tries to present a model. In practice, using generative models (e.g. the Bayes classifier) often requires assumptions that may not be valid for the problem at hand, which can make the model depart from the primary intentions of the design. This tends not to be the case for discriminative models (e.g. logistic regression), as they do not depend on many assumptions beyond the given data.<br />
<br />
[[File:DiscriminativeModel.png|thumb|right|Fig.36ii Discriminative Model represented in a graph.]]<br />
<br />
Given <math>N</math> variables, we have a full joint distribution in a generative model. In this model we can identify the conditional independencies between various random variables. This joint distribution can be factorized into various conditional distributions. One can also define the prior distributions that affect the variables.<br />
Here is an example that represents a generative model for classification in terms of the directed graphical model shown in Figure 36i. The following have to be estimated to fit the model: the class-conditional probability <math>P(X|Y)</math>, and the marginal and prior probabilities. Examples that use generative approaches are hidden Markov models, Markov random fields, etc. <br />
<br />
The discriminative approach to classification is displayed as a graph in Figure 36ii. In discriminative models, however, the dependencies between various random variables are not explicitly defined. We only need to estimate the conditional probability <math>P(Y|X)</math>. Examples that use the discriminative approach are neural networks, logistic regression, etc.<br />
<br />
Sometimes it becomes very hard to compute <math>P(X|Y)</math> if <math>X</math> is high dimensional (like data from images). Hence, we tend to omit the intermediate step and calculate <math>P(Y|X)</math> directly. In higher dimensions, we often assume the features are conditionally independent so that the model does not overfit.<br />
<br />
==Markov Models==<br />
Markov models, introduced by Andrey (Andrei) Andreyevich Markov as a way of modeling Russian poetry, are known as a good way of modeling those processes which progress over time or space. Basically, a Markov model can be formulated as follows:<br />
<br />
<center><math><br />
y_t=f(y_{t-1},y_{t-2},\ldots,y_{t-k})<br />
</math></center><br />
<br />
Which can be interpreted by the dependence of the current state of a variable on its last <math>k</math> states. (Fig. XX)<br />
<br />
The Maximum Entropy Markov model is a type of Markov model which makes the current state of a variable dependent on some global variables, in addition to the local dependencies. As an example, we can treat the sequence of words in a context as a local variable, as the appearance of each word depends mostly on the words that have come before it (n-grams). However, the role of POS (part of speech) tags cannot be denied, as they clearly affect the sequence of words. In this example, the POS tags are global dependencies, whereas the preceding words are local ones.<br />
===Markov Chain===<br />
The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time. In this context, the Markov property suggests that the distribution for this variable depends only on the distribution of the previous state.<br />
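A Markov chain can be sketched in a few lines: the next state is sampled from the row of a transition matrix indexed by the current state only. The two weather states and their transition probabilities below are illustrative.<br />

```python
import random

# A[s][s'] = P(next = s' | current = s); illustrative values
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(start, steps, rng):
    state, path = start, [start]
    for _ in range(steps):
        r, acc = rng.random(), 0.0
        for nxt, p in A[state].items():   # inverse-CDF sampling over the row
            acc += p
            if r < acc:
                state = nxt
                break
        path.append(state)
    return path

rng = random.Random(0)
print(simulate("sunny", 5, rng))
```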
<br />
==Hidden Markov Models (HMM)==<br />
Markov models cannot address a scenario in which the states themselves cannot be observed, and only a probabilistic function of those hidden states is available. Hidden Markov models extend Markov models to these scenarios, where each observation is a probabilistic function of the state. An example of an HMM is the formation of a DNA sequence: a hidden process generates amino acids depending on some probabilities that determine the exact sequence. The main questions that can be answered with an HMM are the following:<br />
<br />
* How can one estimate the probability of occurrence of an observation sequence?<br />
* How can we choose the state sequence such that the joint probability of the observation sequence is maximized?<br />
* How can we describe an observation sequence through the model parameters?<br />
<br />
[[File:HMMorder1.png|thumb|right|Fig.37 Hidden Markov model of order 1.]]<br />
<br />
An example of an HMM of order 1 is displayed in Figure 37. The most common examples arise in the study of gene analysis or gene sequencing, and the joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4,y_5) = P(y_1)P(y_2|y_1)P(y_3|y_2)P(y_4|y_3)P(y_5|y_4). </math></center><br />
<br />
[[File:HMMorder2.png|thumb|right|Fig.38 Hidden Markov model of order 2.]]<br />
<br />
An HMM of order 2 is displayed in Figure 38. The joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4) = P(y_1,y_2)P(y_3|y_1,y_2)P(y_4|y_2,y_3). </math></center><br />
<br />
In a Hidden Markov Model (HMM) we consider that we have two levels of random variables. The first level is called the hidden layer because the random variables in that level cannot be observed. The second layer is the observed or output layer. We can sample from the output layer but not the hidden layer. The only information we know about the hidden layer is that it affects the output layer. The HMM model can be graphed as shown in Figure 39. <br />
<br />
P.S. The latent variables in Figure 39 are discrete as we are discussing HMM. Factor analysis is concerned, instead of HMM, when such variables are continuous.<br />
[[File:HMM.png|thumb|right|Fig.39 Hidden Markov Model]]<br />
<br />
In the model the <math>q_i</math>s are the hidden layer and the <math>y_i</math>s are the output layer. The <math>y_i</math>s are shaded because they have been observed. The parameters that need to be estimated are <math> \theta = (\pi, A, \eta)</math>, where <math>\pi</math> is the distribution over the starting state <math>q_0</math>: <math>\pi_i</math> is the probability that <math>q_0</math> is in state <math>i</math>. The matrix <math>A</math> is the transition matrix for the states <math>q_t</math> and <math>q_{t+1}</math>, and gives the probability of changing states as we move from one step to the next. Finally, <math>\eta</math> holds the emission parameters: the probability that <math>y_t</math> takes the value <math>y^*</math> given that <math>q_t</math> is in state <math>q^*</math>. <br /><br />
For the HMM our data comes from the output layer: <br />
<center><math> Data = (y_{0i}, y_{1i}, y_{2i}, ... , y_{Ti}) \text{ for } i = 1...n </math></center><br />
We can now write the joint pdf as:<br />
<center><math> P(q, y) = p(q_0)\prod_{t=0}^{T-1}P(q_{t+1}|q_t)\prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
We can use <math>a_{ij}</math> to represent the <math>(i,j)</math> entry of the matrix A. We can then define:<br />
<center><math> P(q_{t+1}|q_t) = \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} </math></center><br />
We can also define:<br />
<center><math> p(q_0) = \prod_{i=1}^M (\pi_i)^{q_0^i} </math></center><br />
Now, if we take Y to be multinomial we get:<br />
<center><math> P(y_t|q_t) = \prod_{i,j=1}^M (\eta_{ij})^{y_t^i q_t^j} </math></center><br />
The random variable Y does not have to be multinomial, this is just an example. We can combine the first two of these definitions back into the joint pdf to produce:<br />
<center><math> P(q, y) = \prod_{i=1}^M (\pi_i)^{q_0^i}\prod_{t=0}^{T-1} \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
We can go on to the E-Step with this new joint pdf. In the E-Step we need to find the expectation of the missing data given the observed data and the initial values of the parameters. Suppose that we only sample once so <math>n=1</math>. Take the log of our pdf and we get:<br />
<center><math> l_c(\theta; q, y) = \sum_{i=1}^M {q_0^i}log(\pi_i)+\sum_{t=0}^{T-1} \sum_{i,j=1}^M {q_t^i q_{t+1}^j} log(a_{ij})+\sum_{t=0}^{T}log(P(y_t|q_t)) </math></center><br />
Then we take the expectation for the E-Step:<br />
<center><math> E[l_c(\theta; q, y)] = \sum_{i=1}^M E[q_0^i]log(\pi_i)+\sum_{t=0}^{T-1} \sum_{i,j=1}^M E[q_t^i q_{t+1}^j] log(a_{ij})+\sum_{t=0}^{T}E[log(P(y_t|q_t))] </math></center><br />
If we continue with our multinomial example then we would get:<br />
<center><math> \sum_{t=0}^{T}E[log(P(y_t|q_t))] = \sum_{t=0}^{T}\sum_{i,j=1}^M E[q_t^j] y_t^i log(\eta_{ij}) </math></center><br />
So now we need to calculate <math>E[q_0^i]</math> and <math> E[q_t^i q_{t+1}^j] </math> in order to find the expectation of the log likelihood. Let us define a variable for each of these quantities. <br /><br />
Let <math> \gamma_0^i = E[q_0^i] = P(q_0^i=1|y, \theta^{(t)}) </math>. <br /><br />
Let <math> \xi_{t,t+1}^{ij} = E[q_t^i q_{t+1}^j] = P(q_t^i = 1, q_{t+1}^j = 1|y, \theta^{(t)}) </math>. <br /><br />
We could use the sum product algorithm to calculate the above equations.<br />
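The sum-product computation can be sketched for an HMM as the forward-backward recursions, which yield the marginals <math>\gamma</math> needed above. The two-state model, its parameters <math>(\pi, A, \eta)</math>, and the observation sequence below are all illustrative.<br />

```python
# Illustrative 2-state HMM; observations are symbol indices.
pi  = [0.6, 0.4]                          # initial distribution over states
A   = [[0.7, 0.3], [0.2, 0.8]]            # A[i][j] = P(q_{t+1}=j | q_t=i)
eta = [[0.9, 0.1], [0.3, 0.7]]            # eta[i][k] = P(y_t=k | q_t=i)
y   = [0, 0, 1, 0]                        # observed sequence

M, T = len(pi), len(y)

# forward pass: alpha[t][i] = P(y_0..y_t, q_t = i)
alpha = [[pi[i] * eta[i][y[0]] for i in range(M)]]
for t in range(1, T):
    alpha.append([eta[j][y[t]] * sum(alpha[t-1][i] * A[i][j] for i in range(M))
                  for j in range(M)])

# backward pass: beta[t][i] = P(y_{t+1}..y_{T-1} | q_t = i)
beta = [[1.0] * M for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * eta[j][y[t+1]] * beta[t+1][j] for j in range(M))
               for i in range(M)]

evidence = sum(alpha[T-1][i] for i in range(M))         # P(y)
gamma = [[alpha[t][i] * beta[t][i] / evidence for i in range(M)]
         for t in range(T)]                             # gamma[t][i] = P(q_t=i | y)

print([round(g, 3) for g in gamma[0]])   # E[q_0^i], the quantity needed above
```

The pairwise expectations <math>\xi</math> follow the same pattern, combining <math>\alpha_t</math>, the transition and emission terms, and <math>\beta_{t+1}</math>.<br />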
<br />
==Graph Structure==<br />
Up to this point, we have covered many topics about graphical models, assuming that the graph structure is given. However, finding an optimal structure for a graphical model is a challenging problem all by itself. In this section, we assume that the graphical model we are looking for is expressible in the form of a tree. To remind ourselves of the concept of a tree: an undirected graph is a tree if there is one and only one path between each pair of nodes. For directed graphs, on top of the above condition, we also need to check that every node has at most one parent, in other words, that there are no explaining-away kinds of structures.<br />
<br />
First, let us show that whether a graph is directed or undirected does not affect the joint distribution function, as long as it is a tree. Here is how one can write down the joint distribution of the graph of Fig. XX.<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2).\,\!<br />
</math></center><br />
<br />
Now, if we change the direction of the connecting edge between <math>x_1</math> and <math>x_2</math>, we will have the graph of Fig. XX and the corresponding joint distribution function will change as follows:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_2)p(x_1|x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which can be simply re-written as:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1,x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which is the same as the first function. We will rely on this very simple observation and leave the proof to the enthusiastic reader.<br />
<br />
===Maximum Likelihood Tree===<br />
We want to compute the tree that maximizes the likelihood for a given set of data. Optimality of a tree structure can be discussed in terms of likelihood of the set of variables. By doing so, we can define a fully connected, weighted graph by setting the edge weights to the likelihood of the occurrence of the connecting nodes/random variables and then by running the maximum weight spanning tree. Here is how it works.<br />
<br />
We have defined the joint distribution as follows: <br />
<center><math><br />
p(x)=\prod_{i\in V}p(x_i)\prod_{i,j\in E}\frac{p(x_i,x_j)}{p(x_i)p(x_j)}<br />
</math></center><br />
Where <math>V</math> and <math>E</math> are respectively the sets of vertices and edges of the corresponding graph. This factorization holds for tree-structured graphical models, since the direction of the dependence between <math>x_i</math> and <math>x_j</math> can be chosen arbitrarily; this is not the case for non-tree graphical models.<br />
<br />
Maximizing the joint probability distribution over the given set of data samples <math>X</math> with the objective of parameter estimation we will have (MLE):<br />
<center><math><br />
L(\theta|X):p(X|\theta)=\prod_{i\in V}p(x_i|\theta)\prod_{i,j\in E}\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
And by taking the logarithm of <math>L(\theta|X)</math> (log-likelihood), we will get:<br />
<br />
<center><math><br />
l=\sum_{i\in V}\log p(x_i|\theta)+\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
The first term in the above equation does not convey anything about the topology or the structure of the tree as it is defined over single nodes. As much as the optimization of the tree structure is concerned, the probability of the single nodes may not play any role in the optimization, so we can define the cost function for our optimization problem as such:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
Where the subscript r stands for ''reduced''. By replacing the probability functions with the frequencies of occurrence of each state, we will have:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\sum_{s,t}N_{ijst}\log\frac{N\,N_{ijst}}{N_{is}N_{jt}}<br />
</math></center><br />
<br />
Where we have assumed that <math>p(x_i=s,x_j=t)=\frac{N_{ijst}}{N}</math>, <math>p(x_i=s)=\frac{N_{is}}{N}</math>, and <math>p(x_j=t)=\frac{N_{jt}}{N}</math>. Up to a constant factor of <math>N</math>, the resulting expression for each edge is the empirical mutual information of the two random variables <math>x_i</math> and <math>x_j</math>, summed over their states <math>s</math> and <math>t</math>.<br />
<br />
This shows how to define the weights for the edges of a fully connected graph. Now we run a maximum weight spanning tree algorithm on the resulting graph to find the optimal structure for the tree.<br />
It is important to note that this problem had been solved in graph theory before graphical models were developed. Our problem here was a completely probabilistic one, but using graphical models we could find an equivalent graph theory problem. This shows how graphical models let us use powerful graph theory tools to solve probabilistic problems.<br />
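The whole procedure can be sketched in a few lines: weight every candidate edge by the empirical mutual information, then run a maximum weight spanning tree. Kruskal's algorithm is used here as one standard spanning-tree choice, and the tiny binary data set is made up.<br />

```python
import math
from itertools import combinations

# Illustrative binary observations of 3 variables (rows are samples)
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1), (1, 1, 1), (0, 1, 0)]
n_vars, N = 3, len(data)

def mutual_info(i, j):
    # empirical mutual information from the counts N_ijst, N_is, N_jt
    mi = 0.0
    for s in (0, 1):
        for t in (0, 1):
            n_ij = sum(1 for row in data if row[i] == s and row[j] == t)
            n_i  = sum(1 for row in data if row[i] == s)
            n_j  = sum(1 for row in data if row[j] == t)
            if n_ij:
                mi += (n_ij / N) * math.log(N * n_ij / (n_i * n_j))
    return mi

# Kruskal's maximum weight spanning tree with a simple union-find
edges = sorted(combinations(range(n_vars), 2),
               key=lambda e: mutual_info(*e), reverse=True)
parent = list(range(n_vars))
def find(u):
    while parent[u] != u:
        u = parent[u]
    return u

tree = []
for (u, v) in edges:
    ru, rv = find(u), find(v)
    if ru != rv:           # adding this edge does not create a cycle
        parent[ru] = rv
        tree.append((u, v))

print(tree)    # the edges of the maximum likelihood tree
```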
<br />
==Latent Variable Models==<br />
(beginning of Oct. 20) Assuming that we have thoroughly observed, or even identified, all of the random variables of a model can be a very naive assumption, as one can think of many contrary instances. To make a model as rich as possible (there is always a trade-off between richness and complexity, so we do not want to inject unnecessary complexity into the model either), the concept of latent variables has been introduced to graphical models.<br />
<br />
First let's define latent variables. Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.<br />
<br />
Depending on the position of an unobserved variable <math>z</math>, we take different actions. If no variable is conditioned on <math>z</math>, we can integrate/sum it out and it will never be noticed, as it is neither evidence nor a query. However, we will need to model an unobserved variable like <math>z</math> if other variables depend on it.<br />
<br />
The use of latent variables makes a model harder to analyze and to learn. The log-likelihood normally makes the objective function easier to handle, as the log of a product becomes a sum of logs, but this is no longer the case once latent variables are introduced: the resulting joint probability contains a sum inside the log, which prevents the log from distributing over the product.<br />
<br />
<center><math><br />
l(\theta,D) = \log\sum_{z}p(x,z|\theta).\,<br />
</math></center><br />
<br />
As an example of latent variables, consider a mixture density model: several component models come together to build the final model, and it takes one more random variable to say which component generated each new sample point. This affects both the learning and recall phases.<br />
<br />
== EM Algorithm ==<br />
Oct. 25th<br />
=== Introduction ===<br />
In the last section graphical models with latent variables were discussed. It was mentioned that, for example, if fitting a standard distribution to a data set is too complex, one may model the data set using a mixture of standard distributions such as Gaussians. A hidden variable is then needed to determine the weight of each Gaussian component. Parameter learning in graphical models with latent variables is more complicated than in models with no latent variable.<br />
<br />
Consider Fig.40, which depicts a simple graphical model with two nodes. Following convention, the unobserved variable <math> Z </math> is unshaded. To compare the complexity of fully observed models with that of models with hidden variables, let us first suppose the variables <math> Z </math> and <math> X </math> are both observed. We may interpret this problem as a classification problem where <math> Z </math> is the class label and <math> X </math> is the data. In addition, we assume the distribution over the members of each group is Gaussian. Thus, the learning process is to determine the label <math> Z </math> from the training set by maximizing the posterior: <br />
<br />
[[File:GMwithLatent.png|thumb|right|Fig.40 A simple graphical model with a latent variable.]]<br />
<br />
<center><math><br />
P(z|x) = \frac{P(x|z)P(z)}{P(x)},<br />
</math></center> <br />
<br />
For simplicity, we assume there are two classes generating the data set <math> X</math>, <math> Z = 1 </math> and <math> Z = 0 </math>. The posterior <math> P(z=1|x) </math> can be easily computed using:<br />
<br />
<center><math><br />
P(z = 1|x) = \frac{N(x; \mu_1, \sigma_1)\pi_1}{N(x; \mu_1, \sigma_1)\pi_1 + N(x; \mu_0, \sigma_0)\pi_0},<br />
</math></center> <br />
<br />
On the contrary, if <math> Z </math> is unknown we are not able to easily write the posterior, and consequently parameter estimation is more difficult. In the case of graphical models with latent variables, we first assume the latent variable is somehow known, so that writing the posterior becomes easy. Then we make the estimate of <math> Z </math> more accurate. For instance, if the task is to fit a set of data derived from unknown sources with a mixture of Gaussian distributions, we may assume the data is derived from two sources whose distributions are Gaussian. The first estimate might not be accurate, yet we introduce an algorithm by which the estimate becomes more accurate through an iterative approach. In this section we see how parameter learning for these graphical models is performed using the EM algorithm.<br />
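The two-class posterior above can be sketched directly. The component parameters and mixing weights below are illustrative; note that the numerator carries its mixing weight <math>\pi_1</math>, matching the denominator.<br />

```python
import math

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) at x
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def responsibility(x, mu0, s0, pi0, mu1, s1, pi1):
    # P(z = 1 | x) for a two-component Gaussian mixture
    n0 = normal_pdf(x, mu0, s0) * pi0
    n1 = normal_pdf(x, mu1, s1) * pi1
    return n1 / (n0 + n1)

# a point halfway between two symmetric, equally weighted components is ambiguous
print(responsibility(0.0, -1.0, 1.0, 0.5, 1.0, 1.0, 0.5))   # 0.5 by symmetry
```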
<br />
=== EM Method ===<br />
<br />
The EM (Expectation-Maximization) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. Consider a probabilistic model in which we collectively denote all of the observed variables by X and all of the hidden variables by Z, resulting in a simple graphical model with two nodes (Fig. 40). The joint distribution<br />
<math> p(X,Z|\theta) </math> is governed by a set of parameters, <math>\theta</math>. The task is to maximize the likelihood function, which is given by:<br />
<br />
<center><math><br />
l_c(\theta; x,z) = log P(x,z | \theta)<br />
</math></center> <br />
<br />
<br />
which is called the "complete log likelihood". In the above equation the x values represent the data, as before, and the z values represent the missing data (sometimes called latent data). Now the question is how we can calculate the values of the parameters <math>\theta_i</math> if we do not have all the data we need. We can use the Expectation-Maximization (EM) algorithm to estimate the parameters of the model even though we do not have a complete data set. <br />
To simplify the problem we define the following type of likelihood:<br />
<br />
<center><math><br />
l(\theta; x) = log(P(x | \theta))<br />
</math></center> <br />
<br />
which is called the "incomplete log likelihood". We can rewrite the incomplete likelihood in terms of the complete likelihood. The equation below is for the discrete case; to convert to the continuous case all we have to do is turn the summation into an integral. <br />
<center><math> l(\theta; x) = log(P(x | \theta)) = log(\sum_zP(x, z|\theta)) </math></center><br />
Since z has not been observed, <math>l_c</math> is in fact a random quantity. In that case we can take the expectation of <math>l_c</math> with respect to some arbitrary density function <math>q(z|x)</math>. <br />
<br />
<center><math> l(\theta;x) = log P(x|\theta) = log \sum_z P(x,z|\theta) = log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} \geqslant \sum_z q(z|x)log\frac{P(x, z|\theta)}{q(z|x)} </math></center><br />
<br />
====Jensen's Inequality====<br />
In order to properly derive the formula for the EM algorithm we need to first introduce the following theorem. <br />
<br />
For any '''concave''' function f: <br />
<center><math> f(\alpha x_1 + (1-\alpha)x_2) \geqslant \alpha f(x_1) + (1-\alpha)f(x_2) </math></center> <br />
This can be shown intuitively through a graph. In (Fig. 41) point A is the value of the function f at the convex combination and point B is the value represented by the right side of the inequality. On the graph one can see why point A lies above point B for a concave function; for a convex function the inequality is reversed and A lies below B. <br />
<br />
[[File:inequality.png|thumb|right|Fig.41 Jensen's Inequality]]<br />
<br />
For us it is important that the log function is '''concave''' , and thus:<br />
<br />
<center><math><br />
log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} \geqslant \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} = F(\theta, q) <br />
</math></center><br />
<br />
The function <math> F (\theta, q) </math> is called the auxiliary function and is used in the EM algorithm. As seen in the above equation, <math> F(\theta, q) </math> is a lower bound on the incomplete log likelihood, and one way to maximize the incomplete likelihood is to increase this lower bound. The EM algorithm alternates between two steps, each giving a better estimate of <math>q(z|x)</math> and <math>\theta</math> respectively. As the steps are repeated the parameters converge to a local maximum of the likelihood function. <br />
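The bound can be checked numerically. Below is a minimal Python sketch (not part of the notes) using a made-up two-state joint <math>P(x,z|\theta)</math>: any choice of <math>q(z|x)</math> gives <math>F(\theta,q) \leq l(\theta;x)</math>, with equality when <math>q</math> is the posterior <math>P(z|x,\theta)</math>.

```python
import math

# Toy joint P(x, z | theta) over two latent states z (made-up values for illustration).
p_xz = {0: 0.12, 1: 0.28}                      # P(x, z=0), P(x, z=1) for the observed x
incomplete_ll = math.log(sum(p_xz.values()))   # l(theta; x) = log P(x | theta)

def lower_bound(q):
    """Auxiliary function F(theta, q) = sum_z q(z|x) log(P(x,z|theta)/q(z|x))."""
    return sum(q[z] * math.log(p_xz[z] / q[z]) for z in p_xz)

# Any distribution q(z|x) gives F <= l(theta; x) ...
for q in [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}]:
    assert lower_bound(q) <= incomplete_ll + 1e-12

# ... and the bound is tight at q(z|x) = P(z|x, theta), the E-step choice.
posterior = {z: p / sum(p_xz.values()) for z, p in p_xz.items()}
assert abs(lower_bound(posterior) - incomplete_ll) < 1e-12
```

The final assertion is exactly the E-step claim proven later: plugging the posterior into the auxiliary function recovers the incomplete log likelihood.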
<br />
In the first step we assume <math> \theta </math> is known, and the goal is to find the <math> q </math> that maximizes the lower bound. In the second step, suppose <math> q </math> is known and find the maximizing <math> \theta </math>. In other words:<br />
<br />
'''E-Step''' <br />
<center><math> q^{t+1} = argmax_{q} F(\theta^t, q) </math></center><br />
<br />
'''M-Step'''<br />
<center><math> \theta^{t+1} = argmax_{\theta} F(\theta, q^{t+1}) </math></center><br />
<br />
==== M-Step Explanation ====<br />
<br />
<center><math>\begin{matrix}<br />
F(q;\theta) & = & \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} \\<br />
& = & \sum_z q(z|x)log(P(x,z|\theta)) - \sum_z q(z|x)log(q(z|x))\\<br />
\end{matrix}</math></center><br />
<br />
Since the second part of the equation is only a constant with respect to <math>\theta</math>, in the M-step we only need to maximize the expectation of the COMPLETE likelihood. The complete likelihood is the only part that still depends on <math>\theta</math>.<br />
<br />
==== E-Step Explanation ====<br />
<br />
In this step we are trying to find an estimate for <math>q(z|x)</math>. To do this we have to maximize <math> F(q;\theta^{(t)})</math>.<br />
<center><math><br />
F(q;\theta^{t}) = \sum_z q(z|x) log(\frac{P(x,z|\theta)}{q(z|x)}) <br />
</math></center><br />
<br />
'''Claim:''' It can be shown that to maximize the auxiliary function one should set <math>q(z|x)</math> to <math> p(z|x,\theta^{(t)})</math>. Replacing <math>q(z|x)</math> with <math>P(z|x,\theta^{(t)})</math> results in:<br />
<center><math>\begin{matrix}<br />
F(q;\theta^{t}) & = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(x,z|\theta)}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(z|x,\theta^{(t)})P(x|\theta^{(t)})}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(P(x|\theta^{(t)})) \\<br />
& = & log(P(x|\theta^{(t)})) \\<br />
& = & l(\theta; x)<br />
\end{matrix}</math></center><br />
<br />
Recall that <math>F(q;\theta^{(t)})</math> is a lower bound on <math> l(\theta; x) </math>; since the choice <math>q(z|x) = P(z|x,\theta^{(t)})</math> attains the bound with equality, it is in fact the maximizer of <math>F(q;\theta)</math>. Therefore the E-Step expression needs to be derived only once, as a function of <math>\theta^{(t)}</math>, and can then be re-evaluated before each iteration of the M-Step. <br />
<br />
The EM algorithm is a two-stage iterative optimization technique for finding<br />
maximum likelihood solutions. Suppose that the current value of the parameter vector is <math> \theta^t </math>. In the E step, the<br />
lower bound <math> F(q, \theta^t) </math> is maximized with respect to <math> q(z|x) </math> while <math> \theta^t </math> is fixed.<br />
As mentioned above, the solution to this maximization problem is to set <math> q(z|x) </math> to <math> p(z|x,\theta^t) </math>: the incomplete log likelihood <math> log \, p(x|\theta^t) </math> does not depend on <math> q(z|x) </math>, and this choice of <math> q </math> attains the largest possible value of <math> F(q, \theta^t) </math>. In this case the lower bound equals the incomplete log likelihood.<br />
<br />
=== Alternative steps for the EM algorithms ===<br />
From the above results we can find an alternative representation of the EM algorithm, reducing it to: <br />
<br />
'''E-Step''' <br /><br />
Derive <math> E[l_c(\theta; x, z)]_{P(z|x, \theta^{(t)})} </math> as a function of <math>\theta</math>; this expression only needs to be derived once. <br /><br />
'''M-Step''' <br /><br />
Maximise <math> E[l_c(\theta; x, z)]_{P(z|x, \theta^{(t)})} </math> with respect to <math>\theta</math>. <br />
<br />
The EM Algorithm is probably best understood through examples.<br />
<br />
====EM Algorithm Example====<br />
<br />
Suppose we have the two independent and identically distributed random variables:<br />
<center><math> Y_1, Y_2 \sim P(y|\theta) = \theta e^{-\theta y} </math></center><br />
In our case <math>y_1 = 5</math> has been observed but <math>y_2</math> has not. Our task is to find an estimate for <math>\theta</math>. We will first try to solve the problem without the EM algorithm; luckily this problem is simple enough to be solvable directly. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta} \\<br />
l(\theta; Data) & = & log(\theta)- 5\theta<br />
\end{matrix}</math></center><br />
We take our derivative:<br />
<center><math>\begin{matrix}<br />
& \frac{dl}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{1}{\theta}-5 & = 0 \\<br />
\Rightarrow & \theta & = 0.2<br />
\end{matrix}</math></center><br />
And now we can try the same problem with the EM Algorithm. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta}\theta e^{-y_2\theta} \\<br />
l(\theta; Data) & = & 2log(\theta) - 5\theta - y_2\theta<br />
\end{matrix}</math></center><br />
E-Step <br />
<center><math> E[l_c(\theta; Data)]_{P(y_2|y_1, \theta)} = 2log(\theta) - 5\theta - \frac{\theta}{\theta^{(t)}}</math></center><br />
M-Step<br />
<center><math>\begin{matrix}<br />
& \frac{dl_c}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{2}{\theta}-5 - \frac{1}{\theta^{(t)}} & = 0 \\<br />
\Rightarrow & \theta^{(t+1)} & = \frac{2\theta^{(t)}}{5\theta^{(t)}+1}<br />
\end{matrix}</math></center><br />
Now we pick an initial value for <math>\theta</math>. Usually we want to pick something reasonable. In this case it does not matter that much and we can pick <math>\theta = 10</math>. Now we repeat the M-Step until the value converges.<br />
<center><math>\begin{matrix}<br />
\theta^{(1)} & = & 10 \\<br />
\theta^{(2)} & = & 0.392 \\<br />
\theta^{(3)} & = & 0.2648 \\<br />
... & & \\<br />
\theta^{(k)} & \simeq & 0.2 <br />
\end{matrix}</math></center><br />
And as we can see after a number of steps the value converges to the correct answer of 0.2. In the next section we will discuss a more complex model where it would be difficult to solve the problem without the EM Algorithm.<br />
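The iteration above is easy to run. Here is a short Python sketch (not from the notes, though the update rule <math>\theta^{(t+1)} = 2\theta^{(t)}/(5\theta^{(t)}+1)</math> and the starting value 10 are):

```python
# EM update for the exponential example: theta_{t+1} = 2*theta_t / (5*theta_t + 1).
def em_update(theta):
    return 2 * theta / (5 * theta + 1)

theta = 10.0            # deliberately poor starting value, as in the notes
for _ in range(50):
    theta = em_update(theta)

print(round(theta, 4))  # -> 0.2, the same answer as direct maximum likelihood
```

Note that 0.2 is a fixed point of the update: solving <math>\theta = 2\theta/(5\theta+1)</math> gives <math>5\theta + 1 = 2</math>, i.e. <math>\theta = 0.2</math>.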
<br />
===Mixture Models===<br />
In this section we discuss what will happen if the random variables are not identically distributed. The data will now sometimes be sampled from one distribution and sometimes from another. <br />
<br />
====Mixture of Gaussian ====<br />
<br />
Given <math>P(x|\theta) = \alpha N(x;\mu_1,\sigma_1) + (1-\alpha)N(x;\mu_2,\sigma_2)</math>. We sample the data, <math>Data = \{x_1,x_2...x_n\} </math> and we know that <math>x_1,x_2...x_n</math> are iid. from <math>P(x|\theta)</math>.<br /><br />
We would like to find:<br />
<center><math>\theta = \{\alpha,\mu_1,\sigma_1,\mu_2,\sigma_2\} </math></center><br />
<br />
We have no missing data here so we can try to find the parameter estimates using the ML method. <br />
<center><math> L(\theta; Data) = \prod_{i=1}^n (\alpha N(x_i, \mu_1, \sigma_1) + (1 - \alpha) N(x_i, \mu_2, \sigma_2)) </math></center><br />
We then need to take the log to find <math>l(\theta; Data)</math>, take the derivative with respect to each parameter, and set each derivative equal to zero. That is a lot of work, because the Gaussian is not a nice distribution to work with and we have 5 parameters. <br /><br />
It is actually easier to apply the EM algorithm. The catch is that the EM algorithm works with missing data, and here we have all of our data. The solution is to introduce a latent variable z: we deliberately introduce missing data to make the calculation easier. <br />
<center><math> z_i = 1 \text{ with prob. } \alpha </math></center><br />
<center><math> z_i = 0 \text{ with prob. } (1-\alpha) </math></center><br />
Now we have a data set that includes our latent variable <math>z_i</math>:<br />
<center><math> Data = \{(x_1,z_1),(x_2,z_2)...(x_n,z_n)\} </math></center><br />
We can calculate the joint pdf by: <br />
<center><math> P(x_i,z_i|\theta)=P(x_i|z_i,\theta)P(z_i|\theta) </math></center><br />
Let<br />
<center><math> P(x_i|z_i,\theta)=
\begin{cases}
\phi_1(x_i)=N(x;\mu_1,\sigma_1) & \mbox{if } z_i = 1 \\
\phi_2(x_i)=N(x;\mu_2,\sigma_2) & \mbox{if } z_i = 0
\end{cases} </math></center><br />
Now we can write <br />
<center><math> P(x_i|z_i,\theta)=\phi_1(x_i)^{z_i} \phi_2(x_i)^{1-z_i} </math></center><br />
and <br />
<center><math> P(z_i)=\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
We can write the joint pdf as:<br />
<center><math> P(x_i,z_i|\theta)=\phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
From the joint pdf we can get the likelihood function as: <br />
<center><math> L(\theta;D)=\prod_{i=1}^n \phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
Then take the log and find the log likelihood:<br />
<center><math> l_c(\theta;D)=\sum_{i=1}^n z_i log\phi_1(x_i) + (1-z_i)log\phi_2(x_i) + z_ilog\alpha + (1-z_i)log(1-\alpha) </math></center><br />
In the E-step we need to find the expectation of <math>l_c</math><br />
<center><math> E[l_c(\theta;D)] = \sum_{i=1}^n E[z_i]log\phi_1(x_i)+(1-E[z_i])log\phi_2(x_i)+E[z_i]log\alpha+(1-E[z_i])log(1-\alpha) </math></center><br />
For now we can assume that <math>E[z_i]</math> is known and assign it a value: let <math> E[z_i]=w_i</math>.<br /><br />
In the M-step, we update the parameter estimates while holding the expectation fixed:<br />
<center><math> \theta^{(t+1)} \leftarrow argmax_{\theta} E[l_c(\theta;D)] </math></center><br />
Taking partial derivatives of the expected complete log likelihood with respect to the parameters and setting them equal to zero, we get the estimated parameters at step (t+1).<br />
<center><math>\begin{matrix}<br />
\frac{d}{d\alpha} = 0 \Rightarrow & \sum_{i=1}^n \frac{w_i}{\alpha}-\frac{1-w_i}{1-\alpha} = 0 & \Rightarrow \alpha=\frac{\sum_{i=1}^n w_i}{n} \\<br />
\frac{d}{d\mu_1} = 0 \Rightarrow & \sum_{i=1}^n w_i(x_i-\mu_1)=0 & \Rightarrow \mu_1=\frac{\sum_{i=1}^n w_ix_i}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\mu_2}=0 \Rightarrow & \sum_{i=1}^n (1-w_i)(x_i-\mu_2)=0 & \Rightarrow \mu_2=\frac{\sum_{i=1}^n (1-w_i)x_i}{\sum_{i=1}^n (1-w_i)} \\<br />
\frac{d}{d\sigma_1^2} = 0 \Rightarrow & \sum_{i=1}^n w_i(-\frac{1}{2\sigma_1^{2}}+\frac{(x_i-\mu_1)^2}{2\sigma_1^4})=0 & \Rightarrow \sigma_1^2=\frac{\sum_{i=1}^n w_i(x_i-\mu_1)^2}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\sigma_2^2} = 0 \Rightarrow & \sum_{i=1}^n (1-w_i)(-\frac{1}{2\sigma_2^{2}}+\frac{(x_i-\mu_2)^2}{2\sigma_2^4})=0 & \Rightarrow \sigma_2^2=\frac{\sum_{i=1}^n (1-w_i)(x_i-\mu_2)^2}{\sum_{i=1}^n (1-w_i)}<br />
\end{matrix}</math></center><br />
We can verify that the estimated parameters all make sense by comparing them with the ML estimates for a single Gaussian. But we are not done yet: we still need to compute <math>E[z_i]=w_i</math> in the E-step. <br />
<center><math>\begin{matrix}<br />
E[z_i] & = & E_{z_i|x_i,\theta^{(t)}}(z_i) \\<br />
& = & \sum_z z_i P(z_i|x_i,\theta^{(t)}) \\<br />
& = & 1\times P(z_i=1|x_i,\theta^{(t)}) + 0\times P(z_i=0|x_i,\theta^{(t)}) \\<br />
& = & P(z_i=1|x_i,\theta^{(t)}) \\<br />
P(z_i=1|x_i,\theta^{(t)}) & = & \frac{P(z_i=1,x_i|\theta^{(t)})}{P(x_i|\theta^{(t)})} \\<br />
& = & \frac {P(z_i=1,x_i|\theta^{(t)})}{P(z_i=1,x_i|\theta^{(t)}) + P(z_i=0,x_i|\theta^{(t)})} \\<br />
& = & \frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})}<br />
\end{matrix}</math></center><br />
We can now combine the two steps and we get the expectation <br />
<center><math>E[z_i] =\frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})} </math></center><br />
Using the above results for the estimated parameters in the M-step we can evaluate the parameters at (t+2),(t+3)...until they converge and we get our estimated value for each of the parameters.<br />
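The E-step and M-step formulas above can be put together in code. The following Python sketch (an illustration, not part of the notes) generates synthetic data from a known two-component mixture and runs the updates; the seed, sample size, and initial guesses are arbitrary choices:

```python
import math
import random

random.seed(0)
# Synthetic data from a known two-component mixture: alpha = 0.5, N(0,1) and N(5,1).
data = [random.gauss(0.0, 1.0) if random.random() < 0.5 else random.gauss(5.0, 1.0)
        for _ in range(400)]

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Initial guesses; starting the means at the data extremes fixes the component labels.
alpha, mu1, sigma1, mu2, sigma2 = 0.5, min(data), 1.0, max(data), 1.0

for _ in range(50):
    # E-step: w_i = P(z_i = 1 | x_i, theta^(t)), the posterior responsibility.
    w = [alpha * normal_pdf(x, mu1, sigma1)
         / (alpha * normal_pdf(x, mu1, sigma1) + (1 - alpha) * normal_pdf(x, mu2, sigma2))
         for x in data]
    # M-step: the closed-form updates derived above.
    n1 = sum(w)
    n2 = len(data) - n1
    alpha = n1 / len(data)
    mu1 = sum(wi * xi for wi, xi in zip(w, data)) / n1
    mu2 = sum((1 - wi) * xi for wi, xi in zip(w, data)) / n2
    sigma1 = math.sqrt(sum(wi * (xi - mu1) ** 2 for wi, xi in zip(w, data)) / n1)
    sigma2 = math.sqrt(sum((1 - wi) * (xi - mu2) ** 2 for wi, xi in zip(w, data)) / n2)

# The fitted parameters land close to the generating values (0, 1), (5, 1), alpha = 0.5.
```

With well-separated components like these, the fitted means, variances, and mixing weight land close to the values used to generate the data.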
<br />
<br />
The mixture model can be summarized as:<br />
<br />
* In each step, a state will be selected according to <math>p(z)</math>. <br />
* Given a state, a data vector is drawn from <math>p(x|z)</math>.<br />
* The state selected at each step is independent of the previously selected states.<br />
<br />
A good example of a mixture model can be seen in the following experiment with two coins. Assume that there are two coins that are not fair, with the probabilities of heads and tails shown in the table. <br />
<center><math>\begin{matrix}
 & H & T \\
\mbox{coin1} & 0.3 & 0.7 \\
\mbox{coin2} & 0.1 & 0.9
\end{matrix}</math></center><br />
We can choose one coin at random and toss it in the air to see the outcome. Then we place the coin back in the pocket with the other one and once again select one coin at random to toss. The resulting sequence of outcomes, e.g. HHTH...HTTHT, follows a mixture model. In this model the probability of each outcome depends on which coin was used to make the toss and on the probability with which we select each coin. For example, if we were to select coin1 most of the time then we would see more heads than if we were to choose coin2 most of the time.<br />
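The two-coin mixture is easy to simulate. In this Python sketch (illustrative; the uniform 50/50 coin choice is an assumption, since the notes do not fix the selection probabilities), the long-run frequency of heads approaches the mixture average <math>0.5 \times 0.3 + 0.5 \times 0.1 = 0.2</math>:

```python
import random

random.seed(1)
P_HEADS = {"coin1": 0.3, "coin2": 0.1}   # per-coin head probabilities from the table

def toss():
    coin = random.choice(["coin1", "coin2"])        # pick a coin uniformly at random
    return "H" if random.random() < P_HEADS[coin] else "T"

sequence = "".join(toss() for _ in range(100000))
freq = sequence.count("H") / len(sequence)
print(freq)  # close to the mixture average 0.2
```

Changing the selection probabilities shifts the head frequency between the two extremes 0.1 and 0.3, exactly as the text describes.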
<br />
[[File:dired.png|thumb|right|Fig.1 A directed graph.]]<br />
<br />
=Appendix: Graph Drawing Tools=<br />
===Graphviz===<br />
[http://www.graphviz.org/ Website]<br />
<br />
"Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains."<br />
<ref>http://www.graphviz.org/</ref><br />
<br />
There is a wiki extension developed, called Wikitex, which makes it possible to make use of this package in wiki pages. [http://wikisophia.org/wiki/Wikitex#Graph Here] is an example.<br />
<br />
===AISee===<br />
[http://www.aisee.com/ Website]<br />
<br />
AISee is a commercial graph visualization software. The free trial version has almost all the features of the full version except that it should not be used for commercial purposes.<br />
<br />
===TikZ===<br />
[http://www.texample.net/tikz/ Website]<br />
<br />
"TikZ and PGF are TeX packages for creating graphics programmatically. TikZ is build on top of PGF and allows you to create sophisticated graphics in a rather intuitive and easy manner." <ref><br />
http://www.texample.net/tikz/<br />
</ref><br />
<br />
===Xfig===<br />
"Xfig" is an open source drawing software used to create objects of various geometry. It can be installed on both windows and unix based machines. <br />
[http://www.xfig.org/ Website]</div>
<hr />
<div>==[[f11stat946EditorSignUp| Editor Sign Up]]==<br />
==[[f11Stat946presentation| Sign up for your presentation]]==<br />
==[[f11Stat946ass| Assignments]]==<br />
==Introduction==<br />
===Motivation===<br />
Graphical probabilistic models provide a concise representation of various probabilistic distributions that are found in many<br />
real world applications. Some interesting areas include medical diagnosis, computer vision, language, analyzing gene expression <br />
data, etc. A problem related to medical diagnosis is detecting and quantifying the causes of a disease. This question can<br />
be addressed through the graphical representation of relationships between various random variables (both observed and hidden),<br />
which is an efficient way of representing a joint probability distribution.<br />
<br />
Graphical models are excellent tools for reducing the computational load of probabilistic models. Suppose we want to model a binary image. If we have a 256 by 256 image then our distribution function has <math>2^{256*256}=2^{65536}</math> outcomes. Even very simple tasks such as marginalizing such a probability distribution over some variables can be computationally intractable, and the load grows exponentially with the number of variables. In practice, real world applications generally involve some kind of dependency or relation between the variables, and using such information can help us simplify the calculations. For example, for the same problem, if all the image pixels can be assumed to be independent, marginalization can be done easily. A good tool for depicting such relations is a graph. Using some rules we can represent a probability distribution uniquely by a graph, and then it is easier to study the graph instead of the probability distribution function (PDF). We can also take advantage of graph theory tools to design algorithms. Though it may seem simple, this approach simplifies the computations and, as mentioned, helps us solve many problems in different research areas.<br />
<br />
===Notation===<br />
<br />
We will begin with short section about the notation used in these notes.<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
===Example===<br />
Let <math>A = \{1,4\}</math>, so <math>X_A = \{X_1, X_4\}</math>; <math>A</math> is the set of indices for<br />
the r.v. <math>X_A</math>.<br /><br />
Also let <math>B = \{2\},\ X_B = \{X_2\}</math> so we can write<br />
<center><math>P( X_A | X_B ) = P( X_1 = x_1, X_4 = x_4 | X_2 = x_2 ).\,\!</math></center><br />
<br />
===Graphical Models===<br />
Graphical models provide a compact representation of the joint distribution, where vertices (nodes) V represent random variables and edges E represent dependencies between the variables. There are two forms of graphical models (directed and undirected). Directed graphical models (Figure 1) consist of nodes and arcs, where an arc indicates that the parent is an explanatory variable for the child. Undirected graphical models (Figure 2) are based on the assumption that two nodes, or two sets of nodes, are conditionally independent given their neighbours[http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 1].<br />
<br />
Similar types of analysis predate the area of probabilistic graphical models and its terminology. Bayesian network and belief network are earlier terms used to describe a directed acyclic graphical model, while Markov random field (MRF) and Markov network are earlier terms used to describe an undirected graphical model. Probabilistic graphical models unite some of the theory behind these older models and allow for more general distributions than were possible with the previous methods.<br />
<br />
[[File:directed.png|thumb|right|Fig.1 A directed graph.]]<br />
[[File:undirected.png|thumb|right|Fig.2 An undirected graph.]]<br />
<br />
We will use graphs in this course to represent the relationship between different random variables. <br />
<br />
====Directed graphical models (Bayesian networks)====<br />
<br />
In the case of directed graphs, the direction of the arrow indicates "causation". This assumption makes these networks useful for cases where we want to model causality, such as applications in computational biology and bioinformatics, where we study the effect of some variables on another variable. For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>A\,\!</math> "causes" <math>B\,\!</math>.<br />
<br />
In this case we must assume that our directed graphs are ''acyclic''. An example of an acyclic graphical model from medicine is shown in Figure 2a.<br />
[[File:acyclicgraph.png|thumb|right|Fig.2a Sample acyclic directed graph.]]<br />
<br />
Exposure to ionizing radiation (such as CT scans, X-rays, etc.) and to the environment might lead to gene mutations that eventually give rise to cancer. Figure 2a can be called a causation graph.<br />
<br />
If our causation graph contains a cycle then it would mean that for example:<br />
<br />
* <math>A</math> causes <math>B</math><br />
* <math>B</math> causes <math>C</math><br />
* <math>C</math> causes <math>A</math>, again. <br />
<br />
Clearly, this would confuse the order of the events. An example of a graph with a cycle can be seen in Figure 3. Such a graph could not be used to represent causation. The graph in Figure 4 does not have cycle and we can say that the node <math>X_1</math> causes, or affects, <math>X_2</math> and <math>X_3</math> while they in turn cause <math>X_4</math>.<br />
<br />
[[File:cyclic.png|thumb|right|Fig.3 A cyclic graph.]]<br />
[[File:acyclic.png|thumb|right|Fig.4 An acyclic graph.]]<br />
<br />
In directed acyclic graphical models each vertex represents a random variable; a random variable associated with one vertex is distinct from the random variables associated with other vertices. Consider the following example that uses boolean random variables. It is important to note that the variables need not be boolean and can indeed be discrete over a range or even continuous.<br />
<br />
Speaking about random variables, we can now refer to the relationship between random variables in terms of dependence. Therefore, the direction of the arrow indicates "conditional dependence". For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>B\,\!</math> "is dependent on" <math>A\,\!</math>.<br />
<br />
Note that if we do not have any conditional independence, the corresponding graph will be complete, i.e., all possible edges will be present, whereas if we have full independence our graph will have no edges. Between these two extremes lies a large class of graphs. Graphical models are most useful when the graph is sparse, i.e., when only a small number of edges are present. The topology of the graph is important, and later we will see examples in which we can use graph theory tools to solve probabilistic problems. This representation also makes it easier to model causality between variables in real world phenomena.<br />
<br />
====Example====<br />
<br />
In this example we will consider the possible causes for wet grass. <br />
<br />
The wet grass could be caused by rain, or a sprinkler. Rain can be caused by clouds. On the other hand one cannot say that clouds cause the use of a sprinkler. However, a dependency exists, because the presence of clouds does affect whether or not a sprinkler will be used: if there are more clouds there is a smaller probability that one will rely on a sprinkler to water the grass. As this example shows, the relationship between two variables can also act like a negative correlation. The corresponding graphical model is shown in Figure 5.<br />
<br />
[[File:wetgrass.png|thumb|right|Fig.5 The wet grass example.]]<br />
<br />
This directed graph shows the relation between the 4 random variables. If we have<br />
the joint probability <math>P(C,R,S,W)</math>, then we can answer many queries about this<br />
system.<br />
<br />
This all seems very simple at first but then we must consider the fact that in the discrete case the joint probability function grows exponentially with the number of variables. If we consider the wet grass example once more we can see that we need to define <math>2^4 = 16</math> different probabilities for this simple example. The table below, which contains all of the probabilities and their corresponding boolean values for each random variable, is called an ''interaction table''.<br />
<br />
'''Example:'''<br />
<center><math>\begin{matrix}<br />
P(C,R,S,W):\\<br />
p_1\\<br />
p_2\\<br />
p_3\\<br />
.\\<br />
.\\<br />
.\\<br />
p_{16} \\ \\<br />
\end{matrix}</math></center><br />
<br /><br /><br />
<center><math>\begin{matrix}<br />
~~~ & C & R & S & W \\<br />
& 0 & 0 & 0 & 0 \\<br />
& 0 & 0 & 0 & 1 \\<br />
& 0 & 0 & 1 & 0 \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& 1 & 1 & 1 & 1 \\<br />
\end{matrix}</math></center><br />
<br />
Now consider an example where there are not 4 such random variables but 400. The interaction table would become too large to manage. In fact, it would require <math>2^{400}</math> rows! The purpose of the graph is to help avoid this intractability by considering only the variables that are directly related. In the wet grass example Sprinkler (S) and Rain (R) are not directly related. <br />
<br />
To solve the intractability problem we need to consider the way those relationships are represented in the graph. Let us define the following parameters. For each vertex <math>i \in V</math>,<br />
<br />
* <math>\pi_i</math>: is the set of parents of <math>i</math> <br />
** e.g. <math>\pi_R = \{C\}</math> (the parent of <math>R</math> is <math>C</math>) <br />
* <math>f_i(x_i, x_{\pi_i})</math>: is a factor depending on <math>x_i</math> and <math>x_{\pi_i}</math> for which it is true that:<br />
** <math>f_i</math> is nonnegative for all <math>i</math><br />
** <math>\displaystyle\sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math><br />
<br />
'''Claim''': There is a family of probability functions <math> P(X_V) = \prod_{i=1}^n f_i(x_i, x_{\pi_i})</math> where this function is nonnegative, and<br />
<center><math><br />
\sum_{x_1}\sum_{x_2}\cdots\sum_{x_n} P(X_V) = 1<br />
</math></center><br />
<br />
To show the power of this claim we can verify the normalization condition for our wet grass example:<br />
<center><math>\begin{matrix}<br />
P(X_V) &=& P(C,R,S,W) \\<br />
&=& f(C) f(R,C) f(S,C) f(W,S,R)<br />
\end{matrix}</math></center><br />
<br />
We want to show that<br />
<center><math>\begin{matrix}<br />
\sum_C\sum_R\sum_S\sum_W P(C,R,S,W) & = &\\<br />
\sum_C\sum_R\sum_S\sum_W f(C) f(R,C)<br />
f(S,C) f(W,S,R) <br />
& = & 1.<br />
\end{matrix}</math></center><br />
<br />
Consider factors <math>f(C)</math>, <math>f(R,C)</math>, <math>f(S,C)</math>: they do not depend on <math>W</math>, so we<br />
can write this all as<br />
<center><math>\begin{matrix}<br />
& & \sum_C\sum_R\sum_S f(C) f(R,C) f(S,C) \cancelto{1}{\sum_W f(W,S,R)} \\<br />
& = & \sum_C\sum_R f(C) f(R,C) \cancelto{1}{\sum_S f(S,C)} \\<br />
& = & \cancelto{1}{\sum_C f(C)} \cancelto{1}{\sum_R f(R,C)} \\<br />
& = & 1<br />
\end{matrix}</math></center><br />
<br />
since we had already set <math>\displaystyle \sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math>.<br />
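The telescoping sum above can also be checked by brute force. This Python sketch (the factor tables are made up, subject only to the condition <math>\sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math>) enumerates all 16 assignments of the wet-grass variables and confirms that the product of factors sums to 1:

```python
from itertools import product

# Made-up factors for the wet-grass graph: f(C), f(R,C), f(S,C), f(W,S,R).
# Each table sums to 1 over its first variable for every setting of the parents.
f_c = {0: 0.5, 1: 0.5}
f_r = {(0, 0): 0.9, (1, 0): 0.1,          # key (r, c): little rain without clouds
       (0, 1): 0.3, (1, 1): 0.7}
f_s = {(0, 0): 0.4, (1, 0): 0.6,          # key (s, c): sprinkler less likely with clouds
       (0, 1): 0.8, (1, 1): 0.2}
f_w = {(0, 0, 0): 1.0, (1, 0, 0): 0.0,    # key (w, s, r): dry grass if no rain, no sprinkler
       (0, 0, 1): 0.2, (1, 0, 1): 0.8,
       (0, 1, 0): 0.1, (1, 1, 0): 0.9,
       (0, 1, 1): 0.05, (1, 1, 1): 0.95}

def joint(c, r, s, w):
    """P(C,R,S,W) = f(C) f(R,C) f(S,C) f(W,S,R)."""
    return f_c[c] * f_r[(r, c)] * f_s[(s, c)] * f_w[(w, s, r)]

total = sum(joint(c, r, s, w) for c, r, s, w in product((0, 1), repeat=4))
assert abs(total - 1.0) < 1e-12   # the factorization defines a valid distribution
```

Only 18 numbers (with 9 free parameters) specify the whole distribution here, instead of the 16-row interaction table, and the saving grows dramatically with more variables.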
<br />
Let us consider another example with a different directed graph. <br /><br />
'''Example:'''<br /><br />
Consider the simple directed graph in Figure 6.<br />
<br />
[[File:1234.png|thumb|right|Fig.6 Simple 4 node graph.]]<br />
<br />
Assume that we would like to calculate the following: <math> p(x_3|x_2) </math>. We know that we can write the joint probability as:<br />
<center><math> p(x_1,x_2,x_3,x_4) = f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \,\!</math></center><br />
<br />
We can also make use of Bayes' Rule here: <br />
<br />
<center><math>p(x_3|x_2) = \frac{p(x_2,x_3)}{ p(x_2)}</math></center><br />
<br />
<center><math>\begin{matrix}<br />
p(x_2,x_3) & = & \sum_{x_1} \sum_{x_4} p(x_1,x_2,x_3,x_4) ~~~~ \hbox{(marginalization)} \\<br />
& = & \sum_{x_1} \sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1) f(x_3,x_2) \cancelto{1}{\sum_{x_4}f(x_4,x_3)} \\<br />
& = & f(x_3,x_2) \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
We also need<br />
<center><math>\begin{matrix}<br />
p(x_2) & = & \sum_{x_1}\sum_{x_3}\sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2)<br />
f(x_4,x_3) \\<br />
& = & \sum_{x_1}\sum_{x_3} f(x_1) f(x_2,x_1) f(x_3,x_2) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
Thus,<br />
<center><math>\begin{matrix}<br />
p(x_3|x_2) & = & \frac{ f(x_3,x_2) \sum_{x_1} f(x_1)<br />
f(x_2,x_1)}{ \sum_{x_1} f(x_1) f(x_2,x_1)} \\<br />
& = & f(x_3,x_2).<br />
\end{matrix}</math></center><br />
<br />
'''Theorem 1.'''<br />
<center><math>f_i(x_i,x_{\pi_i}) = p(x_i|x_{\pi_i}).\,\!</math></center><br />
<center><math> \therefore \ P(X_V) = \prod_{i=1}^n p(x_i|x_{\pi_i})\,\!</math></center>.<br />
<br />
In our simple graph, the joint probability can be written as <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1)p(x_2|x_1) p(x_3|x_2) p(x_4|x_3).\,\!</math></center><br />
<br />
Instead, had we used the chain rule we would have obtained a far more complex equation: <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1) p(x_2|x_1)p(x_3|x_2,x_1) p(x_4|x_3,x_2,x_1).\,\!</math></center><br />
<br />
The ''Markov Property'', or ''Memoryless Property'', says that <math>X_i</math> is affected only by its parent <math>X_j</math>, so given <math>X_j</math> the random variable <math>X_i</math> is independent of every other earlier random variable. In our example the history of <math>x_4</math> is completely captured by <math>x_3</math>. <br /><br />
By simply applying the Markov Property to the chain-rule formula we would have obtained the same result.<br />
<br />
Now let us consider the joint probability of the following six-node example found in Figure 7.<br />
<br />
[[File:ClassicExample1.png|thumb|right|Fig.7 Six node example.]]<br />
<br />
If we use Theorem 1 it can be seen that the joint probability density function for Figure 7 can be written as follows: <br />
<center><math> P(X_1,X_2,X_3,X_4,X_5,X_6) = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) \,\!</math></center><br />
<br />
Once again, we can apply the Chain Rule and then the Markov Property and arrive at the same result.<br />
<br />
<center><math>\begin{matrix}<br />
&& P(X_1,X_2,X_3,X_4,X_5,X_6) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_2,X_1)P(X_4|X_3,X_2,X_1)P(X_5|X_4,X_3,X_2,X_1)P(X_6|X_5,X_4,X_3,X_2,X_1) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) <br />
\end{matrix}</math></center><br />
<br />
===Independence=== <br />
<br />
====Marginal independence====<br />
We can say that <math>X_A</math> is marginally independent of <math>X_B</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B : & & \\<br />
P(X_A,X_B) & = & P(X_A)P(X_B) \\<br />
P(X_A|X_B) & = & P(X_A) <br />
\end{matrix}</math></center><br />
<br />
====Conditional independence====<br />
We can say that <math>X_A</math> is conditionally independent of <math>X_B</math> given <math>X_C</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B | X_C : & & \\<br />
P(X_A,X_B | X_C) & = & P(X_A|X_C)P(X_B|X_C) \\<br />
P(X_A|X_B,X_C) & = & P(X_A|X_C) <br />
\end{matrix}</math></center><br />
Note: Both equations are equivalent.<br />
'''Aside:''' Before we move on further, we first define the following terms:<br />
# <math>I</math> is defined as an ordering of the nodes in graph G.<br />
# For each <math>i \in V</math>, <math>V_i</math> is defined as the set of all nodes that appear earlier than i in the ordering, excluding its parents <math>\pi_i</math>.<br />
<br />
Let us consider the example of the six node figure given above (Figure 7). We can define <math>I</math> as follows: <br />
<center><math>I = \{1,2,3,4,5,6\} \,\!</math></center><br />
We can then easily compute <math>V_i</math> for say <math>i=3,6</math>. <br /><br />
<center><math> V_3 = \{2\}, V_6 = \{1,3,4\}\,\!</math></center> <br />
while <math>\pi_i</math> for <math> i=3,6</math> is: <br /><br />
<center><math> \pi_3 = \{1\}, \pi_6 = \{2,5\}\,\!</math></center> <br />
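These bookkeeping sets are easy to compute mechanically. The following sketch (a hypothetical helper, not from the original notes) recovers <math>V_i</math> from the ordering <math>I</math> and the parent sets of the six-node graph:<br />

```python
# Hypothetical sketch: computing the "earlier non-parent" sets V_i for
# the six-node graph of Fig. 7, given the ordering I = (1, 2, 3, 4, 5, 6).
parents = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: {3}, 6: {2, 5}}
order = [1, 2, 3, 4, 5, 6]

def nu(i):
    """V_i: nodes earlier than i in the ordering, excluding i's parents."""
    earlier = set(order[:order.index(i)])
    return earlier - parents[i]

print(nu(3))  # {2}
print(nu(6))  # {1, 3, 4}
```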
<br />
We would be interested in finding the conditional independence between random variables in this graph. We know <math>X_i \perp X_{v_i} | X_{\pi_i}</math> for each <math>i</math>. In other words, given its parents the node is independent of all earlier nodes. So:<br /><br />
<math>X_1 \perp \phi | \phi</math>, <br /><br />
<math>X_2 \perp \phi | X_1</math>, <br /><br />
<math>X_3 \perp X_2 | X_1</math>, <br /><br />
<math>X_4 \perp \{X_1,X_3\} | X_2</math>, <br /><br />
<math>X_5 \perp \{X_1,X_2,X_4\} | X_3</math>, <br /><br />
<math>X_6 \perp \{X_1,X_3,X_4\} | \{X_2,X_5\}</math> <br /><br />
To illustrate why this is true we can take a simple example. Show that:<br />
<center><math>P(X_4|X_1,X_2,X_3) = P(X_4|X_2)\,\!</math></center><br />
<br />
Proof: first, we know <br />
<math>P(X_1,X_2,X_3,X_4,X_5,X_6)<br />
= P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2)\,\!</math><br />
<br />
then<br />
<center><math>\begin{matrix}<br />
P(X_4|X_1,X_2,X_3) & = & \frac{P(X_1,X_2,X_3,X_4)}{P(X_1,X_2,X_3)}\\<br />
& = & \frac{ \sum_{X_5} \sum_{X_6} P(X_1,X_2,X_3,X_4,X_5,X_6)}{ \sum_{X_4} \sum_{X_5} \sum_{X_6}P(X_1,X_2,X_3,X_4,X_5,X_6)}\\<br />
& = & \frac{P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)}{P(X_1)P(X_2|X_1)P(X_3|X_1)}\\<br />
& = & P(X_4|X_2)<br />
\end{matrix}</math></center><br />
<br />
The other conditional independences can be proven through a similar process.<br />
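The same cancellation can be checked numerically. The sketch below (hypothetical; the binary CPT values are randomly generated, not from the text) builds the six-node factorization and verifies <math>P(X_4|X_1,X_2,X_3) = P(X_4|X_2)</math> by brute-force enumeration:<br />

```python
import itertools, random

random.seed(0)

def rand_cpt(n_parents):
    # table[parent values] -> P(child = 1), chosen at random for the demo
    return {pa: random.random() for pa in itertools.product([0, 1], repeat=n_parents)}

p2, p3, p4, p5 = rand_cpt(1), rand_cpt(1), rand_cpt(1), rand_cpt(1)
p6 = rand_cpt(2)
p1 = random.random()

def bern(p, x):  # P(X = x) for a Bernoulli(p) variable
    return p if x == 1 else 1 - p

def joint(x1, x2, x3, x4, x5, x6):
    # the factorization P(X1)P(X2|X1)P(X3|X1)P(X4|X2)P(X5|X3)P(X6|X5,X2)
    return (bern(p1, x1) * bern(p2[(x1,)], x2) * bern(p3[(x1,)], x3)
            * bern(p4[(x2,)], x4) * bern(p5[(x3,)], x5)
            * bern(p6[(x2, x5)], x6))

def marg(fixed):
    # marginal probability of the assignment in `fixed`, by enumeration
    free = [i for i in range(1, 7) if i not in fixed]
    total = 0.0
    for vals in itertools.product([0, 1], repeat=len(free)):
        assign = dict(fixed); assign.update(zip(free, vals))
        total += joint(*(assign[i] for i in range(1, 7)))
    return total

lhs = marg({1: 1, 2: 1, 3: 0, 4: 1}) / marg({1: 1, 2: 1, 3: 0})  # P(X4=1|X1,X2,X3)
rhs = marg({2: 1, 4: 1}) / marg({2: 1})                          # P(X4=1|X2)
assert abs(lhs - rhs) < 1e-9
```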
<br />
====Sampling====<br />
Although graphical models greatly simplify the representation of the joint probability, exact inference is not always feasible. Exact inference is practical in small to medium-sized networks only; in large networks it takes far too long. We therefore resort to approximate inference techniques, which are much faster and usually give quite good results.<br />
<br />
In sampling, random samples are generated and the values of interest are computed from the samples rather than from the original distribution.<br />
<br />
As input we have a Bayesian network with a set of nodes <math>X\,\!</math>. A sample may include all variables (except the evidence E) or a subset of them. Sampling schemas dictate how to generate samples (tuples). Ideally, samples are distributed according to <math>P(X|E)\,\!</math>.<br />
<br />
Some sampling algorithms:<br />
* Forward Sampling<br />
* Likelihood weighting<br />
* Gibbs Sampling (MCMC)<br />
** Blocking<br />
** Rao-Blackwellised<br />
* Importance Sampling<br />
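As a minimal illustration of forward sampling, the sketch below (made-up CPTs, and a two-node chain rather than a full network) draws each variable in topological order from <math>P(X_i|\pi_i)</math> and uses the samples to estimate a marginal:<br />

```python
import random

random.seed(1)
# Hypothetical sketch of forward (ancestral) sampling: visit the nodes
# in topological order and draw each X_i given its sampled parents,
# here for a tiny chain X1 -> X2 with made-up CPTs.
p_x1 = 0.6                       # P(X1 = 1)
p_x2 = {0: 0.2, 1: 0.9}          # P(X2 = 1 | X1)

def forward_sample():
    x1 = 1 if random.random() < p_x1 else 0
    x2 = 1 if random.random() < p_x2[x1] else 0
    return x1, x2

samples = [forward_sample() for _ in range(100000)]
# Monte Carlo estimate of P(X2 = 1) versus the exact value
est = sum(x2 for _, x2 in samples) / len(samples)
exact = p_x1 * p_x2[1] + (1 - p_x1) * p_x2[0]   # 0.62
print(est, exact)
```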
<br />
==Bayes Ball== <br />
The Bayes Ball algorithm can be used to determine whether two random variables represented in a graph are independent. The algorithm can show either that two nodes in a graph are independent or that they are not necessarily independent; it cannot show that two nodes are dependent. In other words, it provides a set of rules that let us settle this question using the graph alone, without manipulating the probability distributions. The algorithm is discussed further in later parts of this section. <br />
<br />
===Canonical Graphs===<br />
In order to understand the Bayes Ball algorithm we need to first introduce 3 canonical graphs. Since our graphs are acyclic, we can represent them using these 3 canonical graphs. <br />
<br />
====Markov Chain (also called serial connection)====<br />
In the following graph (Figure 8), X is independent of Z given Y. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Markov.png|thumb|right|Fig.8 Markov chain.]]<br />
<br />
We can prove this independence: <br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\ <br />
& = & \frac{P(X)P(Y|X)P(Z|Y)}{P(X)P(Y|X)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
Where<br />
<br />
<center><math>\begin{matrix}<br />
P(X,Y) & = & \displaystyle \sum_Z P(X,Y,Z) \\<br />
& = & \displaystyle \sum_Z P(X)P(Y|X)P(Z|Y) \\<br />
& = & P(X)P(Y | X) \displaystyle \sum_Z P(Z|Y) \\<br />
& = & P(X)P(Y | X)\\<br />
\end{matrix}</math></center><br />
<br />
Markov chains are an important class of distributions with applications in communications, information theory and image processing. They are suitable for modeling memory in a phenomenon. For example, suppose we want to study the frequency of appearance of English letters in a text. Most likely, when "q" appears, the next letter will be "u"; this shows a dependency between these letters. Markov chains are a suitable model for this kind of relation. <br />
[[File:Markovexample.png|thumb|right|Fig.8a Example of a Markov chain.]]<br />
Markov chains also play a significant role in biological applications. They are widely used in the study of carcinogenesis (the initiation of cancer formation): a gene has to undergo several mutations before it becomes cancerous, a process that can be modeled with a Markov chain. An example is given in Figure 8a, which shows only two gene mutations.<br />
<br />
====Hidden Cause (diverging connection)====<br />
In the Hidden Cause case we can say that X is independent of Z given Y. Here Y is the hidden cause; once it is observed, X and Z become independent. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Hidden.png|thumb|right|Fig.9 Hidden cause graph.]]<br />
<br />
The proof of the independence: <br />
<br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\<br />
& = & \frac{P(Y)P(X|Y)P(Z|Y)}{P(Y)P(X|Y)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
The Hidden Cause case is best illustrated with an example: <br /><br />
<br />
[[File:plot44.png|thumb|right|Fig.10 Hidden cause example.]]<br />
<br />
In Figure 10 it can be seen that both "Shoe Size" and "Grey Hair" depend on the age of a person. Without "Age" in the picture, the variables "Shoe size" and "Grey hair" are dependent in some sense: we must conclude that those with a large shoe size also have a greater chance of having grey hair. However, when "Age" is observed, there is no dependence between "Shoe size" and "Grey hair" because we can deduce both from the "Age" variable alone.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Finally, we look at the third type of canonical graph:<br />
''Explaining-Away Graphs''. This type of graph arises when a<br />
phenomenon has multiple explanations. Here, the conditional<br />
independence statement is actually a statement of marginal<br />
independence: <math>X \perp Z</math>. This type of graph is also called a "V-structure" or "V-shape" because of its illustration (Fig. 11). <br />
<br />
[[File:ExplainingAway.png|thumb|right|Fig.11 The missing edge between node X and node Z implies that<br />
there is a marginal independence between the two: <math>X \perp Z</math>.]]<br />
<br />
In these types of scenarios, variables X and Z are independent.<br />
However, once the third variable Y is observed, X and Z become<br />
dependent (Fig. 11).<br />
<br />
To clarify these concepts, suppose Bob and Mary are supposed to<br />
meet for a noontime lunch. Consider the following events:<br />
<br />
<center><math><br />
late =\begin{cases}<br />
1, & \hbox{if Mary is late}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
aliens =\begin{cases}<br />
1, & \hbox{if aliens kidnapped Mary}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
watch =\begin{cases}<br />
1, & \hbox{if Bob's watch is incorrect}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
If Mary is late, then she could have been kidnapped by aliens.<br />
Alternatively, Bob may have forgotten to adjust his watch for<br />
daylight savings time, making him early. Clearly, both of these<br />
events are independent. Now, consider the following<br />
probabilities:<br />
<br />
<center><math>\begin{matrix}<br />
P( aliens = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1, watch = 0 )<br />
\end{matrix}</math></center><br />
<br />
We expect <math>P( aliens = 1 ) < P( aliens = 1 ~|~ late = 1 )</math>, since<br />
learning that Mary is late raises the probability that she was<br />
kidnapped by aliens. Similarly, we expect<br />
<math>P( aliens = 1 ~|~ late = 1 ) < P( aliens = 1 ~|~ late = 1, watch = 0 )</math>: once Bob's watch is known to be correct, the alternative explanation for Mary's lateness is ruled out. Since<br />
<math>P( aliens = 1 ~|~ late = 1 ) \neq P( aliens = 1 ~|~ late = 1, watch = 0 )</math>, ''aliens'' and<br />
''watch'' are not independent given ''late''. To summarize,<br />
* If we do not observe ''late'', then ''aliens'' <math>~\perp~ watch</math> (<math>X~\perp~ Z</math>)<br />
* If we do observe ''late'', then ''aliens'' <math> ~\cancel{\perp}~ watch ~|~ late</math> (<math>X ~\cancel{\perp}~ Z ~|~ Y</math>)<br />
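The explaining-away effect can be seen numerically. The sketch below uses made-up numbers for the alien/watch story (none of these probabilities come from the text) and verifies the chain of inequalities by enumeration:<br />

```python
# Hypothetical sketch of explaining away with made-up numbers:
# aliens and watch are independent a priori; late depends on both.
p_aliens, p_watch = 0.001, 0.1
p_late = {(0, 0): 0.01, (0, 1): 0.9, (1, 0): 0.99, (1, 1): 0.99}  # P(late=1 | aliens, watch)

def joint(a, w, l):
    pa = p_aliens if a else 1 - p_aliens
    pw = p_watch if w else 1 - p_watch
    pl = p_late[(a, w)] if l else 1 - p_late[(a, w)]
    return pa * pw * pl

def cond_aliens(**evid):
    # P(aliens = 1 | evidence), evidence given as w=..., l=...
    num = sum(joint(1, w, l) for w in (0, 1) for l in (0, 1)
              if all(dict(a=1, w=w, l=l)[k] == v for k, v in evid.items()))
    den = sum(joint(a, w, l) for a in (0, 1) for w in (0, 1) for l in (0, 1)
              if all(dict(a=a, w=w, l=l)[k] == v for k, v in evid.items()))
    return num / den

prior = p_aliens
given_late = cond_aliens(l=1)
given_late_watch_ok = cond_aliens(l=1, w=0)
# observing "late" raises the alien hypothesis; ruling out the watch raises it further
assert prior < given_late < given_late_watch_ok
```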
<br />
===Bayes Ball Algorithm===<br />
<br />
'''Goal:''' We wish to determine whether a given conditional<br />
statement such as <math>X_{A} ~\perp~ X_{B} ~|~ X_{C}</math> is true given a directed graph.<br />
<br />
The algorithm is as follows:<br />
<br />
# Shade nodes, <math>~X_{C}~</math>, that are conditioned on, i.e. they have been observed.<br />
# Assuming that the initial position of the ball is <math>~X_{A}~</math>: <br />
# If the ball cannot reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> must be conditionally independent.<br />
# If the ball can reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> are not necessarily independent.<br />
<br />
The biggest challenge in the ''Bayes Ball Algorithm'' is to<br />
determine what happens to a ball going from node X to node Z as it<br />
passes through node Y. The ball could continue its route to Z or<br />
it could be blocked. It is important to note that the balls are<br />
allowed to travel in any direction, independent of the direction<br />
of the edges in the graph.<br />
<br />
We use the canonical graphs previously studied to determine the<br />
route of a ball traveling through a graph. Using these three<br />
graphs, we establish the Bayes ball rules which can be extended for more<br />
graphical models.<br />
<br />
====Markov Chain (serial connection)====<br />
[[File:BB_Markov.png|thumb|right|Fig.12 (a) When the middle node is shaded, the ball is blocked. (b) When the middle node is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling from X to Z or from Z to X will be blocked at<br />
node Y if this node is shaded. Alternatively, if Y is unshaded,<br />
the ball will pass through.<br />
<br />
In (Fig. 12(a)), X and Z are conditionally<br />
independent ( <math>X ~\perp~ Z ~|~ Y</math> ) while in<br />
(Fig.12(b)) X and Z are not necessarily<br />
independent.<br />
<br />
====Hidden Cause (diverging connection)====<br />
[[File:BB_Hidden.png|thumb|right|Fig.13 (a) When the middle node is shaded, the ball is blocked. (b) When the middle node is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling through Y will be blocked at Y if it is shaded.<br />
If Y is unshaded, then the ball passes through.<br />
<br />
(Fig. 13(a)) demonstrates that X and Z are<br />
conditionally independent when Y is shaded.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Unlike the last two cases, in which the Bayes ball rule was intuitively understandable, in this case a ball traveling through Y is blocked when Y is ''unshaded''. If Y is<br />
shaded, then the ball passes through. Hence, X and Z are<br />
conditionally independent when Y is unshaded.<br />
<br />
[[File:BB_ExplainingAway.png|thumb|right|Fig.14 (a) When the middle node is shaded, the ball passes through Y. (b) When the middle node is unshaded, the ball is blocked.]]<br />
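These three rules are enough to mechanize the whole test. The sketch below is one hypothetical implementation of the Bayes ball (d-separation) test as a breadth-first search over (node, direction) states, checked on the six-node graph of Fig. 7:<br />

```python
from collections import deque

# Sketch of the Bayes ball / d-separation test: BFS over (node, direction)
# states, applying the serial, diverging, and converging rules at each node.
def d_separated(parents, x, y, observed):
    # ancestors of observed nodes (needed for the converging rule)
    anc, stack = set(), list(observed)
    while stack:
        n = stack.pop()
        for p in parents.get(n, ()):
            if p not in anc:
                anc.add(p); stack.append(p)
    children = {}
    for c, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(c)
    # direction 'up' = ball arrived from a child, 'down' = from a parent
    frontier, visited = deque([(x, 'up')]), set()
    while frontier:
        n, d = frontier.popleft()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n not in observed and n == y:
            return False                       # ball reached y: not separated
        if d == 'up' and n not in observed:
            for p in parents.get(n, ()):       # serial: keep going up
                frontier.append((p, 'up'))
            for c in children.get(n, ()):      # diverging connection
                frontier.append((c, 'down'))
        elif d == 'down':
            if n not in observed:              # serial connection
                for c in children.get(n, ()):
                    frontier.append((c, 'down'))
            if n in observed or n in anc:      # converging: shaded descendant opens it
                for p in parents.get(n, ()):
                    frontier.append((p, 'up'))
    return True

# six-node graph of Fig. 7, parents per node
g = {2: {1}, 3: {1}, 4: {2}, 5: {3}, 6: {2, 5}}
print(d_separated(g, 4, 1, {2}))     # True:  X4 _|_ X1 | X2
print(d_separated(g, 2, 3, {1, 6}))  # False: explaining away via X6
```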
<br />
===Bayes Ball Examples===<br />
====Example 1==== <br />
In this first example, we wish to identify the behavior of leaves in the graphical models using two-nodes graphs. Let a ball be<br />
going from X to Y in two-node graphs. To employ the Bayes ball method described above, we implicitly add one extra node to the two-node structure, since the Bayes ball rules were introduced for three-node configurations. We add the third node exactly symmetric to node X with respect to node Y. For example, in (Fig. 15)(a) we can think of a hidden node to the right of node Y, with a hidden arrow from the hidden node to Y. Then we can apply the Bayes ball method: a ball thrown from X cannot pass Y, and thus it is blocked. On the contrary, following the same rule in (Fig. 15)(b), it turns out that if there were a hidden node to the right of Y, a ball could pass from X to that hidden node according to the explaining-away structure. Of course, there is no real node, and in this case we conventionally say that the ball is bounced back to node X. <br />
<br />
[[File:TwoNodesExample.png|thumb|right|Fig.15 (a)The ball is blocked at Y. (b)The ball passes through Y. (c)The ball passes through Y. (d) The ball is blocked at Y.]]<br />
<br />
Finally, for the last two graphs, we used the rules of the ''Hidden Cause Canonical Graph'' (Fig. 13). In (c), the ball passes through<br />
Y while in (d), the ball is blocked at Y.<br />
<br />
====Example 2====<br />
Suppose your home is equipped with an alarm system. There are two<br />
possible causes for the alarm to ring:<br />
* Your house is being burglarized<br />
* There is an earthquake<br />
<br />
Hence, we define the following events:<br />
<br />
<center><math><br />
burglary =\begin{cases}<br />
1, & \hbox{if your house is being burglarized}, \\<br />
0, & \hbox{if your house is not being burglarized}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
earthquake =\begin{cases}<br />
1, & \hbox{if there is an earthquake}, \\<br />
0, & \hbox{if there is no earthquake}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
alarm =\begin{cases}<br />
1, & \hbox{if your alarm is ringing}, \\<br />
0, & \hbox{if your alarm is off}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
report =\begin{cases}<br />
1, & \hbox{if a police report has been written}, \\<br />
0, & \hbox{if no police report has been written}.<br />
\end{cases}<br />
</math></center><br />
<br />
<br />
The ''burglary'' and ''earthquake'' events are independent<br />
if the alarm does not ring. However, if the alarm does ring, then<br />
the ''burglary'' and the ''earthquake'' events are not<br />
necessarily independent. Also, if the alarm rings then it is<br />
more likely that a police report will be issued.<br />
<br />
We can use the ''Bayes Ball Algorithm'' to deduce conditional<br />
independence properties from the graph. Firstly, consider figure<br />
(16(a)) and assume we are trying to determine<br />
whether there is conditional independence between the<br />
''burglary'' and ''earthquake'' events. In figure<br />
(Fig. 16(a)), a ball starting at the ''burglary''<br />
event is blocked at the ''alarm'' node.<br />
<br />
[[File:AlarmExample1.PNG|thumb|right|Fig.16 If we only consider the events ''burglary'', ''earthquake'', and ''alarm'', we find that a ball traveling from ''burglary'' to ''earthquake'' would be blocked at the ''alarm'' node. However, if we also consider the ''report''<br />
node, we can find a path between ''burglary'' and ''earthquake''.]]<br />
<br />
Nonetheless, this does not prove that the ''burglary'' and<br />
''earthquake'' events are independent. Indeed,<br />
(Fig. 16(b)) disproves this as we have found an<br />
alternate path from ''burglary'' to ''earthquake'' passing<br />
through ''report''. It follows that <math>burglary<br />
~\cancel{\amalg}~ earthquake ~|~ report</math><br />
<br />
====Example 3====<br />
<br />
Referring to figure (Fig. 17), we wish to determine<br />
whether the following conditional independence statements are true:<br />
<br />
<center><math>\begin{matrix}<br />
X_{1} ~\amalg~ X_{3} ~|~ X_{2} \\<br />
X_{1} ~\amalg~ X_{5} ~|~ \{X_{3},X_{4}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:LineExample1.png|thumb|right|Fig.17 Simple Markov Chain graph.]]<br />
<br />
To determine whether the first statement is<br />
true, we shade node <math>X_{2}</math>. This blocks balls traveling from<br />
<math>X_{1}</math> to <math>X_{3}</math> and proves that the first statement is valid.<br />
<br />
After shading nodes <math>X_{3}</math> and <math>X_{4}</math> and applying the ''Bayes Ball Algorithm'', we find that a ball travelling from <math>X_{1}</math> to <math>X_{5}</math> is blocked at <math>X_{3}</math>. Similarly, a ball going from <math>X_{5}</math> to <math>X_{1}</math> is blocked at <math>X_{4}</math>. This proves that the second statement also holds.<br />
<br />
====Example 4====<br />
[[File:ClassicExample1.png|thumb|right|Fig.18 Directed graph.]]<br />
<br />
Consider figure (Fig. 18). Using the ''Bayes Ball Algorithm'' we wish to determine if each of the following<br />
statements is valid:<br />
<br />
<center><math>\begin{matrix}<br />
X_{4} ~\amalg~ \{X_{1},X_{3}\} ~|~ X_{2} \\<br />
X_{1} ~\amalg~ X_{6} ~|~ \{X_{2},X_{3}\} \\<br />
X_{2} ~\amalg~ X_{3} ~|~ \{X_{1},X_{6}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:ClassicExample2.PNG|thumb|right|Fig.19 (a) A ball cannot pass through <math>X_{2}</math> or <math>X_{6}</math>. (b) A ball cannot pass through <math>X_{2}</math> or <math>X_{3}</math>. (c) A ball can pass from <math>X_{2}</math> to <math>X_{3}</math>.]]<br />
<br />
To disprove the first statement, we would have to find a path from <math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math> when <math>X_{2}</math> is shaded (refer to Fig. 19(a)). Since there is no route from<br />
<math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math>, we conclude that the first statement is<br />
true.<br />
<br />
Similarly, we can show that there does not exist a path between<br />
<math>X_{1}</math> and <math>X_{6}</math> when <math>X_{2}</math> and <math>X_{3}</math> are shaded (Refer to<br />
Fig. 19(b)). Hence, the second statement is true.<br />
<br />
Finally, (Fig. 19(c)) shows that there is a<br />
route from <math>X_{2}</math> to <math>X_{3}</math> when <math>X_{1}</math> and <math>X_{6}</math> are shaded.<br />
This proves that the third statement is false.<br />
<br />
'''Theorem 2.'''<br /><br />
Define <math>p(x_{v}) = \prod_{i=1}^{n}{p(x_{i} ~|~ x_{\pi_{i}})}</math> to be the factorization of a directed graph as a product of local conditional probabilities.<br /><br />
Let <math>D_{1} = \{ p(x_{v}) = \prod_{i=1}^{n}{p(x_{i} ~|~ x_{\pi_{i}})}\}</math> <br /><br />
Let <math>D_{2} = \{ p(x_{v}):</math>satisfy all conditional independence statements associated with a graph <math>\}</math>.<br /><br />
Then <math>D_{1} = D_{2}</math>.<br />
<br />
====Example 5====<br />
<br />
Given the following Bayesian network (Fig. 19), determine whether the following statements are true or false:<br />
<br />
a.) <math>X_4 \perp \{X_1,X_3\} ~|~ X_2</math> <br />
<br />
Ans. True<br />
<br />
b.) <math>X_1 \perp X_6 ~|~ \{X_2,X_3\}</math> <br />
<br />
Ans. True<br />
<br />
c.) <math>X_2 \perp X_3 ~|~ \{X_1,X_6\}</math> <br />
<br />
Ans. False<br />
<br />
== Undirected Graphical Model ==<br />
<br />
Generally, graphical models divide into two major classes: directed graphs and undirected graphs. Directed graphs and their characteristics were described previously. In this section we discuss undirected graphical models, also known as Markov random fields. In some applications there are relations between variables, but these relations are bilateral and involve no causality. For example, consider a natural image: the value of a pixel is correlated with neighboring pixel values, but the relation is bilateral, not causal. Markov random fields are suitable for modeling such processes and have found applications in fields such as vision and image processing. We can define an undirected graphical model with a graph <math> G = (V, E)</math> where <math> V </math> is a set of vertices corresponding to a set of random variables and <math> E </math> is a set of undirected edges, as shown in (Fig.20).<br />
<br />
==== Conditional independence ====<br />
<br />
For directed graphs, the Bayes ball method was defined to determine the conditional independence properties of a given graph. We can also employ the Bayes ball algorithm to examine the conditional independence of undirected graphs. Here the Bayes ball rule is simpler and more intuitive. Considering (Fig.21), a ball can be thrown either from x to z or from z to x if y is not observed. In other words, if y is not observed, a ball thrown from x can reach z and vice versa. On the contrary, given a shaded y, the node blocks the ball and makes x and z conditionally independent. With this definition one can state that in an undirected graph, a node is conditionally independent of non-neighbors given its neighbors. Technically speaking, <math>X_A</math> is independent of <math>X_C</math> given <math>X_B</math> if the set of nodes <math>X_B</math> separates the nodes <math>X_A</math> from the nodes <math>X_C</math>. Hence, if every path from a node in <math>X_A</math> to a node in <math>X_C</math> includes at least one node in <math>X_B</math>, then we claim that <math> X_A \perp X_C | X_B </math>.<br />
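Because separation in undirected graphs is just graph reachability, the test reduces to a plain search that refuses to enter observed nodes. A minimal sketch (a hypothetical helper, assuming node sets given by name):<br />

```python
from collections import deque

# Sketch: X_A _|_ X_C | X_B in an undirected graph iff every path from
# A to C passes through B, i.e. removing B disconnects A from C.
# A BFS that refuses to enter B checks exactly this.
def separated(edges, A, C, B):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    frontier, seen = deque(a for a in A if a not in B), set(A)
    while frontier:
        n = frontier.popleft()
        if n in C:
            return False            # found a path that avoids B
        for m in adj.get(n, ()):
            if m not in seen and m not in B:
                seen.add(m); frontier.append(m)
    return True

# chain x - y - z: x _|_ z | y holds, but not x _|_ z unconditionally
print(separated([('x', 'y'), ('y', 'z')], {'x'}, {'z'}, {'y'}))  # True
print(separated([('x', 'y'), ('y', 'z')], {'x'}, {'z'}, set()))  # False
```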
<br />
==== Question ====<br />
<br />
Is it possible to convert undirected models to directed models or vice versa?<br />
<br />
In order to answer this question, consider (Fig.22), which illustrates an undirected graph with four nodes: <math>X</math>, <math>Y</math>, <math>Z</math> and <math>W</math>. We can establish two facts using the Bayes ball method:<br />
<br />
<center><math>\begin{matrix}<br />
X \perp Y | \{W,Z\} & & \\<br />
W \perp Z | \{X,Y\} \\<br />
\end{matrix}</math></center><br />
<br />
[[File:UnDirGraphUnconvert.png|thumb|right|Fig.22 There is no directed equivalent to this graph.]]<br />
<br />
It is simple to see that no directed graph satisfies both conditional independence properties. Recalling that directed graphs are acyclic, converting this undirected graph to a directed one results in at least one node with two inward-pointing arrows (a v-structure). Without loss of generality we can assume that node <math>Z</math> has two inward-pointing arrows. By the conditional independence semantics of directed graphs, we then have <math> X \perp Y|W</math>, yet the <math>X \perp Y|\{W,Z\}</math> property does not hold. On the other hand, (Fig.23) depicts a directed graph characterized by the singleton independence statement <math>X \perp Y </math>; there is no undirected graph on three nodes characterized by this singleton statement. Basically, if we consider the set of all distributions over <math>n</math> random variables, one subset can be represented by directed graphical models and another subset by undirected graphical models. There is a narrow intersection between these two subsets, in which a distribution may be represented by either a directed or an undirected graph.<br />
<br />
[[File:DirGraphUnconvert.png|thumb|right|Fig.23 There is no undirected equivalent to this graph.]]<br />
<br />
==== Parameterization ====<br />
<br />
For undirected graphical models, we would like to obtain a "local" parameterization like the one we had for directed graphical models. For directed graphical models, "local" meant a node and its parents, <math> \{i, \pi_i\} </math>. The joint probability and the marginals are defined as a product of such local probabilities, which was inspired by the chain rule of probability theory.<br />
In undirected graphical models, "local" functions cannot be represented using conditional probabilities, and we must abandon conditional probabilities altogether. The factors therefore no longer have a probabilistic interpretation, and we may choose the "local" functions arbitrarily. However, any "local" function for undirected graphical models should satisfy the following condition:<br />
- Consider <math> X_i </math> and <math> X_j </math> that are not linked; they are conditionally independent given all other nodes. As a result, the "local" functions should factorize the joint probability such that <math> X_i </math> and <math> X_j </math> are placed in different factors.<br />
<br />
It can be shown that defining local functions based only on a node and its corresponding edges (similar to directed graphical models) is not tractable, and we need to follow a different approach. Before defining the "local" functions, we introduce a new piece of graph-theory terminology: the clique. A clique is<br />
a subset of fully connected nodes in a graph G: every node in the clique C is directly connected to every other node in C. In addition, a maximal clique is a clique to which no other node from the graph G can be added without destroying the clique property. Considering the undirected graph shown in (Fig. 24), we can list all the cliques as follows:<br />
[[File:graph.png|thumb|right|Fig.24 Undirected graph]]<br />
<br />
- <math> \{X_1, X_3\} </math> <br />
- <math> \{X_1, X_2\} </math><br />
- <math> \{X_3, X_5\} </math><br />
- <math> \{X_2, X_4\} </math> <br />
- <math> \{X_5, X_6\} </math><br />
- <math> \{X_2, X_5\} </math><br />
- <math> \{X_2, X_5, X_6\} </math><br />
<br />
According to the definition, <math> \{X_2,X_5\} </math> is not a maximal clique since we can add one more node, <math> X_6 </math>, and still have a clique. Let C be the set of all maximal cliques in <math> G(V, E) </math>: <br />
<br />
<center><math><br />
C = \{c_1, c_2,..., c_n\}<br />
</math></center><br />
<br />
where in the aforementioned example <math> c_1 </math> would be <math> \{X_1, X_3\} </math>, and so on. We define the joint probability over all nodes as:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \psi_{c_i} (x_{c_i})<br />
</math></center><br />
<br />
where <math> \psi_{c_i} (x_{c_i})</math> is an arbitrary function with some restrictions. This function is not necessarily a probability and is defined over each clique. There are only two restrictions on this function: it must be non-negative and real-valued. Usually <math> \psi_{c_i} (x_{c_i})</math> is called a potential function. <math> Z </math> is the normalization factor, determined by:<br />
<br />
<center><math><br />
Z = \sum_{x_V} { \prod_{c_i \in C} \psi_{c_i} (x_{c_i})}<br />
</math></center> <br />
<br />
As a matter of fact, the normalization factor <math> Z </math> is often unimportant, since most of the time it cancels out during computation. For instance, to calculate the conditional probability <math> P(X_A | X_B) </math>, <math> Z </math> cancels between the numerator <math> P(X_A, X_B) </math> and the denominator <math> P(X_B) </math>.<br />
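A tiny brute-force example makes both points concrete: <math>Z</math> is a sum over all configurations, and it cancels in conditionals. The potentials below are made up purely for illustration:<br />

```python
import itertools

# Hypothetical sketch: brute-force normalization constant Z for a tiny
# MRF with two pairwise cliques {X1,X2} and {X2,X3} over binary variables,
# using made-up non-negative potential tables.
psi12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
psi23 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

def unnorm(x1, x2, x3):
    # unnormalized product of clique potentials
    return psi12[(x1, x2)] * psi23[(x2, x3)]

Z = sum(unnorm(*xs) for xs in itertools.product([0, 1], repeat=3))

def p(x1, x2, x3):
    return unnorm(x1, x2, x3) / Z

# Z cancels when computing conditionals: P(X1=1 | X2=1) needs only ratios
num = sum(unnorm(1, 1, x3) for x3 in (0, 1))
den = sum(unnorm(x1, 1, x3) for x1 in (0, 1) for x3 in (0, 1))
cond = num / den   # 0.8
assert abs(sum(p(*xs) for xs in itertools.product([0, 1], repeat=3)) - 1) < 1e-12
```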
<br />
As mentioned above, the normalized product of the potential functions determines the joint probability over all nodes. Because the potential functions may be chosen arbitrarily, assuming exponential functions for <math> \psi_{c_i} (x_{c_i})</math> simplifies and reduces the computations. Let the potential function be:<br />
<br />
<br />
<center><math><br />
\psi_{c_i} (x_{c_i}) = \exp (- H_{c_i}(x_{c_i}))<br />
</math></center> <br />
<br />
the joint probability is given by:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \exp(-H_{c_i}(x_{c_i})) = \frac{1}{Z} \exp \left(- \sum_{c_i \in C} {H_{c_i} (x_{c_i})}\right)<br />
</math></center> <br />
<br />
There is a lot of information contained in the joint probability distribution <math> P(x_{V}) </math>. We define six tasks, listed below, that we would like to accomplish with various algorithms for a given distribution <math> P(x_{V}) </math>.<br />
<br />
===Tasks:===<br />
<br />
* Marginalization <br /><br />
Given <math> P(x_{V}) </math> find <math> P(x_{A}) </math> where A &sub; V<br /><br />
Given <math> P(x_1, x_2, ... , x_6) </math> find <math> P(x_2, x_6) </math> <br />
* Conditioning <br /><br />
Given <math> P(x_V) </math> find <math>P(x_A|x_B) = \frac{P(x_A, x_B)}{P(x_B)}</math> if A &sub; V and B &sub; V .<br />
* Evaluation <br /><br />
Evaluate the probability for a certain configuration. <br />
* Completion <br /><br />
Compute the most probable configuration. In other words, determine which <math> P(x_A|x_B) </math> is the largest for a specific combination of <math> A </math> and <math> B </math>.<br />
* Simulation <br /><br />
Generate a random configuration for <math> P(x_V) </math> .<br />
* Learning <br /><br />
We would like to find parameters for <math> P(x_V) </math> .<br />
<br />
===Exact Algorithms:===<br />
<br />
To compute the conditional probability of a variable <math>X</math>, we need to marginalize over all the other random variables <math>X_i</math> and all the values each <math>X_i</math> can take, which can be very time-consuming. To reduce the computational complexity of performing such marginalizations, the next sections present exact algorithms that compute exact solutions efficiently:<br />
* Elimination<br />
* Sum-Product<br />
* Max-Product<br />
* Junction Tree<br />
<br />
= Elimination Algorithm=<br />
In this section we will see how we could overcome the problem of probabilistic inference on graphical models. In other words, we discuss the problem of computing conditional and marginal probabilities in graphical models.<br />
<br />
== Elimination Algorithm on Directed Graphs==<br />
First we assume that E and F are disjoint subsets of the node indices of a graphical model, i.e. <math> X_E </math> and <math> X_F </math> are disjoint subsets of the random variables. Given a graph G = (V, ''E''), we aim to calculate <math> p(x_F | x_E) </math> where <math> X_E </math> and <math> X_F </math> represent the evidence and query nodes, respectively. Here, and throughout this section, <math> X_F </math> is a single node; later on a more powerful inference method will be introduced which can make inferences about multiple variables. In order to compute <math> p(x_F | x_E) </math> we first marginalize the joint probability over the nodes that are in neither <math> X_F </math> nor <math> X_E </math>, denoted by <math> R = V \setminus (E \cup F)</math>. <br />
<br />
<center><math><br />
p(x_E, x_F) = \sum_{x_R} {p(x_E, x_F, x_R)}<br />
</math></center><br />
<br />
which can be further marginalized to yield <math> p(x_E) </math>:<br />
<br />
<center><math><br />
p(x_E) = \sum_{x_F} {p(x_E, x_F)}<br />
</math></center><br />
<br />
and then the desired conditional probability is given by:<br />
<br />
<center><math><br />
p(x_F|x_E) = \frac{p(x_E, x_F)}{p(x_E)} <br />
</math></center><br />
<br />
== Example ==<br />
<br />
Let us assume that we are interested in <math> p(x_1 | \bar{x_6}) </math> in (Fig. 18), where <math> \bar{x_6} </math> is an observed value of <math> X_6 </math>, and thus we may treat it as a constant. According to the rule mentioned above, we have to marginalize the joint probability over the non-evidence, non-query nodes:<br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &\sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} p(x_1)p(x_2|x_1)p(x_3|x_1)p(x_4|x_2)p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) \sum_{x_5} p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) m_5(x_2, x_3)<br />
\end{matrix}</math></center><br />
<br />
where, to simplify notation, we define <math> m_5(x_2, x_3) </math> as the result of the last summation. Since that summation is over <math> x_5 </math>, the result depends only on <math> x_2 </math> and <math> x_3</math>. In general, let <math> m_i(x_{S_i}) </math> denote the expression that arises from performing the sum <math> \sum_{x_i} </math>, where <math> x_{S_i} </math> are the variables, other than <math> x_i </math>, that appear in the summand. Continuing the derivation we have: <br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1)m_5(x_2,x_3)\sum_{x_4} p(x_4|x_2)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)\sum_{x_3}p(x_3|x_1)m_5(x_2,x_3)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)m_3(x_1,x_2)\\<br />
& = & p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
<br />
Therefore, the conditional probability is given by:<br />
<center><math><br />
p(x_1|\bar{x_6}) = \frac{p(x_1)m_2(x_1)}{\sum_{x_1} p(x_1)m_2(x_1)}<br />
</math></center><br />
<br />
At the beginning of our computation we assumed that <math> X_6 </math> is observed, and the notation <math> \bar{x_6} </math> expresses this fact. In general, let <math> X_i </math> be an evidence node whose observed value is <math> \bar{x_i} </math>; we define an evidence potential function <math> \delta(x_i, \bar{x_i}) </math> whose value is one if <math> x_i = \bar{x_i} </math> and zero otherwise. <br />
This function allows us to treat the evidence as an ordinary summation over <math> x_6 </math>, yielding:<br />
<br />
<center><math><br />
m_6(x_2, x_5) = \sum_{x_6} p(x_6|x_2, x_5) \delta(x_6, \bar{x_6})<br />
</math></center><br />
<br />
We can define an algorithm to make inference on directed graphs using elimination techniques. <br />
Let E and F be an evidence set and a query node, respectively. We first choose an elimination ordering I such that F appears last in this ordering. The following figure shows the steps required to perform the elimination algorithm for probabilistic inference on directed graphs:<br />
<br />
<br />
<code><br />
ELIMINATE (G,E,F)<br/><br />
INITIALIZE (G,F)<br/><br />
EVIDENCE(E)<br/><br />
UPDATE(G)<br/><br />
<br />
NORMALIZE(F)<br/><br />
<br />
INITIALIZE(G,F)<br/><br />
Choose an ordering <math>I</math> such that <math>F</math> appears last <br/><br />
:'''For''' each node <math>X_i</math> in <math>V</math> <br/><br />
::Place <math>p(x_i|x_{\pi_i})</math> on the active list <br/><br />
<br />
:'''End'''<br/><br />
<br />
EVIDENCE(E)<br/><br />
:'''For''' each <math>i</math> in <math>E</math> <br/><br />
::Place <math>\delta(x_i,\overline{x_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
UPDATE(G)<br/><br />
:''' For''' each <math>i</math> in <math>I</math> <br/><br />
::Find all potentials from the active list that reference <math>x_i</math> and remove them from the active list <br/><br />
::Let <math>\phi_i(x_{T_i})</math> denote the product of these potentials <br/><br />
::Let <math>m_i(x_{S_i})=\sum_{x_i}\phi_i(x_{T_i})</math> <br/><br />
::Place <math>m_i(x_{S_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
NORMALIZE(F) <br/><br />
:<math> p(x_F|\overline{x_E})</math> &larr; <math>\phi_F(x_F)/\sum_{x_F}\phi_F(x_F)</math><br/><br />
<br />
</code><br />
<br />
'''Example:''' <br /><br />
For the graph in figure 21, <math>G =(V,E)</math>. Consider once again that node <math>x_1</math> is the query node and <math>x_6</math> is the evidence node. <br />
<math>I = \left\{6,5,4,3,2,1\right\}</math> (1 should be the last node, ordering is crucial)<br /><br />
[[File:ClassicExample1.png|thumb|right|Fig.21 Six node example.]]<br />
We must now create an active list. There are two rules that must be followed in order to create this list. <br />
<br />
# For i <math>\in{V}</math> place <math>p(x_i|x_{\pi_i})</math> in the active list. <br />
# For i <math>\in{E}</math> place <math>\delta(x_i,\overline{x_i})</math> in the active list. <br />
<br />
Here, our active list is:<br />
<math> p(x_1), p(x_2|x_1), p(x_3|x_1), p(x_4|x_2), p(x_5|x_3),\underbrace{p(x_6|x_2, x_5)\delta{(\overline{x_6},x_6)}}_{\phi_6(x_2,x_5, x_6), \sum_{x_6}{\phi_6}=m_{6}(x_2,x_5) }</math><br />
<br />
We first eliminate node <math>X_6</math>. We place <math>m_{6}(x_2,x_5)</math> on the active list, having removed <math>X_6</math>. We now eliminate <math>X_5</math>. <br />
<br />
<center><math> \underbrace{p(x_5|x_3)*m_6(x_2,x_5)}_{m_5(x_2,x_3)} </math></center><br />
<br />
Likewise, we can eliminate <math>X_4, X_3,</math> and <math>X_2</math> (which yields the unnormalized conditional probability <math>p(x_1|\overline{x_6})</math>), and finally <math>X_1</math>. The last elimination yields <math>m_1 = \sum_{x_1}{\phi_1(x_1)}</math>, which is the normalization factor <math>p(\overline{x_6})</math>.<br />
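The elimination steps above can be sketched in code. The following is a minimal illustration (not part of the original notes), using a hypothetical three-node chain <math>X_1 \rightarrow X_2 \rightarrow X_3</math> with made-up probability tables and evidence <math>x_3 = 1</math>; factors are stored as (variables, table) pairs and one variable is summed out at a time, exactly as the UPDATE step places messages on the active list. <br />

```python
from itertools import product

# Hypothetical CPTs for a chain x1 -> x2 -> x3 (binary variables).
# A factor is (vars, table); table keys are value tuples ordered like vars.
p1     = ((1,),   {(0,): 0.6, (1,): 0.4})                  # p(x1)
p2g1   = ((1, 2), {(0, 0): 0.7, (0, 1): 0.3,
                   (1, 0): 0.2, (1, 1): 0.8})              # p(x2|x1)
p3g2   = ((2, 3), {(0, 0): 0.9, (0, 1): 0.1,
                   (1, 0): 0.4, (1, 1): 0.6})              # p(x3|x2)
delta3 = ((3,),   {(0,): 0.0, (1,): 1.0})                  # evidence potential, x3 = 1

def eliminate(factors, order, domain=(0, 1)):
    """Sum out the variables in `order`, replacing touched factors by a message."""
    for var in order:
        touching = [f for f in factors if var in f[0]]
        factors  = [f for f in factors if var not in f[0]]
        new_vars = tuple(sorted({v for vs, _ in touching for v in vs} - {var}))
        table = {}
        for assign in product(domain, repeat=len(new_vars)):
            ctx, total = dict(zip(new_vars, assign)), 0.0
            for x in domain:
                ctx[var] = x
                term = 1.0
                for vs, t in touching:
                    term *= t[tuple(ctx[v] for v in vs)]
                total += term
            table[assign] = total
        factors.append((new_vars, table))                   # the message m_var
    return factors

# Eliminate x3 then x2; what remains is proportional to p(x1, x3=1).
remaining = eliminate([p1, p2g1, p3g2, delta3], order=[3, 2])
unnorm = {x1: 1.0 for x1 in (0, 1)}
for _vars, t in remaining:                                  # multiply leftover factors
    for x1 in (0, 1):
        unnorm[x1] *= t[(x1,)]
Z = sum(unnorm.values())                                    # this is p(x3 = 1)
posterior = {x1: v / Z for x1, v in unnorm.items()}         # p(x1 | x3 = 1)
```

The division by <math>Z</math> at the end is exactly the NORMALIZE step of the pseudocode above. <br />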
<br />
==Elimination Algorithm on Undirected Graphs==<br />
<br />
[[File:graph.png|thumb|right|Fig.22 Undirected graph G']]<br />
<br />
The first task is to find the maximal cliques of the graph G' (Fig.22) and their associated potential functions. <br />
maximal clique: <math>\left\{x_1, x_2\right\}</math>, <math>\left\{x_1, x_3\right\}</math>, <math>\left\{x_2, x_4\right\}</math>, <math>\left\{x_3, x_5\right\}</math>, <math>\left\{x_2,x_5,x_6\right\}</math> <br /><br />
potential functions: <math>\varphi{(x_1,x_2)},\varphi{(x_1,x_3)},\varphi{(x_2,x_4)}, \varphi{(x_3,x_5)}</math> and <math>\varphi{(x_2,x_5,x_6)}</math> <br />
<br />
<math> p(x_1|\overline{x_6})=p(x_1,\overline{x_6})/p(\overline{x_6})\cdots\cdots\cdots\cdots\cdots(*) </math><br />
<br />
<math>p(x_1,\overline{x_6})=\frac{1}{Z}\sum_{x_2,x_3,x_4,x_5,x_6}\varphi{(x_1,x_2)}\varphi{(x_1,x_3)}\varphi{(x_2,x_4)}\varphi{(x_3,x_5)}\varphi{(x_2,x_5,x_6)}\delta{(x_6,\overline{x_6})}<br />
</math><br />
<br />
The <math>\frac{1}{Z}</math> term looks crucial, but in fact it has no effect, because in (*) both the numerator and the denominator contain <math>\frac{1}{Z}</math>, so it cancels. <br />
The general rule for elimination in an undirected graph is that we can remove a node as long as we connect all of the neighbours of that node together. Effectively, we form a clique out of the neighbours of that node.<br />
The algorithm used to eliminate nodes in an undirected graph is:<br />
<br />
<br />
<code><br />
<br/><br />
<br />
UndirectedGraphElimination(G,I)<br />
:For each node <math>X_i</math> in <math>I</math><br />
::Connect all of the remaining neighbours of <math>X_i</math><br />
::Remove <math>X_i</math> from the graph <br />
:End <br />
<br />
<br/><br />
</code><br />
<br />
<br />
'''Example: ''' <br /><br />
For the graph G in figure 24 <br /><br />
when we remove x1, G becomes as in figure 25 <br /><br />
while if we remove x2, G becomes as in figure 26<br />
<br />
[[File:ex.png|thumb|right|Fig.24 ]]<br />
[[File:ex2.png|thumb|right|Fig.25 ]]<br />
[[File:ex3.png|thumb|right|Fig.26 ]]<br />
<br />
An interesting point is that the order of elimination matters a great deal. Consider the two results above: removing one node reduces the graph complexity only slightly, while removing another increases it significantly. We care about graph complexity because it determines the number of calculations required to answer queries about the graph. In a huge graph with thousands of nodes, the node-removal order is therefore key to the running time of the algorithm. For example, if we remove one of the leaves first, the largest clique created has size two and the computational complexity is of order <math>N^2</math>; removing the centre node first creates a clique of size five and a complexity of order <math>N^5</math>. Unfortunately, there is no efficient algorithm that produces the optimal removal order: finding an optimal ordering is in general NP-hard.<br />
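The effect of the ordering can be made concrete with a small sketch (an illustration, not from the original notes) that simulates undirected elimination by connecting a node's remaining neighbours and removing it, recording the largest clique created. On a hypothetical star graph, eliminating the leaves first keeps every clique small, while eliminating the centre first creates one large clique. <br />

```python
def largest_elimination_clique(adj, order):
    """Eliminate nodes in `order`: connect remaining neighbours, remove the node.
    Return the size of the largest clique formed along the way."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    largest = 0
    for v in order:
        nbrs = adj.pop(v)
        largest = max(largest, len(nbrs) + 1)     # clique: v together with its neighbours
        for u in nbrs:                            # connect the remaining neighbours
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
    return largest

# Hypothetical star graph: centre 0 joined to leaves 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
leaves_first = largest_elimination_clique(star, [1, 2, 3, 4, 0])  # small cliques
centre_first = largest_elimination_clique(star, [0, 1, 2, 3, 4])  # one big clique
```

Since the cost of marginalizing a clique grows exponentially in its size, the first ordering is far cheaper than the second. <br />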
<br />
==Moralization==<br />
So far we have shown how to use elimination to successively remove nodes from an undirected graph. We know that this is useful in the process of marginalization. We can now turn to the question of what will happen when we have a directed graph. It would be nice if we could somehow reduce the directed graph to an undirected form and then apply the previous elimination algorithm. This reduction is called moralization and the graph that is produced is called a moral graph. <br />
<br />
To moralize a graph we first need to connect the parents of each node together. This makes sense intuitively because the parents of a node need to be considered together in the undirected graph and this is only done if they form a type of clique. By connecting them together we create this clique. <br />
<br />
After the parents are connected together we can just drop the orientation on the edges in the directed graph. By removing the directions we force the graph to become undirected. <br />
<br />
The previous elimination algorithm can now be applied to the new moral graph. We can do this by identifying each conditional probability <math> p(x_i|x_{\pi_i}) </math> of the directed graph with a clique potential <math> \psi_{C_i}(x_{C_i}) </math> of the undirected graph.<br />
<br />
'''Example:'''<br /><br />
I = <math>\left\{x_6,x_5,x_4,x_3,x_2,x_1\right\}</math><br /><br />
When we moralize the directed graph in figure 27, we obtain the<br />
undirected graph in figure 28.<br />
<br />
[[File:moral.png|thumb|right|Fig.27 Original Directed Graph]]<br />
[[File:moral3.png|thumb|right|Fig.28 Moral Undirected Graph]]<br />
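The two moralization steps (connect the parents, drop the directions) can be sketched directly; this is an illustrative snippet, not from the original notes, with the six-node graph of Fig.27 encoded as a hypothetical child-to-parents map. <br />

```python
def moralize(parents):
    """Moralize a directed graph given as {child: [parents]}.
    Step 1: connect the parents of each node.  Step 2: drop edge directions."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:                              # undirected child-parent edges
            adj[child].add(p)
            adj[p].add(child)
        ps = list(ps)
        for i in range(len(ps)):                  # connect the parents pairwise
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j])
                adj[ps[j]].add(ps[i])
    return adj

# Six-node example: x6 has two parents (x2 and x5), so moralization adds edge 2-5.
moral = moralize({1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]})
```

Only <math>x_6</math> has more than one parent here, so the moral graph differs from the skeleton by the single added edge between <math>x_2</math> and <math>x_5</math>. <br />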
<br />
=Elimination Algorithm on Trees=<br />
<br />
<br />
'''Definition of a tree:'''<br /><br />
A tree is an undirected graph in which any two vertices are connected by exactly one simple path. In other words, any connected graph without cycles is a tree.<br />
<br />
If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree.<br />
<br />
==Belief Propagation Algorithm (Sum Product Algorithm)==<br />
<br />
One of the main disadvantages to the elimination algorithm is that the ordering of the nodes defines the number of calculations that are required to produce a result. The optimal ordering is difficult to calculate and without a decent ordering the algorithm may become very slow. In response to this we can introduce the sum product algorithm. It has one major advantage over the elimination algorithm: it is faster. The sum product algorithm has the same complexity when it has to compute the probability of one node as it does to compute the probability of all the nodes in the graph. Unfortunately, the sum product algorithm also has one disadvantage. Unlike the elimination algorithm it can not be used on any graph. The sum product algorithm works only on trees. <br />
<br />
For undirected graphs if there is only one path between any two pair of nodes then that graph is a tree (Fig.29). If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree (Fig.30). <br />
<br />
[[File:UnDirTree.png|thumb|right|Fig.29 Undirected tree]]<br />
[[File:Dir_Tree.png|thumb|right|Fig.30 Directed tree]]<br />
<br />
For the undirected graph <math>G=(V, E)</math> (Fig.29) we can write the joint probability distribution function in the following way.<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\prod_{i \in V}\psi(x_i)\prod_{(i,j) \in E}\psi(x_i, x_j)</math></center><br />
<br />
We know that in general we can not convert a directed graph into an undirected graph. There is however an exception to this rule when it comes to trees. In the case of a directed tree there is an algorithm that allows us to convert it to an undirected tree with the same properties. <br /><br />
Take the above example (Fig.30) of a directed tree. We can write the joint probability distribution function as: <br />
<center><math> P(x_v) = P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
If we want to convert this graph to the undirected form shown in Fig.29, then we can use the following set of rules.<br />
* If <math>\gamma</math> is the root then: <math> \psi(x_\gamma) = P(x_\gamma) </math>.<br />
* If <math>\gamma</math> is NOT the root then: <math> \psi(x_\gamma) = 1 </math>.<br />
* If <math>\left\lbrace i \right\rbrace</math> = <math>\pi_j</math> then: <math> \psi(x_i, x_j) = P(x_j | x_i) </math>.<br />
So now we can rewrite the above equation for (Fig.30) as:<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\psi(x_1)...\psi(x_5)\psi(x_1, x_2)\psi(x_1, x_3)\psi(x_2, x_4)\psi(x_2, x_5) </math></center><br />
<center><math> = \frac{1}{Z(\psi)}P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
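The three conversion rules can be written out directly. The sketch below is illustrative (hypothetical binary CPTs, not from the original notes): it assigns <math>\psi(x_\gamma)=P(x_\gamma)</math> at the root, <math>\psi(x_\gamma)=1</math> elsewhere, and <math>\psi(x_i,x_j)=P(x_j|x_i)</math> on each parent-child edge, so that the product of potentials equals the directed joint and <math>Z(\psi)=1</math>. <br />

```python
# Directed tree of Fig.30 as child -> parent (None marks the root).
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}

# Hypothetical CPTs: a root marginal and one p(x_child | x_parent) table per edge.
root_cpt = {0: 0.6, 1: 0.4}
edge_cpt = {(i, j): {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
            for j, i in parent.items() if i is not None}

node_pot, edge_pot = {}, {}
for j, i in parent.items():
    if i is None:
        node_pot[j] = dict(root_cpt)            # rule 1: psi(root) = P(root)
    else:
        node_pot[j] = {0: 1.0, 1: 1.0}          # rule 2: psi(non-root) = 1
        edge_pot[(i, j)] = edge_cpt[(i, j)]     # rule 3: psi(x_i, x_j) = P(x_j | x_i)

def undirected_joint(x):
    """Product of all potentials at the assignment x (a dict node -> value)."""
    p = 1.0
    for j in parent:
        p *= node_pot[j][x[j]]
    for (i, j), t in edge_pot.items():
        p *= t[(x[i], x[j])]
    return p                                    # Z(psi) = 1 by construction
```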
<br />
==Elimination Algorithm on a Tree==<br />
<br />
[[File:fig1.png|thumb|right|Fig.31 Message-passing in Elimination Algorithm]]<br />
<br />
We will derive the Sum-Product algorithm from the point of view<br />
of the Eliminate algorithm. To marginalize <math>x_1</math> in<br />
Fig.31,<br />
<center><math>\begin{matrix}<br />
p(x_1)&=&\sum_{x_2}\sum_{x_3}\sum_{x_4}\sum_{x_5}p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2)p(x_5|x_3) \\<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\sum_{x_3}p(x_3|x_2)\sum_{x_4}p(x_4|x_2)\underbrace{\sum_{x_5}p(x_5|x_3)} \\<br />
<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\underbrace{\sum_{x_3}p(x_3|x_2)m_5(x_3)}\underbrace{\sum_{x_4}p(x_4|x_2)} \\<br />
<br />
&=&p(x_1)\underbrace{\sum_{x_2}m_3(x_2)m_4(x_2)} \\<br />
<br />
&=&p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
where,<br />
<center><math>\begin{matrix}<br />
m_5(x_3)=\sum_{x_5}p(x_5|x_3)=\sum_{x_5}\psi(x_5)\psi(x_5,x_3)=\mathbf{m_{53}(x_3)} \\<br />
m_4(x_2)=\sum_{x_4}p(x_4|x_2)=\sum_{x_4}\psi(x_4)\psi(x_4,x_2)=\mathbf{m_{42}(x_2)} \\<br />
m_3(x_2)=\sum_{x_3}p(x_3|x_2)m_5(x_3)=\sum_{x_3}\psi(x_3)\psi(x_3,x_2)m_5(x_3)=\mathbf{m_{32}(x_2)}, \end{matrix}</math></center><br />
which is essentially (potential of the node)<math>\times</math>(potential of<br />
the edge)<math>\times</math>(message from the child).<br />
<br />
The term "<math>m_{ji}(x_i)</math>" represents the intermediate factor between the eliminated variable, ''j'', and the remaining neighbor of the variable, ''i''. Thus, in the above case, we will use <math>m_{53}(x_3)</math> to denote <math>m_5(x_3)</math>, <math>m_{42}(x_2)</math> to denote<br />
<math>m_4(x_2)</math>, and <math>m_{32}(x_2)</math> to denote <math>m_3(x_2)</math>. We refer to the<br />
intermediate factor <math>m_{ji}(x_i)</math> as a "message" that ''j''<br />
sends to ''i'' (Fig.31).<br />
<br />
In general,<center><math>\begin{matrix}<br />
m_{ji}(x_i)=\sum_{x_j}\left(<br />
\psi(x_j)\psi(x_j,x_i)\prod_{k\in{\mathcal{N}(j)\backslash i}}m_{kj}(x_j)\right)<br />
\end{matrix}</math></center><br />
<br />
Note: the BP algorithm gives the exact solution only if the graph is a tree; however, experiments have shown that BP often yields acceptable approximate answers even when the graph has loops.<br />
<br />
==Elimination To Sum Product Algorithm==<br />
<br />
[[File:fig2.png|thumb|right|Fig.32 All of the messages needed to compute all singleton<br />
marginals]]<br />
<br />
The Sum-Product algorithm allows us to compute all<br />
marginals in the tree by passing messages inward from the leaves of<br />
the tree to an (arbitrary) root, and then passing it outward from the<br />
root to the leaves, again using the above equation at each step. The net effect is<br />
that a single message will flow in both directions along each edge.<br />
(See Fig.32) Once all such messages have been computed using the above equation,<br />
we can compute desired marginals. One of the major advantages of this algorithm is that<br />
messages can be reused which reduces the computational cost heavily.<br />
<br />
As shown in Fig.32, to compute the marginal of <math>X_1</math> using<br />
elimination, we eliminate <math>X_5</math>, which involves computing a message<br />
<math>m_{53}(x_3)</math>, then eliminate <math>X_4</math> and <math>X_3</math> which involves<br />
messages <math>m_{32}(x_2)</math> and <math>m_{42}(x_2)</math>. We subsequently eliminate<br />
<math>X_2</math>, which creates a message <math>m_{21}(x_1)</math>.<br />
<br />
Suppose that we want to compute the marginal of <math>X_2</math>. As shown in<br />
Fig.33, we first eliminate <math>X_5</math>, which creates <math>m_{53}(x_3)</math>, and<br />
then eliminate <math>X_3</math>, <math>X_4</math>, and <math>X_1</math>, passing messages<br />
<math>m_{32}(x_2)</math>, <math>m_{42}(x_2)</math> and <math>m_{12}(x_2)</math> to <math>X_2</math>.<br />
<br />
[[File:fig3.png|thumb|right|Fig.33 The messages formed when computing the marginal of <math>X_2</math>]]<br />
<br />
Since the messages can be "reused", marginals over all possible<br />
elimination orderings can be computed by computing all possible<br />
messages, and the number of messages is small compared to the number of<br />
possible elimination orderings.<br />
<br />
The Sum-Product algorithm relies not only on the above equation, but also on the ''Message-Passing Protocol''.<br />
'''Message-Passing Protocol''' tells us that a node can<br />
send a message to a neighboring node when (and only when) it has<br />
received messages from all of its other neighbors.<br />
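The protocol can be implemented directly: keep a set of sent messages and let a node fire toward a neighbour as soon as messages from all its other neighbours have arrived. The sketch below is illustrative (hypothetical potentials on the tree of Fig.31, binary variables, not from the original notes); it recovers every singleton marginal from the same set of <math>2|E|</math> messages. <br />

```python
import math

# Tree of Fig.31: edges 1-2, 2-3, 2-4, 3-5; binary variables.
edges = [(1, 2), (2, 3), (2, 4), (3, 5)]
nbr = {i: set() for i in range(1, 6)}
for a, b in edges:
    nbr[a].add(b)
    nbr[b].add(a)

# Hypothetical potentials (any positive tables would do).
node_pot = {i: {0: 1.0, 1: 1.0} for i in range(1, 6)}
node_pot[1] = {0: 0.6, 1: 0.4}
edge_pot = {e: {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8} for e in edges}

def psi(i, j, xi, xj):
    """Edge potential looked up regardless of the stored orientation."""
    return edge_pot[(i, j)][(xi, xj)] if (i, j) in edge_pot else edge_pot[(j, i)][(xj, xi)]

def message(j, i, msgs):
    """m_{ji}(x_i) = sum_{x_j} psi(x_j) psi(x_j, x_i) prod_{k in N(j)\\{i}} m_{kj}(x_j)"""
    return {xi: sum(node_pot[j][xj] * psi(i, j, xi, xj) *
                    math.prod(msgs[(k, j)][xj] for k in nbr[j] - {i})
                    for xj in (0, 1))
            for xi in (0, 1)}

# Message-Passing Protocol: send j -> i once all of j's other neighbours have sent to j.
msgs = {}
while len(msgs) < 2 * len(edges):
    for j in nbr:
        for i in nbr[j]:
            if (j, i) not in msgs and all((k, j) in msgs for k in nbr[j] - {i}):
                msgs[(j, i)] = message(j, i, msgs)

def marginal(i):
    """p(x_i): node potential times all incoming messages, normalized."""
    unnorm = {x: node_pot[i][x] * math.prod(msgs[(k, i)][x] for k in nbr[i])
              for x in (0, 1)}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}
```

The leaves fire first (their "other neighbour" set is empty), messages then flow inward and back outward, and each directed edge carries exactly one message. <br />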
<br />
<br />
<br />
===For Directed Graph===<br />
Previously we stated that:<br />
<center><math><br />
p(x_F,\bar{x}_E)=\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E),<br />
</math></center><br />
<br />
Using the above equation, we find the marginal of <math>\bar{x}_E</math>.<br />
<center><math>\begin{matrix}<br />
p(\bar{x}_E)&=&\sum_{x_F}\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E) \\<br />
&=&\sum_{x_v}p(x_F,x_E)\delta (x_E,\bar{x}_E)<br />
\end{matrix}</math></center><br />
<br />
Now we denote:<br />
<center><math><br />
p^E(x_v) = p(x_v) \delta (x_E,\bar{x}_E)<br />
</math></center><br />
<br />
Since the sets ''F'' and ''E'' together make up <math>\mathcal{V}</math>,<br />
<math>p(x_v)</math> is equal to <math>p(x_F,x_E)</math>. Thus we can substitute this<br />
definition into the two equations above, and they become:<br />
<center><math>\begin{matrix}<br />
p(x_F,\bar{x}_E) = \sum_{x_E} p^E(x_v), \\<br />
p(\bar{x}_E) = \sum_{x_v}p^E(x_v)<br />
\end{matrix}</math></center><br />
<br />
We are interested in finding the conditional probability. Substituting<br />
the two previous results into the conditional<br />
probability equation gives:<br />
<br />
<center><math>\begin{matrix}<br />
p(x_F|\bar{x}_E)&=&\frac{p(x_F,\bar{x}_E)}{p(\bar{x}_E)} \\<br />
&=&\frac{\sum_{x_E}p^E(x_v)}{\sum_{x_v}p^E(x_v)}<br />
\end{matrix}</math></center><br />
<math>p^E(x_v)</math> is an unnormalized version of conditional probability,<br />
<math>p(x_F|\bar{x}_E)</math>. <br />
<br />
===For Undirected Graphs===<br />
<br />
We denote <math>\psi^E</math> to be:<br />
<center><math>\begin{matrix}<br />
\psi^E(x_i) = \psi(x_i)\delta(x_i,\bar{x}_i),& & \mbox{if } i\in{E} \\<br />
\psi^E(x_i) = \psi(x_i),& & \mbox{otherwise}<br />
\end{matrix}</math></center><br />
<br />
==Max-Product==<br />
Because multiplication distributes over max as well as sum:<br />
<br />
<center><math> \max(ab, ac) = a \max(b, c), \qquad a \geq 0 </math></center><br />
<br />
Formally, this works because both (sum, product) and (max, product) are commutative semirings.<br />
<br />
We would like to find the maximum probability that can be achieved over all configurations of a set of random variables. The algorithm is similar to sum-product, except that we replace the sum with max. <br />
<br />
[[File:suks.png|thumb|right|Fig.33 Max Product Example]]<br />
<br />
<center><math>\begin{matrix}<br />
\max_{x}{P(x_1,\ldots,x_5)} & = & \max_{x_1}\max_{x_2}\max_{x_3}\max_{x_4}\max_{x_5}{P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_2)P(x_5|x_3)} \\<br />
& = & \max_{x_1}{P(x_1)}\max_{x_2}{P(x_2|x_1)}\max_{x_3}{P(x_3|x_2)}\max_{x_4}{P(x_4|x_2)}\max_{x_5}{P(x_5|x_3)}<br />
\end{matrix}</math></center><br />
<br />
With evidence, we maximize the conditional probability <math>p(x_F|\bar{x}_E)</math>; the sum-product and max-product messages are then:<br />
<br />
<center><math>m_{ji}(x_i)=\sum_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}}</math></center><br />
<center><math>m^{max}_{ji}(x_i)=\max_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}}</math></center><br />
<br />
<br />
'''Example:''' <br />
Consider the graph in Figure.33. <br />
<center><math> m^{max}_{53}(x_3)=\max_{x_5}{\psi^{E}{(x_5)}\psi{(x_3,x_5)}} </math></center><br />
<center><math> m^{max}_{32}(x_2)=\max_{x_3}{\psi^{E}{(x_3)}\psi{(x_2,x_3)}m^{max}_{53}(x_3)} </math></center><br />
<br />
==Maximum configuration==<br />
We would also like to find the configuration of the <math>x_i</math>s that produces the largest value of the joint. To do this we record, alongside each max message, the maximizing value with argmax. <br />
<math>\delta_{53}(x_3)= \operatorname{argmax}_{x_5}\psi{(x_5)}\psi{(x_5,x_3)}</math><br />
<math>\log{m^{max}_{ji}(x_i)}=\max_{x_j}\left(\log{\psi^{E}{(x_j)}}+\log{\psi{(x_i,x_j)}}+\sum_{k\in{N(j)\backslash{i}}}\log{m^{max}_{kj}{(x_j)}}\right)</math><br />
In many cases we want to use the log of this expression because the numbers tend to be very high. Also, it is important to note that this also works in the continuous case where we replace the summation sign with an integral.<br />
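Both the max pass and the argmax decoding can be sketched on a small chain. The snippet below is illustrative (hypothetical potentials on a three-node chain <math>x_1 - x_2 - x_3</math>, not from the original notes): it records an argmax pointer alongside each max-product message and then backtracks from the root to read off the maximizing configuration. <br />

```python
# Hypothetical potentials on the chain x1 - x2 - x3 (binary variables).
psi1  = {0: 0.6, 1: 0.4}                                      # node potential at x1
psi12 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # psi(x1, x2)
psi23 = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.5, (1, 1): 0.5}  # psi(x2, x3)

# Max-product messages passed inward to x1, with argmax pointers for decoding.
m32   = {x2: max(psi23[(x2, x3)] for x3 in (0, 1)) for x2 in (0, 1)}
arg32 = {x2: max((0, 1), key=lambda x3, x2=x2: psi23[(x2, x3)]) for x2 in (0, 1)}
m21   = {x1: max(psi12[(x1, x2)] * m32[x2] for x2 in (0, 1)) for x1 in (0, 1)}
arg21 = {x1: max((0, 1), key=lambda x2, x1=x1: psi12[(x1, x2)] * m32[x2])
         for x1 in (0, 1)}

# Decode: choose x1 from the final beliefs, then follow the pointers outward.
x1 = max((0, 1), key=lambda x: psi1[x] * m21[x])
x2 = arg21[x1]
x3 = arg32[x2]
best = psi1[x1] * psi12[(x1, x2)] * psi23[(x2, x3)]
```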
<br />
=Parameter Learning=<br />
<br />
The goal of graphical models is to build a useful representation of the input data for understanding and designing learning algorithms. A graphical model represents a joint probability distribution over its nodes (random variables). One of its most important features is that it encodes conditional independence between the nodes. This is achieved using local functions which are combined into a factorization of the joint probability distribution; the factorization, in turn, expresses the conditional independencies in the distribution. That does not mean, however, that the graphical model represents all the independence assumptions that hold. <br />
<br />
==Basic Statistical Problems==<br />
In statistics there are a number of different 'standard' problems that always appear in one form or another. They are as follows: <br />
<br />
* Regression<br />
* Classification<br />
* Clustering<br />
* Density Estimation<br />
<br />
<br />
<br />
===Regression===<br />
In regression we have a set of data points <math> (x_i, y_i) </math> for <math> i = 1...n </math> and we would like to determine the way that the variables x and y are related. In certain cases such as (Fig.34) we try to fit a line (or other type of function) through the points in such a way that it describes the relationship between the two variables. <br />
<br />
[[File:regression.png|thumb|right|Fig.34 Regression]]<br />
<br />
Once the relationship has been determined we can give a functional value to the following expression. In this way we can determine the value (or distribution) of y if we have the value for x. <br />
<math>P(y|x)=\frac{P(y,x)}{P(x)} = \frac{P(y,x)}{\int_{y}{P(y,x)dy}}</math><br />
<br />
===Classification===<br />
In classification we also have a set of data points, each containing a set of features <math> (x_1, x_2,.. ,x_i) </math> for <math> i = 1...n </math>, and we would like to assign each data point to one of a given number of classes y. Consider the example in (Fig.35) where two sets of features have been divided into the sets + and - by a line. The purpose of classification is to find this line and then place any new points into one group or the other. <br />
<br />
[[File:Classification.png|thumb|right|Fig.35 Classify Points into Two Sets]]<br />
<br />
We would like to obtain the distribution in the following equation, where c is the class and x and y are the features of a data point. In simple terms, we would like to find the probability that a point belongs to class c given that its features take the values x and y. <br />
<center><math> P(c|x,y)=\frac{P(c,x,y)}{P(x,y)} = \frac{P(c,x,y)}{\sum_{c}{P(c,x,y)}} </math></center><br />
<br />
===Clustering===<br />
Clustering is an unsupervised learning method that assigns data points to groups (clusters) based on the similarity between the points. Clustering resembles classification, except that we do not know the groups before we gather and examine the data. We would like to find the probability distribution of the following equation without knowing the value of c. <br />
<center><math> P(c|x)=\frac{P(c,x)}{P(x)}\ \ c\ unknown </math></center><br />
<br />
===Density Estimation===<br />
Density Estimation is the problem of modeling a probability density function p(x), given a finite number of data points<br />
drawn from that density function. <br />
<center><math> P(y|x)=\frac{P(y,x)}{P(x)} \ \ x\ unknown </math></center><br />
<br />
We can use graphs to represent the four types of statistical problems that have been introduced so far. The first graph (Fig.36(a)) can be used to represent either the Regression or the Classification problem because both the X and the Y variables are known. The second graph (Fig.36(b)) we see that the value of the Y variable is unknown and so we can tell that this graph represents the Clustering and Density Estimation situation. <br />
<br />
[[File:RegClass.png|thumb|right|Fig.36(a) Regression or classification (b) Clustering or Density Estimation]]<br />
<br />
<br />
==Likelihood Function==<br />
Recall that the probability model <math>p(x|\theta)</math> has the intuitive interpretation of assigning probability to X for each fixed value of <math>\theta</math>. In the Bayesian approach this intuition is formalized by treating <math>p(x|\theta)</math> as a conditional probability distribution. In the Frequentist approach, however, we treat <math>p(x|\theta)</math> as a function of <math>\theta</math> for fixed x, and refer to <math>p(x|\theta)</math> as the likelihood function.<br />
<center><math><br />
L(\theta;x)= p(x|\theta)</math></center><br />
where <math>p(x|\theta)</math> is the likelihood L(<math>\theta, x</math>)<br />
<center><math><br />
l(\theta,x)=log(p(x|\theta))<br />
</math></center><br />
where <math>log(p(x|\theta))</math> is the log likelihood <math>l(\theta, x)</math><br />
<br />
Since <math>p(x)</math> in the denominator of Bayes Rule is independent of <math>\theta</math> we can consider it as a constant and we can draw the conclusion that:<br />
<br />
<center><math><br />
p(\theta|x) \propto p(x|\theta)p(\theta)<br />
</math></center><br />
<br />
Symbolically, we can interpret this as follows:<br />
<center><math><br />
Posterior \propto likelihood \times prior<br />
</math></center><br />
<br />
where we see that in the Bayesian approach the likelihood can be<br />
viewed as a data-dependent operator that transforms between the<br />
prior probability and the posterior probability.<br />
<br />
<br />
===Maximum likelihood===<br />
The idea of maximum likelihood estimation is to find the optimal parameter values by maximizing a likelihood function formed from the training data. Suppose in particular that we force the Bayesian to choose a<br />
particular value of <math>\theta</math>; that is, to collapse the posterior<br />
distribution <math>p(\theta|x)</math> to a point estimate. Various<br />
possibilities present themselves; in particular one could choose the<br />
mean of the posterior distribution or perhaps the mode.<br />
<br />
<br />
(i) the mean of the posterior (expectation):<br />
<center><math><br />
\hat{\theta}_{Bayes}=\int \theta p(\theta|x)\,d\theta<br />
</math></center><br />
<br />
is called ''Bayes estimate''.<br />
<br />
OR<br />
<br />
(ii) the mode of posterior:<br />
<center><math>\begin{matrix}<br />
\hat{\theta}_{MAP}&=&argmax_{\theta} p(\theta|x) \\<br />
&=&argmax_{\theta}p(x|\theta)p(\theta)<br />
\end{matrix}</math></center><br />
<br />
Note that MAP is '''Maximum a posterior'''.<br />
<br />
When the prior probability <math>p(\theta)</math> is taken to be uniform on <math>\theta</math>, the MAP estimate reduces to the maximum likelihood estimate, <math>\hat{\theta}_{ML}</math>:<br />
<center><math> \hat{\theta}_{MAP} \rightarrow \hat{\theta}_{ML}</math></center><br />
<br />
<center><math> MAP = argmax_{\theta} p(x|\theta) p(\theta) </math></center><br />
<br />
When the prior is not uniform, the MAP estimate maximizes the posterior itself; since the logarithm is a monotonic function, taking logs does not alter the optimizing value.<br />
<br />
Thus, one has:<br />
<center><math><br />
\hat{\theta}_{MAP}=argmax_{\theta} \{ log p(x|\theta) + log<br />
p(\theta) \}<br />
</math></center><br />
as an alternative expression for the MAP estimate.<br />
<br />
Here, <math>log (p(x|\theta))</math> is log likelihood and the "penalty" is the<br />
additive term <math>log(p(\theta))</math>. Penalized log likelihoods are widely<br />
used in Frequentist statistics to improve on maximum likelihood<br />
estimates in small sample settings.<br />
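As a concrete illustration of the penalized log-likelihood view (not from the original notes), assume a Bernoulli model with a Beta(a, b) prior on <math>\theta</math>; the MAP estimate then has the closed form <math>\hat{\theta}_{MAP} = (NH + a - 1)/(n + a + b - 2)</math>, and a uniform prior (a = b = 1) recovers the maximum likelihood estimate. <br />

```python
def bernoulli_map(xs, a=1.0, b=1.0):
    """MAP estimate of a Bernoulli parameter under an assumed Beta(a, b) prior.
    With a = b = 1 (uniform prior) this reduces to the ML estimate NH / n."""
    nh, n = sum(xs), len(xs)
    return (nh + a - 1.0) / (n + a + b - 2.0)

data = [1, 1, 1, 0]                      # h, h, h, t
ml_est  = bernoulli_map(data)            # uniform prior -> 3/4, the ML estimate
map_est = bernoulli_map(data, a=2, b=2)  # mild prior pulls the estimate toward 1/2
```

The log of the Beta prior is exactly the additive "penalty" term <math>log(p(\theta))</math> described above. <br />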
<br />
===Example : Bernoulli trials===<br />
<br />
Consider the simple experiment where a biased coin is tossed four times. Suppose now that we also have some data <math>D</math>: <br />e.g. <math>D = \left\lbrace h,h,h,t\right\rbrace </math>. We want to use this data to estimate <math>\theta</math>. The probability of observing head is <math> p(H)= \theta</math> and the probability of observing a tail is <math> p(T)= 1-\theta</math>.<br />
where the conditional probability of a single toss <math>x_i</math> is <center><math> P(x_i|\theta) = \theta^{x_i}(1-\theta)^{(1-x_i)} </math></center><br />
<br />
We would now like to use the ML technique. Since all of the variables are i.i.d., there are no dependencies between them, and so the graphical model has no edges from one node to another.<br />
<br />
How do we find the joint probability distribution function for these variables? Well since they are all independent we can just multiply the marginal probabilities and we get the joint probability. <br />
<center><math>L(\theta;x) = \prod_{i=1}^n P(x_i|\theta)</math></center><br />
This is in fact the likelihood that we want to work with. Now let us try to maximise it: <br />
<center><math>\begin{matrix}<br />
l(\theta;x) & = & log(\prod_{i=1}^n P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(\theta^{x_i}(1-\theta)^{1-x_i}) \\<br />
& = & \sum_{i=1}^n x_ilog(\theta) + \sum_{i=1}^n (1-x_i)log(1-\theta) \\<br />
\end{matrix}</math></center><br />
Take the derivative and set it to zero: <br />
<br />
<center><math> \frac{\partial l}{\partial\theta} = 0 </math></center><br />
<center><math> \frac{\partial l}{\partial\theta} = \sum_{i=1}^{n}\frac{x_i}{\theta} - \sum_{i=1}^{n}\frac{1-x_i}{1-\theta} = 0 </math></center><br />
<center><math> \Rightarrow \frac{\sum_{i=1}^{n}x_i}{\theta} = \frac{\sum_{i=1}^{n}(1-x_i)}{1-\theta} </math></center><br />
<center><math> \frac{NH}{\theta} = \frac{NT}{1-\theta} </math></center> <br />
Where: <br />
NH = the number of observed heads <br /><br />
NT = the number of observed tails <br /><br />
Hence, <math>NT + NH = n</math> <br /><br />
<br />
And now we can solve for <math>\theta</math>: <br />
<br />
<center><math>\begin{matrix}<br />
\theta & = & \frac{(1-\theta)NH}{NT} \\<br />
\theta + \theta\frac{NH}{NT} & = & \frac{NH}{NT} \\<br />
\theta(\frac{NT+NH}{NT}) & = & \frac{NH}{NT} \\<br />
\theta & = & \frac{\frac{NH}{NT}}{\frac{n}{NT}} = \frac{NH}{n}<br />
\end{matrix}</math></center><br />
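The closed form <math>\hat{\theta} = \frac{NH}{n}</math> can be checked numerically. A minimal Python sketch (encoding heads as 1 and tails as 0 is our own choice):<br />

```python
# A minimal numerical check of the closed form above (theta_hat = NH / n).
# Encoding heads as 1 and tails as 0 is a hypothetical choice of our own.
data = [1, 1, 1, 0]  # D = {h, h, h, t}

n = len(data)
NH = sum(data)        # number of observed heads
theta_hat = NH / n    # maximum likelihood estimate

print(theta_hat)      # 0.75
```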
<br />
===Example : Multinomial trials===<br />
Recall from the previous example that a Bernoulli trial has only two outcomes (e.g. Head/Tail, Failure/Success,…). A multinomial trial is a generalization of the Bernoulli trial to <math>K</math> possible outcomes, where <math>K > 2</math>. Let <math> p(k) = \theta_k </math> be the probability of outcome <math>k</math>. The <math>\theta_k</math> parameters must satisfy:<br />
<br />
<math> 0 \leq \theta_k \leq 1</math><br />
<br />
and<br />
<br />
<math> \sum_k \theta_k = 1</math><br />
<br />
Consider the example of rolling a die <math>M</math> times and recording the number of times each of the die's six faces is observed. Let <math> N_k </math> be the number of times that face <math>k</math> was observed.<br />
<br />
Let <math>[x^m = k]</math> be a binary indicator that equals one if <math>x^m = k</math> and zero otherwise. The log-likelihood function for the multinomial distribution is:<br />
<br />
<math>l(\theta; D) = log( p(D|\theta) )</math><br />
<br />
<math>= log(\prod_m \theta_{x^m})</math><br />
<br />
<math>= log(\prod_m \theta_{1}^{[x^m = 1]} ... \theta_{k}^{[x^m = k]})</math><br />
<br />
<math>= \sum_k log(\theta_k) \sum_m [x^m = k]</math><br />
<br />
<math>= \sum_k N_k log(\theta_k)</math><br />
<br />
Take the derivative and set it to zero. Since the <math>\theta_k</math> must sum to one, we add a Lagrange multiplier term <math>\lambda(1 - \sum_k \theta_k)</math> to the objective before differentiating; solving the constraint shows <math>\lambda = M</math>:<br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = 0</math><br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = \frac{N_k}{\theta_k} - M = 0</math><br />
<br />
<math>\Rightarrow \theta_k = \frac{N_k}{M}</math><br />
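The estimate <math>\theta_k = \frac{N_k}{M}</math> is just the empirical frequency of each outcome. A short sketch with a hypothetical set of die rolls:<br />

```python
# Sketch of the multinomial MLE theta_k = N_k / M for the die example.
# The list of rolls is hypothetical data.
from collections import Counter

rolls = [1, 3, 3, 6, 2, 3, 5, 1, 4, 6]
M = len(rolls)
counts = Counter(rolls)                              # N_k for each face k
theta_hat = {k: counts[k] / M for k in range(1, 7)}

print(theta_hat[3])   # face 3 is seen 3 times in 10 rolls: 0.3
```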
<br />
<br />
===Example: Univariate Normal===<br />
Now let us assume that the observed values come from a normal distribution. <br /><br />
Our new model looks like:<br />
<center><math>P(x_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}} </math></center><br />
Now to find the likelihood we once again multiply the independent marginal probabilities to obtain the joint probability and the likelihood function. <br />
<center><math> L(\theta;x) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}}</math></center> <br />
<center><math> \max_{\theta}l(\theta;x) = \max_{\theta}\sum_{i=1}^{n}\left(-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^{2}+\log\frac{1}{\sqrt{2\pi}\sigma}\right) </math></center><br />
Now, since our parameter theta is in fact a set of two parameters, <br />
<center><math>\theta = (\mu, \sigma)</math></center><br />
we must estimate each of the parameters separately. <br />
<center><math>\frac{\partial l}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i</math></center><br />
<center><math>\frac{\partial l}{\partial \sigma^{2}} = \frac{1}{2\sigma ^4} \sum _{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2} \frac{1}{\sigma ^2} = 0</math></center><br />
<center><math> \Rightarrow \hat{\sigma} ^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 </math></center><br />
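These estimates are the sample mean and the biased (divide-by-<math>n</math>, not <math>n-1</math>) sample variance. A quick numerical sketch on a hypothetical sample:<br />

```python
# Numerical sketch of the Gaussian MLEs derived above: mu_hat is the sample
# mean, and sigma2_hat divides by n (not n - 1), i.e. the biased MLE variance.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical sample

n = len(data)
mu_hat = sum(data) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n

print(mu_hat, sigma2_hat)   # 5.0 4.0
```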
<br />
==Discriminative vs Generative Models==<br />
[[File:GenerativeModel.png|thumb|right|Fig.36i Generative Model represented in a graph.]]<br />
(beginning of Oct. 18)<br />
<br />
If we call the evidence/features variable <math>X\,\!</math> and the output variable <math>Y\,\!</math>, one way to model a classifier is to base the definition of the joint distribution on <math>p(X|Y)\,\!</math>, and another is to base it on <math>p(Y|X)\,\!</math>. The first of these two approaches is called generative, while the second is called discriminative. The reason for this naming becomes clear when one looks at the way each conditional probability function tries to present a model. In practice, using generative models (e.g. the Bayes classifier) often requires assumptions which may not be valid for the problem at hand, and hence makes the model depart from the primary intentions of the design. This may not be the case for discriminative models (e.g. logistic regression), as they do not depend on many assumptions besides the given data.<br />
<br />
[[File:DiscriminativeModel.png|thumb|right|Fig.36ii Discriminative Model represented in a graph.]]<br />
<br />
Given <math>N</math> variables, we have a full joint distribution in a generative model. In this model we can identify the conditional independencies between various random variables. This joint distribution can be factorized into various conditional distributions. One can also define the prior distributions that affect the variables.<br />
Here is an example that represents a generative model for classification in terms of a directed graphical model, shown in Figure 36i. To fit the model, the class-conditional probability <math>P(X|Y)</math> and the prior <math>P(Y)</math> have to be estimated. Examples that use generative approaches are Hidden Markov models, Markov random fields, etc. <br />
<br />
The discriminative approach to classification is displayed in terms of a graph in Figure 36ii. In discriminative models the dependencies between the various random variables are not explicitly defined; instead, we estimate the conditional probability <math>P(Y|X)</math> directly. Examples that use the discriminative approach are neural networks, logistic regression, etc.<br />
<br />
Sometimes it becomes very hard to compute <math>P(X|Y)</math> when <math>X</math> is high-dimensional (like image data). Hence, we tend to omit this intermediate step and estimate <math>P(Y|X)</math> directly. In high dimensions we may also assume the features are independent, so that the model does not overfit.<br />
<br />
==Markov Models==<br />
Markov models, introduced by Andrey (Andrei) Andreyevich Markov as a way of modeling Russian poetry, are known as a good way of modeling processes that progress over time or space. Basically, a Markov model can be formulated as follows:<br />
<br />
<center><math><br />
y_t=f(y_{t-1},y_{t-2},\ldots,y_{t-k})<br />
</math></center><br />
<br />
This can be interpreted as the dependence of the current state of a variable on its last <math>k</math> states. (Fig. XX)<br />
<br />
The Maximum Entropy Markov model is a type of Markov model which makes the current state of a variable dependent on some global variables, besides the local dependencies. As an example, we can regard the sequence of words in a text as a local variable, as the appearance of each word depends mostly on the words that come before it (n-grams). However, the role of POS (part of speech) tags cannot be denied, as they clearly affect the sequence of words. In this example, the POS tags are global dependencies, whereas the preceding words are local ones.<br />
===Markov Chain===<br />
The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time. In this context, the Markov property suggests that the distribution for this variable depends only on the distribution of the previous state.<br />
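The Markov property can be illustrated by sampling from a small chain; the states and transition matrix below are hypothetical:<br />

```python
# A small sketch of a first-order Markov chain: the next state depends only
# on the current one. The states and transition matrix are hypothetical.
import random

A = {"sunny": {"sunny": 0.8, "rainy": 0.2},   # A[s][s'] = P(next = s' | current = s)
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def sample_chain(start, T, rng=None):
    rng = rng or random.Random(0)
    seq = [start]
    for _ in range(T):
        p_sunny = A[seq[-1]]["sunny"]
        seq.append("sunny" if rng.random() < p_sunny else "rainy")
    return seq

print(sample_chain("sunny", 5))   # a length-6 state sequence
```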
<br />
==Hidden Markov Models (HMM)==<br />
Markov models fail to address scenarios in which the states of a series cannot be observed directly, but only through observations that are probabilistic functions of those hidden states. Hidden Markov models extend Markov models to such scenarios, where each observation is a probabilistic function of a hidden state. An example of an HMM is the formation of a DNA sequence: a hidden process generates amino acids depending on some probabilities to determine an exact sequence. The main questions that can be answered with an HMM are the following:<br />
<br />
* How can one estimate the probability of occurrence of an observation sequence?<br />
* How can we choose the state sequence such that the joint probability of the observation sequence is maximized?<br />
* How can we describe an observation sequence through the model parameters?<br />
<br />
[[File:HMMorder1.png|thumb|right|Fig.37 Hidden Markov model of order 1.]]<br />
<br />
An example of an HMM of order 1 is displayed in Figure 37. The most common examples arise in the study of gene analysis or gene sequencing, and the joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4,y_5) = P(y_1)P(y_2|y_1)P(y_3|y_2)P(y_4|y_3)P(y_5|y_4). </math></center><br />
<br />
[[File:HMMorder2.png|thumb|right|Fig.38 Hidden Markov model of order 2.]]<br />
<br />
An HMM of order 2 is displayed in Figure 38. The joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4) = P(y_1,y_2)P(y_3|y_1,y_2)P(y_4|y_2,y_3). </math></center><br />
<br />
In a Hidden Markov Model (HMM) we consider that we have two levels of random variables. The first level is called the hidden layer because the random variables in that level cannot be observed. The second layer is the observed or output layer. We can sample from the output layer but not the hidden layer. The only information we know about the hidden layer is that it affects the output layer. The HMM model can be graphed as shown in Figure 39. <br />
<br />
P.S. The latent variables in Figure 39 are discrete, as we are discussing HMMs. When such variables are continuous, the corresponding model is factor analysis rather than an HMM.<br />
[[File:HMM.png|thumb|right|Fig.39 Hidden Markov Model]]<br />
<br />
In the model the <math>q_i</math>s are the hidden layer and the <math>y_i</math>s are the output layer. The <math>y_i</math>s are shaded because they have been observed. The parameters that need to be estimated are <math> \theta = (\pi, A, \eta)</math>, where <math>\pi</math> is the initial state distribution: <math>\pi_i</math> is the probability that <math>q_0</math> is in state <math>i</math>. The matrix <math>A</math> is the transition matrix for the states <math>q_t</math> and <math>q_{t+1}</math>, and gives the probability of changing states as we move from one step to the next. Finally, <math>\eta</math> parameterizes the emission probabilities: the probability that <math>y_t</math> takes value <math>y^*</math> given that <math>q_t</math> is in state <math>q^*</math>. <br />
For the HMM our data comes from the output layer: <br />
<center><math> Data = (y_{0i}, y_{1i}, y_{2i}, ... , y_{Ti}) \text{ for } i = 1...n </math></center><br />
We can now write the joint pdf as:<br />
<center><math> P(q, y) = p(q_0)\prod_{t=0}^{T-1}P(q_{t+1}|q_t)\prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
We can use <math>a_{ij}</math> to represent the i,j entry in the matrix A. We can then define:<br />
<center><math> P(q_{t+1}|q_t) = \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} </math></center><br />
We can also define:<br />
<center><math> p(q_0) = \prod_{i=1}^M (\pi_i)^{q_0^i} </math></center><br />
Now, if we take Y to be multinomial we get:<br />
<center><math> P(y_t|q_t) = \prod_{i,j=1}^M (\eta_{ij})^{y_t^i q_t^j} </math></center><br />
The random variable Y does not have to be multinomial, this is just an example. We can combine the first two of these definitions back into the joint pdf to produce:<br />
<center><math> P(q, y) = \prod_{i=1}^M (\pi_i)^{q_0^i}\prod_{t=0}^{T-1} \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
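The factored joint distribution above can be evaluated for a concrete state/observation pair. In this sketch the parameters <math>\pi</math>, <math>A</math> and <math>\eta</math> are hypothetical, and states and observations are indexed by 0 and 1:<br />

```python
# Evaluate P(q, y) = p(q_0) * prod_t P(q_{t+1}|q_t) * prod_t P(y_t|q_t)
# for one concrete hidden-state sequence q and observation sequence y.
pi = [0.6, 0.4]                  # initial state distribution
A = [[0.7, 0.3], [0.2, 0.8]]     # A[i][j] = P(q_{t+1} = j | q_t = i)
eta = [[0.9, 0.1], [0.3, 0.7]]   # eta[j][i] = P(y_t = i | q_t = j)

def joint(q, y):
    p = pi[q[0]] * eta[q[0]][y[0]]
    for t in range(1, len(q)):
        p *= A[q[t - 1]][q[t]] * eta[q[t]][y[t]]
    return p

print(joint([0, 0, 1], [0, 1, 1]))
```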
We can go on to the E-Step with this new joint pdf. In the E-Step we need to find the expectation of the missing data given the observed data and the initial values of the parameters. Suppose that we only sample once so <math>n=1</math>. Take the log of our pdf and we get:<br />
<center><math> l_c(\theta; q, y) = \sum_{i=1}^M {q_0^i}log(\pi_i) + \sum_{t=0}^{T-1} \sum_{i,j=1}^M {q_t^i q_{t+1}^j} log(a_{ij}) + \sum_{t=0}^{T}log(P(y_t|q_t)) </math></center><br />
Then we take the expectation for the E-Step:<br />
<center><math> E[l_c(\theta; q, y)] = \sum_{i=1}^M E[q_0^i]log(\pi_i) + \sum_{t=0}^{T-1} \sum_{i,j=1}^M E[q_t^i q_{t+1}^j] log(a_{ij}) + \sum_{t=0}^{T}E[log(P(y_t|q_t))] </math></center><br />
If we continue with our multinomial example then we would get:<br />
<center><math> \sum_{t=0}^{T}E[log(P(y_t|q_t))] = \sum_{t=0}^{T}\sum_{i,j=1}^M E[q_t^j] y_t^i log(\eta_{ij}) </math></center><br />
So now we need to calculate <math>E[q_0^i]</math> and <math> E[q_t^i q_{t+1}^j] </math> in order to find the expectation of the log likelihood. Let's define some variables to represent each of these quantities. <br /><br />
Let <math> \gamma_0^i = E[q_0^i] = P(q_0^i=1|y, \theta^{(t)}) </math>. <br /><br />
Let <math> \xi_{t,t+1}^{ij} = E[q_t^i q_{t+1}^j] = P(q_t^i = 1, q_{t+1}^j = 1|y, \theta^{(t)}) </math>. <br /><br />
We could use the sum product algorithm to calculate the above equations.<br />
<br />
==Graph Structure==<br />
Up to this point, we have covered many topics about graphical models, assuming that the graph structure is given. However, finding an optimal structure for a graphical model is a challenging problem in itself. In this section, we assume that the graphical model we are looking for is expressible in the form of a tree. As a reminder, an undirected graph is a tree if there is one and only one path between each pair of nodes. For directed graphs, on top of this condition, we also need every node to have at most one parent - in other words, no explaining-away structures.<br />
<br />
First, let us show that whether a graph is directed or undirected does not affect the joint distribution function, as long as the graph is a tree. Here is how one can write down the joint distribution of the graph of Fig. XX.<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2).\,\!<br />
</math></center><br />
<br />
Now, if we change the direction of the connecting edge between <math>x_1</math> and <math>x_2</math>, we will have the graph of Fig. XX and the corresponding joint distribution function will change as follows:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_2)p(x_1|x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which can be simply re-written as:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1,x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which is the same as the first function. We will rely on this simple observation and leave the general proof to the enthusiastic reader.<br />
<br />
===Maximum Likelihood Tree===<br />
We want to compute the tree that maximizes the likelihood of a given set of data. The optimality of a tree structure can be discussed in terms of the likelihood of the set of variables. To this end, we define a fully connected weighted graph, set each edge weight to a measure of the dependence between the two connected random variables, and then run a maximum weight spanning tree algorithm. Here is how it works.<br />
<br />
We have defined the joint distribution as follows: <br />
<center><math><br />
p(x)=\prod_{i\in V}p(x_i)\prod_{i,j\in E}\frac{p(x_i,x_j)}{p(x_i)p(x_j)}<br />
</math></center><br />
Where <math>V</math> and <math>E</math> are respectively the sets of vertices and edges of the corresponding graph. This factorization holds only as long as the graphical model is a tree, since the direction of the dependence between <math>x_i</math> and <math>x_j</math> can then be chosen arbitrarily; this is not the case for non-tree graphical models.<br />
<br />
Maximizing the joint probability distribution over the given set of data samples <math>X</math>, with the objective of parameter estimation (MLE), we have:<br />
<center><math><br />
L(\theta|X):p(X|\theta)=\prod_{i\in V}p(x_i|\theta)\prod_{i,j\in E}\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
And by taking the logarithm of <math>L(\theta|X)</math> (log-likelihood), we will get:<br />
<br />
<center><math><br />
l=\sum_{i\in V}\log p(x_i|\theta)+\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
The first term in the above equation says nothing about the topology or structure of the tree, as it is defined over single nodes. Since the single-node probabilities play no role in optimizing the tree structure, we can define the cost function for our optimization problem as:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
Where the subscript r stands for "reduced". By replacing the probability functions with the frequencies of occurrence of each state, we get:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\sum_{s,t}N_{ijst}\log\frac{N\,N_{ijst}}{N_{is}N_{jt}}<br />
</math></center><br />
<br />
Where we have assumed that <math>p(x_i=s,x_j=t)=\frac{N_{ijst}}{N}</math>, <math>p(x_i=s)=\frac{N_{is}}{N}</math>, and <math>p(x_j=t)=\frac{N_{jt}}{N}</math>. The resulting expression is proportional to the mutual information of the two random variables <math>x_i</math> and <math>x_j</math>, where <math>s</math> and <math>t</math> index their states.<br />
<br />
This shows how to define the weights for the edges of a fully connected graph: the weight of the edge between <math>x_i</math> and <math>x_j</math> is their mutual information. It then remains to run a maximum weight spanning tree algorithm on the resulting graph to find the optimal tree structure.<br />
It is important to note that this problem had been solved in graph theory before graphical models were developed. Our problem here was purely probabilistic, but using graphical models we could find an equivalent graph theory problem. This shows how graphical models can help us use powerful graph theory tools to solve probabilistic problems.<br />
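The whole procedure (empirical mutual information as edge weights, then a maximum weight spanning tree) can be sketched on a toy binary data set. This is an illustrative implementation of our own; the data and the Kruskal-style spanning tree routine are hypothetical choices:<br />

```python
# Maximum likelihood tree sketch: weight each pair of variables by empirical
# mutual information, then keep the heaviest edges that do not form a cycle.
from math import log
from itertools import combinations

data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
N = len(data)
V = range(3)

def mutual_info(i, j):
    mi = 0.0
    for s in (0, 1):
        for t in (0, 1):
            n_st = sum(1 for x in data if x[i] == s and x[j] == t)
            n_s = sum(1 for x in data if x[i] == s)
            n_t = sum(1 for x in data if x[j] == t)
            if n_st:
                mi += (n_st / N) * log(N * n_st / (n_s * n_t))
    return mi

# Kruskal's algorithm with a trivial union-find.
edges = sorted(((mutual_info(i, j), i, j) for i, j in combinations(V, 2)),
               reverse=True)
parent = list(V)
def find(u):
    while parent[u] != u:
        u = parent[u]
    return u

tree = []
for w, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((i, j))

print(tree)   # x0 and x1 are perfectly correlated here, so edge (0, 1) is kept
```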
<br />
==Latent Variable Models==<br />
(beginning of Oct. 20) Assuming that we have thoroughly observed, or even identified, all of the random variables of a model can be a very naive assumption, as one can think of many instances to the contrary. To make a model as rich as possible (there is always a trade-off between richness and complexity, so we do not want to inject unnecessary complexity into our model either), the concept of latent variables has been introduced to graphical models.<br />
<br />
First let's define latent variables. Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.<br />
<br />
Depending on the position of an unobserved variable, <math>z</math>, we take different actions. If no variable is conditioned on <math>z</math>, we can integrate/sum it out and it will never be noticed, as it is neither evidence nor a query. However, we will need to model an unobserved variable like <math>z</math> if other variables are conditioned on it.<br />
<br />
The use of latent variables makes a model harder to analyze and to learn. Taking the log-likelihood usually makes the objective easier to work with, since the log of a product becomes a sum of logs; this is no longer the case once latent variables are introduced, because the probability of the observed data involves a sum over the latent variable, which prevents the log from acting directly on the product.<br />
<br />
<center><math><br />
l(\theta,D) = \log\sum_{z}p(x,z|\theta).\,<br />
</math></center><br />
<br />
As an example of latent variables, one can think of a mixture density model. Several component models come together to build the final model, and it takes one more random variable to say which of those components generated each new sample point. This affects both the learning and recall phases.<br />
<br />
== EM Algorithm ==<br />
Oct. 25th<br />
=== Introduction ===<br />
In the last section, graphical models with latent variables were discussed. It was mentioned that, for example, if fitting a single standard distribution to a data set is too restrictive, one may model the data set using a mixture of well-known distributions such as Gaussians. A hidden variable is then needed to determine the weight of each Gaussian component. Parameter learning in graphical models with latent variables is more complicated than in models with no latent variable.<br />
<br />
Consider Fig.40, which depicts a simple graphical model with two nodes. By convention, the unobserved variable <math> Z </math> is unshaded. To compare the complexity of fully observed models with that of models with hidden variables, let us first suppose variables <math> Z </math> and <math> X </math> are both observed. We may interpret this problem as a classification problem where <math> Z </math> is the class label and <math> X </math> is the data set. In addition, we assume the distribution over the members of each group is Gaussian. Thus, the learning task is to determine the label <math> Z </math> from the training set by maximizing the posterior: <br />
<br />
[[File:GMwithLatent.png|thumb|right|Fig.40 A simple graphical model with a latent variable.]]<br />
<br />
<center><math><br />
P(z|x) = \frac{P(x|z)P(z)}{P(x)},<br />
</math></center> <br />
<br />
For simplicity, we assume there are two classes generating the data set <math> X</math>: <math> Z = 1 </math> and <math> Z = 0 </math>. Therefore one can easily find the posterior <math> P(z=1|x) </math> using:<br />
<br />
<center><math><br />
P(z = 1|x) = \frac{N(x; \mu_1, \sigma_1)\pi_1}{N(x; \mu_1, \sigma_1)\pi_1 + N(x; \mu_0, \sigma_0)\pi_0},<br />
</math></center> <br />
<br />
On the contrary, if <math> Z </math> is unknown we are not able to easily write the posterior and consequently parameter estimation is more difficult. In the case of graphical models with latent variables, we first assume the latent variable is somehow known, and thus writing the posterior becomes easy. Then, we are going to make the estimation of <math> Z </math> more accurate. For instance, if the task is to fit a set of data derived from unknown sources with mixtures of Gaussian distribution, we may assume the data is derived from two sources whose distributions are Gaussian. The first estimation might not be accurate, yet we introduce an algorithm by which the estimation is becoming more accurate using an iterative approach. In this section we see how the parameter learning for these graphical models is performed using EM algorithm.<br />
<br />
=== EM Method ===<br />
<br />
EM (Expectation-Maximization) is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. Consider a probabilistic model in which we collectively denote all of the observed variables by X and all of the hidden variables by Z, resulting in a simple graphical model with two nodes (Fig. 40). The joint distribution<br />
<math> p(X,Z|\theta) </math> is governed by a set of parameters, <math>\theta</math>. The task is to maximize the likelihood function that is given by:<br />
<br />
<center><math><br />
l_c(\theta; x,z) = log P(x,z | \theta)<br />
</math></center> <br />
<br />
<br />
which is called the "complete log likelihood". In the above equation the <math>x</math> values represent the observed data and the <math>z</math> values represent the missing data (sometimes called latent data). The question is how to estimate the parameters <math>\theta_i</math> when we do not have all the data we need. We can use the Expectation-Maximization (EM) algorithm to estimate the parameters of the model even though we do not have a complete data set. <br />
To simplify the problem we define the following type of likelihood:<br />
<br />
<center><math><br />
l(\theta; x) = log(P(x | \theta))<br />
</math></center> <br />
<br />
which is called "incomplete log likelihood". We can rewrite the incomplete likelihood in terms of the complete likelihood. This equation is in fact the discrete case but to convert to the continuous case all we have to do is turn the summation into an integral. <br />
<center><math> l(\theta; x) = log(P(x | \theta)) = log(\sum_zP(x, z|\theta)) </math></center><br />
Since the z has not been observed that means that <math>l_c</math> is in fact a random quantity. In that case we can define the expectation of <math>l_c</math> in terms of some arbitrary density function <math>q(z|x)</math>. <br />
<br />
<center><math> l(\theta;x) = log P(x|\theta) = log \sum_z P(x,z|\theta) = log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} </math></center><br />
<br />
====Jensen's Inequality====<br />
In order to properly derive the formula for the EM algorithm we need to first introduce the following theorem. <br />
<br />
For any '''concave''' function f: <br />
<center><math> f(\alpha x_1 + (1-\alpha)x_2) \geqslant \alpha f(x_1) + (1-\alpha)f(x_2) </math></center> <br />
This can be shown intuitively through a graph. In Fig. 41, point A is the value of the function f at the convex combination and point B is the value of the right-hand side of the inequality. On the graph one can see why point A lies above point B for a concave function. <br />
<br />
[[File:inequality.png|thumb|right|Fig.41 Jensen's Inequality]]<br />
<br />
For us it is important that the log function is '''concave''' , and thus:<br />
<br />
<center><math><br />
log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} \geqslant \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} = F(\theta, q) <br />
</math></center><br />
<br />
The function <math> F (\theta, q) </math> is called the auxiliary function and it is used in the EM algorithm. As seen in the above equation, <math> F(\theta, q) </math> is a lower bound of the incomplete log likelihood, and one way to maximize the incomplete likelihood is to increase its lower bound. The EM algorithm repeats two steps, one after the other, to give better estimates for <math>q(z|x)</math> and <math>\theta</math>. As the steps are repeated, the parameters converge to a local maximum of the likelihood function. <br />
<br />
In the first step we assume <math> \theta </math> is known, and the goal is to find the <math> q </math> that maximizes the lower bound. Second, we suppose <math> q </math> is known and find <math> \theta </math>. In other words:<br />
<br />
'''E-Step''' <br />
<center><math> q^{t+1} = argmax_{q} F(\theta^t, q) </math></center><br />
<br />
'''M-Step'''<br />
<center><math> \theta^{t+1} = argmax_{\theta} F(\theta, q^{t+1}) </math></center><br />
<br />
==== M-Step Explanation ====<br />
<br />
<center><math>\begin{matrix}<br />
F(q;\theta) & = & \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} \\<br />
& = & \sum_z q(z|x)log(P(x,z|\theta)) - \sum_z q(z|x)log(q(z|x))\\<br />
\end{matrix}</math></center><br />
<br />
Since the second part of the equation is only a constant with respect to <math>\theta</math>, in the M-step we only need to maximize the expectation of the COMPLETE likelihood. The complete likelihood is the only part that still depends on <math>\theta</math>.<br />
<br />
==== E-Step Explanation ====<br />
<br />
In this step we are trying to find an estimate for <math>q(z|x)</math>. To do this we have to maximize <math> F(q;\theta^{(t)})</math>.<br />
<center><math><br />
F(q;\theta^{t}) = \sum_z q(z|x) log(\frac{P(x,z|\theta)}{q(z|x)}) <br />
</math></center><br />
<br />
'''Claim:''' It can be shown that to maximize the auxiliary function one should set <math>q(z|x)</math> to <math> p(z|x,\theta^{(t)})</math>. Replacing <math>q(z|x)</math> with <math>P(z|x,\theta^{(t)})</math> results in:<br />
<center><math>\begin{matrix}<br />
F(q;\theta^{t}) & = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(x,z|\theta)}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(z|x,\theta^{(t)})P(x|\theta^{(t)})}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(P(x|\theta^{(t)})) \\<br />
& = & log(P(x|\theta^{(t)})) \\<br />
& = & l(\theta; x)<br />
\end{matrix}</math></center><br />
<br />
Recall that <math>F(q;\theta^{(t)})</math> is a lower bound of <math> l(\theta; x) </math>; since the choice <math>q(z|x) = P(z|x,\theta^{(t)})</math> makes the bound equal to <math> l(\theta^{(t)}; x) </math>, it is in fact the maximizer of <math>F(q;\theta^{(t)})</math>. Therefore the E-Step has a closed form: at each iteration we simply set <math>q(z|x) = P(z|x,\theta^{(t)})</math> and use the result in the M-Step. <br />
<br />
The EM algorithm is a two-stage iterative optimization technique for finding<br />
maximum likelihood solutions. Suppose that the current value of the parameter vector is <math> \theta^t </math>. In the E step, the<br />
lower bound <math> F(q, \theta^t) </math> is maximized with respect to <math> q(z|x) </math> while <math> \theta^t </math> is fixed.<br />
As mentioned above, the solution to this maximization problem is to set <math> q(z|x) </math> to <math> p(z|x,\theta^t) </math>: since the value of the incomplete log likelihood <math> log \, p(X|\theta^t) </math> does not depend on <math> q(z|x) </math>, the largest value of <math> F(q, \theta^t) </math> is achieved with this choice. In this case the lower bound equals the incomplete log likelihood.<br />
<br />
=== Alternative steps for the EM algorithms ===<br />
From the above results we can derive an alternative representation of the EM algorithm: <br />
<br />
'''E-Step''' <br /><br />
Compute <math> E[l_c(\theta; x, z)]_{P(z|x, \theta^{(t)})} </math>. <br /><br />
'''M-Step''' <br /><br />
Maximise <math> E[l_c(\theta; x, z)]_{P(z|x, \theta^{(t)})} </math> with respect to <math>\theta</math>. <br />
<br />
The EM Algorithm is probably best understood through examples.<br />
<br />
====EM Algorithm Example====<br />
<br />
Suppose we have the two independent and identically distributed random variables:<br />
<center><math> Y_1, Y_2 \sim P(y|\theta) = \theta e^{-\theta y} </math></center><br />
In our case <math>y_1 = 5</math> has been observed but <math>y_2 = ?</math> has not. Our task is to find an estimate for <math>\theta</math>. We will first try to solve the problem without the EM algorithm. Luckily this problem is simple enough to be solvable without the need for EM. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta} \\<br />
l(\theta; Data) & = & log(\theta)- 5\theta<br />
\end{matrix}</math></center><br />
We take our derivative:<br />
<center><math>\begin{matrix}<br />
& \frac{dl}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{1}{\theta}-5 & = 0 \\<br />
\Rightarrow & \theta & = 0.2<br />
\end{matrix}</math></center><br />
And now we can try the same problem with the EM Algorithm. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta}\theta e^{-y_2\theta} \\<br />
l(\theta; Data) & = & 2log(\theta) - 5\theta - y_2\theta<br />
\end{matrix}</math></center><br />
E-Step <br />
<center><math> E[l_c(\theta; Data)]_{P(y_2|y_1, \theta)} = 2log(\theta) - 5\theta - \frac{\theta}{\theta^{(t)}}</math></center><br />
M-Step<br />
<center><math>\begin{matrix}<br />
& \frac{dl_c}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{2}{\theta}-5 - \frac{1}{\theta^{(t)}} & = 0 \\<br />
\Rightarrow & \theta^{(t+1)} & = \frac{2\theta^{(t)}}{5\theta^{(t)}+1}<br />
\end{matrix}</math></center><br />
Now we pick an initial value for <math>\theta</math>. Usually we want to pick something reasonable. In this case it does not matter that much and we can pick <math>\theta = 10</math>. Now we repeat the M-Step until the value converges.<br />
<center><math>\begin{matrix}<br />
\theta^{(1)} & = & 10 \\<br />
\theta^{(2)} & = & 0.392 \\<br />
\theta^{(3)} & = & 0.2648 \\<br />
... & & \\<br />
\theta^{(k)} & \simeq & 0.2 <br />
\end{matrix}</math></center><br />
And as we can see after a number of steps the value converges to the correct answer of 0.2. In the next section we will discuss a more complex model where it would be difficult to solve the problem without the EM Algorithm.<br />
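The fixed-point update above is easy to run in code. A minimal Python sketch (the function name and stopping tolerance are our own choices):<br />

```python
# Fixed-point iteration from the M-step: theta^(t+1) = 2*theta^(t) / (5*theta^(t) + 1).
def em_exponential(theta=10.0, tol=1e-8, max_iter=1000):
    """Repeat the update until successive values of theta agree to within tol."""
    for _ in range(max_iter):
        theta_new = 2.0 * theta / (5.0 * theta + 1.0)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

print(em_exponential())  # converges to 0.2, the ML estimate found directly above
```

Starting from <math>\theta^{(1)} = 10</math>, the iterates reproduce the sequence above and settle at 0.2.<br />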
<br />
===Mixture Models===<br />
In this section we discuss what will happen if the random variables are not identically distributed. The data will now sometimes be sampled from one distribution and sometimes from another. <br />
<br />
====Mixture of Gaussian ====<br />
<br />
Given <math>P(x|\theta) = \alpha N(x;\mu_1,\sigma_1) + (1-\alpha)N(x;\mu_2,\sigma_2)</math>. We sample the data, <math>Data = \{x_1,x_2...x_n\} </math> and we know that <math>x_1,x_2...x_n</math> are iid. from <math>P(x|\theta)</math>.<br /><br />
We would like to find:<br />
<center><math>\theta = \{\alpha,\mu_1,\sigma_1,\mu_2,\sigma_2\} </math></center><br />
<br />
We have no missing data here so we can try to find the parameter estimates using the ML method. <br />
<center><math> L(\theta; Data) = \prod_{i=1}^{n} \left( \alpha N(x_i; \mu_1, \sigma_1) + (1 - \alpha) N(x_i; \mu_2, \sigma_2) \right) </math></center><br />
We would then take the log to find <math>l(\theta; Data)</math>, differentiate with respect to each parameter, and set each derivative equal to zero. That is a lot of work, because the Gaussian is not an easy distribution to work with and we have 5 parameters. <br /><br />
It is actually easier to apply the EM algorithm. The only thing is that the EM algorithm works with missing data and here we have all of our data. The solution is to introduce a latent variable z. We are basically introducing missing data to make the calculation easier to compute. <br />
<center><math> z_i = 1 \text{ with prob. } \alpha </math></center><br />
<center><math> z_i = 0 \text{ with prob. } (1-\alpha) </math></center><br />
Now we have a data set that includes our latent variable <math>z_i</math>:<br />
<center><math> Data = \{(x_1,z_1),(x_2,z_2)...(x_n,z_n)\} </math></center><br />
We can calculate the joint pdf by: <br />
<center><math> P(x_i,z_i|\theta)=P(x_i|z_i,\theta)P(z_i|\theta) </math></center><br />
Let<br />
<center><math> P(x_i|z_i,\theta)=\begin{cases}
\phi_1(x_i)=N(x;\mu_1,\sigma_1), & \hbox{if } z_i = 1 \\
\phi_2(x_i)=N(x;\mu_2,\sigma_2), & \hbox{if } z_i = 0
\end{cases} </math></center><br />
Now we can write <br />
<center><math> P(x_i|z_i,\theta)=\phi_1(x_i)^{z_i} \phi_2(x_i)^{1-z_i} </math></center><br />
and <br />
<center><math> P(z_i)=\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
We can write the joint pdf as:<br />
<center><math> P(x_i,z_i|\theta)=\phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
From the joint pdf we can get the likelihood function as: <br />
<center><math> L(\theta;D)=\prod_{i=1}^n \phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
Then take the log and find the log likelihood:<br />
<center><math> l_c(\theta;D)=\sum_{i=1}^n z_i \log\phi_1(x_i) + (1-z_i)\log\phi_2(x_i) + z_i\log\alpha + (1-z_i)\log(1-\alpha) </math></center><br />
In the E-step we need to find the expectation of <math>l_c</math><br />
<center><math> E[l_c(\theta;D)] = \sum_{i=1}^n E[z_i]\log\phi_1(x_i)+(1-E[z_i])\log\phi_2(x_i)+E[z_i]\log\alpha+(1-E[z_i])\log(1-\alpha) </math></center><br />
For now we can assume that <math><z_i></math> is known and assign it a value, let <math> <z_i>=w_i</math><br /><br />
In the M-step, we update the parameter estimates, treating the expectation as fixed:<br />
<center><math> \theta^{(t+1)} \leftarrow \arg\max_{\theta} E[l_c(\theta;D)] </math></center><br />
Taking partial derivatives of the expected complete log likelihood with respect to the parameters and setting them equal to zero, we get the estimated parameters at <math>(t+1)</math>.<br />
<center><math>\begin{matrix}<br />
\frac{d}{d\alpha} = 0 \Rightarrow & \sum_{i=1}^n \frac{w_i}{\alpha}-\frac{1-w_i}{1-\alpha} = 0 & \Rightarrow \alpha=\frac{\sum_{i=1}^n w_i}{n} \\<br />
\frac{d}{d\mu_1} = 0 \Rightarrow & \sum_{i=1}^n w_i(x_i-\mu_1)=0 & \Rightarrow \mu_1=\frac{\sum_{i=1}^n w_ix_i}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\mu_2}=0 \Rightarrow & \sum_{i=1}^n (1-w_i)(x_i-\mu_2)=0 & \Rightarrow \mu_2=\frac{\sum_{i=1}^n (1-w_i)x_i}{\sum_{i=1}^n (1-w_i)} \\<br />
\frac{d}{d\sigma_1^2} = 0 \Rightarrow & \sum_{i=1}^n w_i(-\frac{1}{2\sigma_1^{2}}+\frac{(x_i-\mu_1)^2}{2\sigma_1^4})=0 & \Rightarrow \sigma_1^2=\frac{\sum_{i=1}^n w_i(x_i-\mu_1)^2}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\sigma_2^2} = 0 \Rightarrow & \sum_{i=1}^n (1-w_i)(-\frac{1}{2\sigma_2^{2}}+\frac{(x_i-\mu_2)^2}{2\sigma_2^4})=0 & \Rightarrow \sigma_2^2=\frac{\sum_{i=1}^n (1-w_i)(x_i-\mu_2)^2}{\sum_{i=1}^n (1-w_i)}<br />
\end{matrix}</math></center><br />
We can verify that the results of the estimated parameters all make sense by considering what we know about the ML estimates from the standard Gaussian. But we are not done yet. We still need to compute <math><z_i>=w_i</math> in the E-step. <br />
<center><math>\begin{matrix}<br />
<z_i> & = & E_{z_i|x_i,\theta^{(t)}}(z_i) \\<br />
& = & \sum_z z_i P(z_i|x_i,\theta^{(t)}) \\<br />
& = & 1\times P(z_i=1|x_i,\theta^{(t)}) + 0\times P(z_i=0|x_i,\theta^{(t)}) \\<br />
& = & P(z_i=1|x_i,\theta^{(t)}) \\<br />
P(z_i=1|x_i,\theta^{(t)}) & = & \frac{P(z_i=1,x_i|\theta^{(t)})}{P(x_i|\theta^{(t)})} \\<br />
& = & \frac {P(z_i=1,x_i|\theta^{(t)})}{P(z_i=1,x_i|\theta^{(t)}) + P(z_i=0,x_i|\theta^{(t)})} \\<br />
& = & \frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})}<br />
\end{matrix}</math></center><br />
We can now combine the two steps and we get the expectation <br />
<center><math>E[z_i] =\frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})} </math></center><br />
Using the above results for the estimated parameters in the M-step we can evaluate the parameters at (t+2),(t+3)...until they converge and we get our estimated value for each of the parameters.<br />
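The two steps can be combined into a short program. The following Python sketch (the helper names and the synthetic data are our own, not from the notes) alternates the E-step responsibilities <math>w_i</math> and the closed-form M-step updates derived above:<br />

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_gaussian_mixture(data, alpha, mu1, sigma1, mu2, sigma2, n_iter=100):
    """Alternate the E-step (compute w_i) and the M-step updates derived above."""
    for _ in range(n_iter):
        # E-step: w_i = P(z_i = 1 | x_i, theta^(t)), the expectation of z_i
        w = [alpha * normal_pdf(x, mu1, sigma1)
             / (alpha * normal_pdf(x, mu1, sigma1)
                + (1 - alpha) * normal_pdf(x, mu2, sigma2)) for x in data]
        # M-step: closed-form updates from setting the derivatives to zero
        n, sw = len(data), sum(w)
        alpha = sw / n
        mu1 = sum(wi * x for wi, x in zip(w, data)) / sw
        mu2 = sum((1 - wi) * x for wi, x in zip(w, data)) / (n - sw)
        sigma1 = math.sqrt(sum(wi * (x - mu1) ** 2 for wi, x in zip(w, data)) / sw)
        sigma2 = math.sqrt(sum((1 - wi) * (x - mu2) ** 2 for wi, x in zip(w, data)) / (n - sw))
    return alpha, mu1, sigma1, mu2, sigma2

# Sample 500 points from a known mixture and try to recover its parameters.
random.seed(0)
data = [random.gauss(0, 1) if random.random() < 0.5 else random.gauss(5, 1)
        for _ in range(500)]
print(em_gaussian_mixture(data, 0.4, -1.0, 1.0, 4.0, 1.0))
```

On this synthetic data the estimates come out close to the generating values <math>\alpha=0.5,\ \mu_1=0,\ \mu_2=5,\ \sigma_1=\sigma_2=1</math>.<br />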
<br />
<br />
The mixture model can be summarized as:<br />
<br />
* In each step, a state will be selected according to <math>p(z)</math>. <br />
* Given a state, a data vector is drawn from <math>p(x|z)</math>.<br />
* The value of each state is independent from the previous state.<br />
<br />
A good example of a mixture model can be seen in this example with two coins. Assume that there are two different coins that are not fair. Suppose that the probabilities for each coin are as shown in the table. <br /><br />
{| class="wikitable"<br />
! !! H !! T<br />
|-<br />
! coin1<br />
| 0.3 || 0.7<br />
|-<br />
! coin2<br />
| 0.1 || 0.9<br />
|}<br />
We can choose one coin at random and toss it in the air to see the outcome. Then we place the coin back in the pocket with the other one and once again select one coin at random to toss. The resulting sequence of outcomes, HHTH ... HTTHT, is a mixture model. In this model the probability depends on which coin was used to make the toss and on the probability with which we select each coin. For example, if we were to select coin1 most of the time then we would see more Heads than if we were to choose coin2 most of the time.<br />
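The coin story is easy to simulate. A Python sketch (the sampler name, selection probability <math>\alpha</math>, and seed are our own choices) that picks a coin according to <math>\alpha</math> and then tosses it:<br />

```python
import random

def sample_tosses(n, alpha=0.5, p_heads=(0.3, 0.1), seed=0):
    """Simulate the two-coin mixture: pick a coin with probability alpha, toss it, replace it."""
    rng = random.Random(seed)
    tosses = []
    for _ in range(n):
        coin = 0 if rng.random() < alpha else 1      # latent state: which coin was drawn
        tosses.append('H' if rng.random() < p_heads[coin] else 'T')
    return ''.join(tosses)

seq = sample_tosses(10000)
print(seq[:10], seq.count('H') / len(seq))
```

With <math>\alpha = 0.5</math> the long-run frequency of Heads approaches <math>0.5\times 0.3 + 0.5\times 0.1 = 0.2</math>; choosing coin1 more often pushes it toward 0.3.<br />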
<br />
[[File:dired.png|thumb|right|Fig.1 A directed graph.]]<br />
<br />
=Appendix: Graph Drawing Tools=<br />
===Graphviz===<br />
[http://www.graphviz.org/ Website]<br />
<br />
"Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains."<br />
<ref>http://www.graphviz.org/</ref><br />
<br />
There is a wiki extension developed, called Wikitex, which makes it possible to make use of this package in wiki pages. [http://wikisophia.org/wiki/Wikitex#Graph Here] is an example.<br />
<br />
===AISee===<br />
[http://www.aisee.com/ Website]<br />
<br />
AISee is a commercial graph visualization software. The free trial version has almost all the features of the full version except that it should not be used for commercial purposes.<br />
<br />
===TikZ===<br />
[http://www.texample.net/tikz/ Website]<br />
<br />
"TikZ and PGF are TeX packages for creating graphics programmatically. TikZ is build on top of PGF and allows you to create sophisticated graphics in a rather intuitive and easy manner." <ref><br />
http://www.texample.net/tikz/<br />
</ref><br />
<br />
===Xfig===<br />
"Xfig" is an open source drawing software used to create objects of various geometry. It can be installed on both windows and unix based machines. <br />
[http://www.xfig.org/ Website]</div>Hyeganehhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat946f11&diff=13359stat946f112011-10-27T14:30:30Z<p>Hyeganeh: /* EM Method */</p>
<hr />
<div>==[[f11stat946EditorSignUp| Editor Sign Up]]==<br />
==[[f11Stat946presentation| Sign up for your presentation]]==<br />
==[[f11Stat946ass| Assignments]]==<br />
==Introduction==<br />
===Motivation===<br />
Graphical probabilistic models provide a concise representation of various probabilistic distributions that are found in many<br />
real world applications. Some interesting areas include medical diagnosis, computer vision, language, analyzing gene expression <br />
data, etc. A problem related to medical diagnosis is, "detecting and quantifying the causes of a disease". This question can<br />
be addressed through the graphical representation of relationships between various random variables (both observed and hidden).<br />
This is an efficient way of representing a joint probability distribution.<br />
<br />
Graphical models are excellent tools for reducing the computational load of probabilistic models. Suppose we want to model a binary image. If we have a 256 by 256 image then our distribution function has <math>2^{256\times 256}=2^{65536}</math> outcomes. Even very simple tasks, such as marginalizing such a probability distribution over some variables, can be computationally intractable, and the load grows exponentially with the number of variables. In practice and in real-world applications we generally have some kind of dependency or relation between the variables, and using such information can help us simplify the calculations. For example, for the same problem, if all the image pixels can be assumed to be independent, marginalization can be done easily. Graphs are a good tool for depicting such relations. Using some rules we can represent a probability distribution uniquely by a graph, and then it becomes easier to study the graph instead of the probability distribution function (PDF). We can also take advantage of graph-theoretic tools to design algorithms. Though it may seem simple, this approach simplifies the computations and, as mentioned, helps us solve many problems in different research areas.<br />
<br />
===Notation===<br />
<br />
We will begin with short section about the notation used in these notes.<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
===Example===<br />
Let <math>A = \{1,4\}</math>, so <math>X_A = \{X_1, X_4\}</math>; <math>A</math> is the set of indices for<br />
the r.v. <math>X_A</math>.<br /><br />
Also let <math>B = \{2\},\ X_B = \{X_2\}</math> so we can write<br />
<center><math>P( X_A | X_B ) = P( X_1 = x_1, X_4 = x_4 | X_2 = x_2 ).\,\!</math></center><br />
<br />
===Graphical Models===<br />
Graphical models provide a compact representation of the joint distribution, where the vertices (nodes) <math>V</math> represent random variables and the edges <math>E</math> represent dependencies between the variables. There are two forms of graphical models: directed and undirected. Directed graphical models (Figure 1) consist of arcs and nodes, where an arc indicates that the parent is an explanatory variable for the child. Undirected graphical models (Figure 2) are based on the assumption that two nodes, or two sets of nodes, are conditionally independent given their neighbours[http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 1].<br />
<br />
Similar types of analysis predate the area of probabilistic graphical models and its terminology. ''Bayesian network'' and ''belief network'' are earlier terms used to describe a directed acyclic graphical model. Similarly, ''Markov random field'' (MRF) and ''Markov network'' are earlier terms used to describe an undirected graphical model. Probabilistic graphical models unite some of the theory from these older frameworks and allow for more generalized distributions than were possible in the previous methods.<br />
<br />
[[File:directed.png|thumb|right|Fig.1 A directed graph.]]<br />
[[File:undirected.png|thumb|right|Fig.2 An undirected graph.]]<br />
<br />
We will use graphs in this course to represent the relationship between different random variables. <br />
<br />
====Directed graphical models (Bayesian networks)====<br />
<br />
In the case of directed graphs, the direction of the arrow indicates "causation". This assumption makes these networks useful in cases where we want to model causality, such as computational biology and bioinformatics, where we study the effect of some variables on another variable. For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>A\,\!</math> "causes" <math>B\,\!</math>.<br />
<br />
In this case we must assume that our directed graphs are ''acyclic''. An example of an acyclic graphical model from medicine is shown in Figure 2a.<br />
[[File:acyclicgraph.png|thumb|right|Fig.2a Sample acyclic directed graph.]]<br />
<br />
Exposure to ionizing radiation (such as CT scans, X-rays, etc.) and to the environment might lead to gene mutations that eventually give rise to cancer. Figure 2a can thus be called a causation graph.<br />
<br />
If our causation graph contains a cycle then it would mean that for example:<br />
<br />
* <math>A</math> causes <math>B</math><br />
* <math>B</math> causes <math>C</math><br />
* <math>C</math> causes <math>A</math>, again. <br />
<br />
Clearly, this would confuse the order of the events. An example of a graph with a cycle can be seen in Figure 3. Such a graph could not be used to represent causation. The graph in Figure 4 does not have cycle and we can say that the node <math>X_1</math> causes, or affects, <math>X_2</math> and <math>X_3</math> while they in turn cause <math>X_4</math>.<br />
<br />
[[File:cyclic.png|thumb|right|Fig.3 A cyclic graph.]]<br />
[[File:acyclic.png|thumb|right|Fig.4 An acyclic graph.]]<br />
<br />
In directed acyclic graphical models each vertex represents a random variable; a random variable associated with one vertex is distinct from the random variables associated with other vertices. Consider the following example that uses boolean random variables. It is important to note that the variables need not be boolean and can indeed be discrete over a range or even continuous.<br />
<br />
Speaking about random variables, we can now refer to the relationship between random variables in terms of dependence. Therefore, the direction of the arrow indicates "conditional dependence". For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>B\,\!</math> "is dependent on" <math>A\,\!</math>.<br />
<br />
Note that if we do not have any conditional independence, the corresponding graph will be complete, i.e., all possible edges will be present; whereas if we have full independence, our graph will have no edges. Between these two extreme cases there exists a large class of graphs. Graphical models are most useful when the graph is sparse, i.e., only a small number of edges exist. The topology of the graph is important, and later we will see examples where graph-theoretic tools can be used to solve probabilistic problems. This representation also makes it easier to model causality between variables in real-world phenomena.<br />
<br />
====Example====<br />
<br />
In this example we will consider the possible causes for wet grass. <br />
<br />
The wet grass could be caused by rain, or a sprinkler. Rain can be caused by clouds. On the other hand one can not say that clouds cause the use of a sprinkler. However, the causation exists because the presence of clouds does affect whether or not a sprinkler will be used. If there are more clouds there is a smaller probability that one will rely on a sprinkler to water the grass. As we can see from this example the relationship between two variables can also act like a negative correlation. The corresponding graphical model is shown in Figure 5.<br />
<br />
[[File:wetgrass.png|thumb|right|Fig.5 The wet grass example.]]<br />
<br />
This directed graph shows the relation between the 4 random variables. If we have<br />
the joint probability <math>P(C,R,S,W)</math>, then we can answer many queries about this<br />
system.<br />
<br />
This all seems very simple at first, but we must consider the fact that in the discrete case the joint probability function grows exponentially with the number of variables. If we consider the wet grass example once more, we can see that we need to define <math>2^4 = 16</math> different probabilities for this simple example. The table below, which contains all of the probabilities and their corresponding boolean values for each random variable, is called an ''interaction table''.<br />
<br />
'''Example:'''<br />
<center><math>\begin{matrix}<br />
P(C,R,S,W):\\<br />
p_1\\<br />
p_2\\<br />
p_3\\<br />
.\\<br />
.\\<br />
.\\<br />
p_{16} \\ \\<br />
\end{matrix}</math></center><br />
<br /><br /><br />
<center><math>\begin{matrix}<br />
~~~ & C & R & S & W \\<br />
& 0 & 0 & 0 & 0 \\<br />
& 0 & 0 & 0 & 1 \\<br />
& 0 & 0 & 1 & 0 \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& 1 & 1 & 1 & 1 \\<br />
\end{matrix}</math></center><br />
<br />
Now consider an example where there are not 4 such random variables but 400. The interaction table would become too large to manage. In fact, it would require <math>2^{400}</math> rows! The purpose of the graph is to help avoid this intractability by considering only the variables that are directly related. In the wet grass example Sprinkler (S) and Rain (R) are not directly related. <br />
<br />
To solve the intractability problem we need to consider the way those relationships are represented in the graph. Let us define the following parameters. For each vertex <math>i \in V</math>,<br />
<br />
* <math>\pi_i</math>: is the set of parents of <math>i</math> <br />
** ex. <math>\pi_R = C</math> (the parent of <math>R</math> is <math>C</math>) <br />
* <math>f_i(x_i, x_{\pi_i})</math>: is a function of <math>x_i</math> and <math>x_{\pi_i}</math> for which it is true that:<br />
** <math>f_i</math> is nonnegative for all <math>i</math><br />
** <math>\displaystyle\sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math><br />
<br />
'''Claim''': There is a family of probability functions <math> P(X_V) = \prod_{i=1}^n f_i(x_i, x_{\pi_i})</math> where this function is nonnegative, and<br />
<center><math><br />
\sum_{x_1}\sum_{x_2}\cdots\sum_{x_n} P(X_V) = 1<br />
</math></center><br />
<br />
To show the power of this claim we can verify it for our wet grass example:<br />
<center><math>\begin{matrix}<br />
P(X_V) &=& P(C,R,S,W) \\<br />
&=& f(C) f(R,C) f(S,C) f(W,S,R)<br />
\end{matrix}</math></center><br />
<br />
We want to show that<br />
<center><math>\begin{matrix}<br />
\sum_C\sum_R\sum_S\sum_W P(C,R,S,W) & = &\\<br />
\sum_C\sum_R\sum_S\sum_W f(C) f(R,C)<br />
f(S,C) f(W,S,R) <br />
& = & 1.<br />
\end{matrix}</math></center><br />
<br />
Consider factors <math>f(C)</math>, <math>f(R,C)</math>, <math>f(S,C)</math>: they do not depend on <math>W</math>, so we<br />
can write this all as<br />
<center><math>\begin{matrix}<br />
& & \sum_C\sum_R\sum_S f(C) f(R,C) f(S,C) \cancelto{1}{\sum_W f(W,S,R)} \\<br />
& = & \sum_C\sum_R f(C) f(R,C) \cancelto{1}{\sum_S f(S,C)} \\<br />
& = & \cancelto{1}{\sum_C f(C)} \cancelto{1}{\sum_R f(R,C)} \\<br />
& = & 1<br />
\end{matrix}</math></center><br />
<br />
since we had already set <math>\displaystyle \sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math>.<br />
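The claim can also be checked numerically: the product of any set of normalized factors sums to 1. A Python sketch (the factor tables are arbitrary made-up numbers, not from the notes):<br />

```python
import itertools
import random

# Build arbitrary normalized factor tables (all numbers here are made up).
rng = random.Random(42)

def random_conditional(n_parents):
    """For each parent configuration, a normalized distribution over {0, 1}."""
    table = {}
    for cfg in itertools.product([0, 1], repeat=n_parents):
        p = rng.random()
        table[cfg] = {0: p, 1: 1 - p}
    return table

fC = random_conditional(0)   # f(C)
fR = random_conditional(1)   # f(R, C), normalized over R for each C
fS = random_conditional(1)   # f(S, C), normalized over S for each C
fW = random_conditional(2)   # f(W, S, R), normalized over W for each (S, R)

# Sum the product over all 2^4 joint assignments.
total = sum(fC[()][c] * fR[(c,)][r] * fS[(c,)][s] * fW[(s, r)][w]
            for c, r, s, w in itertools.product([0, 1], repeat=4))
print(total)  # 1.0 up to floating-point error
```

Regenerating the tables with any other seed leaves the sum at 1, which is exactly what the telescoping argument above shows.<br />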
<br />
Let us consider another example with a different directed graph. <br /><br />
'''Example:'''<br /><br />
Consider the simple directed graph in Figure 6.<br />
<br />
[[File:1234.png|thumb|right|Fig.6 Simple 4 node graph.]]<br />
<br />
Assume that we would like to calculate the following: <math> p(x_3|x_2) </math>. We know that we can write the joint probability as:<br />
<center><math> p(x_1,x_2,x_3,x_4) = f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \,\!</math></center><br />
<br />
We can also make use of Bayes' Rule here: <br />
<br />
<center><math>p(x_3|x_2) = \frac{p(x_2,x_3)}{ p(x_2)}</math></center><br />
<br />
<center><math>\begin{matrix}<br />
p(x_2,x_3) & = & \sum_{x_1} \sum_{x_4} p(x_1,x_2,x_3,x_4) ~~~~ \hbox{(marginalization)} \\<br />
& = & \sum_{x_1} \sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1) f(x_3,x_2) \cancelto{1}{\sum_{x_4}f(x_4,x_3)} \\<br />
& = & f(x_3,x_2) \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
We also need<br />
<center><math>\begin{matrix}<br />
p(x_2) & = & \sum_{x_1}\sum_{x_3}\sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2)<br />
f(x_4,x_3) \\<br />
& = & \sum_{x_1}\sum_{x_3} f(x_1) f(x_2,x_1) f(x_3,x_2) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
Thus,<br />
<center><math>\begin{matrix}<br />
p(x_3|x_2) & = & \frac{ f(x_3,x_2) \sum_{x_1} f(x_1)<br />
f(x_2,x_1)}{ \sum_{x_1} f(x_1) f(x_2,x_1)} \\<br />
& = & f(x_3,x_2).<br />
\end{matrix}</math></center><br />
<br />
'''Theorem 1.'''<br />
<center><math>f_i(x_i,x_{\pi_i}) = p(x_i|x_{\pi_i}).\,\!</math></center><br />
<center><math> \therefore \ P(X_V) = \prod_{i=1}^n p(x_i|x_{\pi_i}).\,\!</math></center><br />
<br />
In our simple graph, the joint probability can be written as <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1)p(x_2|x_1) p(x_3|x_2) p(x_4|x_3).\,\!</math></center><br />
<br />
Instead, had we used the chain rule we would have obtained a far more complex equation: <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1) p(x_2|x_1)p(x_3|x_2,x_1) p(x_4|x_3,x_2,x_1).\,\!</math></center><br />
<br />
The ''Markov Property'', or ''Memoryless Property'', holds when a variable <math>X_i</math> is affected only by <math>X_j</math>, so that <math>X_i</math> given <math>X_j</math> is independent of every other random variable. In our example the history of <math>x_4</math> is completely determined by <math>x_3</math>. <br /><br />
By simply applying the Markov Property to the chain-rule formula we would also have obtained the same result.<br />
<br />
Now let us consider the joint probability of the following six-node example found in Figure 7.<br />
<br />
[[File:ClassicExample1.png|thumb|right|Fig.7 Six node example.]]<br />
<br />
If we use Theorem 1 it can be seen that the joint probability density function for Figure 7 can be written as follows: <br />
<center><math> P(X_1,X_2,X_3,X_4,X_5,X_6) = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) \,\!</math></center><br />
<br />
Once again, we can apply the Chain Rule and then the Markov Property and arrive at the same result.<br />
<br />
<center><math>\begin{matrix}<br />
&& P(X_1,X_2,X_3,X_4,X_5,X_6) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_2,X_1)P(X_4|X_3,X_2,X_1)P(X_5|X_4,X_3,X_2,X_1)P(X_6|X_5,X_4,X_3,X_2,X_1) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) <br />
\end{matrix}</math></center><br />
<br />
===Independence=== <br />
<br />
====Marginal independence====<br />
We can say that <math>X_A</math> is marginally independent of <math>X_B</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B : & & \\<br />
P(X_A,X_B) & = & P(X_A)P(X_B) \\<br />
P(X_A|X_B) & = & P(X_A) <br />
\end{matrix}</math></center><br />
<br />
====Conditional independence====<br />
We can say that <math>X_A</math> is conditionally independent of <math>X_B</math> given <math>X_C</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B | X_C : & & \\<br />
P(X_A,X_B | X_C) & = & P(X_A|X_C)P(X_B|X_C) \\<br />
P(X_A|X_B,X_C) & = & P(X_A|X_C) <br />
\end{matrix}</math></center><br />
Note: Both equations are equivalent.<br />
'''Aside:''' Before we move on, we first define the following terms:<br />
# <math>I</math> is defined as an ordering for the nodes in the graph.<br />
# For each <math>i \in V</math>, <math>V_i</math> is defined as the set of all nodes that appear earlier than <math>i</math> in the ordering, excluding its parents <math>\pi_i</math>.<br />
<br />
Let us consider the example of the six node figure given above (Figure 7). We can define <math>I</math> as follows: <br />
<center><math>I = \{1,2,3,4,5,6\} \,\!</math></center><br />
We can then easily compute <math>V_i</math> for say <math>i=3,6</math>. <br /><br />
<center><math> V_3 = \{2\}, V_6 = \{1,3,4\}\,\!</math></center> <br />
while <math>\pi_i</math> for <math> i=3,6</math> will be. <br /><br />
<center><math> \pi_3 = \{1\}, \pi_6 = \{2,5\}\,\!</math></center> <br />
<br />
We would be interested in finding the conditional independence between random variables in this graph. We know <math>X_i \perp X_{v_i} | X_{\pi_i}</math> for each <math>i</math>. In other words, given its parents the node is independent of all earlier nodes. So:<br /><br />
<math>X_1 \perp \phi | \phi</math>, <br /><br />
<math>X_2 \perp \phi | X_1</math>, <br /><br />
<math>X_3 \perp X_2 | X_1</math>, <br /><br />
<math>X_4 \perp \{X_1,X_3\} | X_2</math>, <br /><br />
<math>X_5 \perp \{X_1,X_2,X_4\} | X_3</math>, <br /><br />
<math>X_6 \perp \{X_1,X_3,X_4\} | \{X_2,X_5\}</math> <br /><br />
To illustrate why this is true we can take a simple example. Show that:<br />
<center><math>P(X_4|X_1,X_2,X_3) = P(X_4|X_2)\,\!</math></center><br />
<br />
Proof: first, we know <br />
<math>P(X_1,X_2,X_3,X_4,X_5,X_6)<br />
= P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2)\,\!</math><br />
<br />
then<br />
<center><math>\begin{matrix}<br />
P(X_4|X_1,X_2,X_3) & = & \frac{P(X_1,X_2,X_3,X_4)}{P(X_1,X_2,X_3)}\\<br />
& = & \frac{ \sum_{X_5} \sum_{X_6} P(X_1,X_2,X_3,X_4,X_5,X_6)}{ \sum_{X_4} \sum_{X_5} \sum_{X_6}P(X_1,X_2,X_3,X_4,X_5,X_6)}\\<br />
& = & \frac{P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)}{P(X_1)P(X_2|X_1)P(X_3|X_1)}\\<br />
& = & P(X_4|X_2)<br />
\end{matrix}</math></center><br />
<br />
The other conditional independences can be proven through a similar process.<br />
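This conditional independence can also be verified numerically for arbitrary conditional tables. A Python sketch for the six-node graph of Figure 7 (the probability tables are random, not from the notes):<br />

```python
import itertools
import random

rng = random.Random(1)

def cpt(n_parents):
    """A random conditional probability table for a binary node given binary parents."""
    table = {}
    for cfg in itertools.product([0, 1], repeat=n_parents):
        p = rng.random()
        table[cfg] = {0: p, 1: 1 - p}
    return table

# One CPT per node, following the factorization for Figure 7.
p1, p2, p3, p4, p5, p6 = cpt(0), cpt(1), cpt(1), cpt(1), cpt(1), cpt(2)

def joint(x1, x2, x3, x4, x5, x6):
    return (p1[()][x1] * p2[(x1,)][x2] * p3[(x1,)][x3]
            * p4[(x2,)][x4] * p5[(x3,)][x5] * p6[(x5, x2)][x6])

def prob(fixed):
    """Marginal probability that the nodes in `fixed` ({index: value}) take those values."""
    return sum(joint(*xs) for xs in itertools.product([0, 1], repeat=6)
               if all(xs[i - 1] == v for i, v in fixed.items()))

# P(X4 = 1 | X1 = 0, X2 = 1, X3 = 0) versus P(X4 = 1 | X2 = 1): they agree.
lhs = prob({1: 0, 2: 1, 3: 0, 4: 1}) / prob({1: 0, 2: 1, 3: 0})
rhs = prob({2: 1, 4: 1}) / prob({2: 1})
print(lhs, rhs)
```

Both conditionals reduce to the table entry <math>P(X_4=1|X_2=1)</math>, no matter which random tables are drawn.<br />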
<br />
====Sampling====<br />
Even though graphical models greatly facilitate obtaining the joint probability, exact inference is not always feasible. Exact inference is feasible only in small to medium-sized networks; in large networks it takes far too long. Therefore, we resort to approximate inference techniques, which are much faster and usually give quite good results.<br />
<br />
In sampling, random samples are generated and the values of interest are computed from those samples rather than from the original distribution.<br />
<br />
As input you have a Bayesian network with a set of nodes <math>X\,\!</math>. A sample may include all variables (except the evidence <math>E</math>) or a subset. Sampling schemes dictate how to generate samples (tuples). Ideally, samples are distributed according to <math>P(X|E)\,\!</math><br />
<br />
Some sampling algorithms:<br />
* Forward Sampling<br />
* Likelihood weighting<br />
* Gibbs Sampling (MCMC)<br />
** Blocking<br />
** Rao-Blackwellised<br />
* Importance Sampling<br />
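As a concrete illustration of the first item, forward sampling draws each node after its parents and, in its simplest evidence-handling form, rejects samples inconsistent with the evidence. A Python sketch on the wet-grass network (all numeric probabilities below are illustrative, not from the notes):<br />

```python
import random

rng = random.Random(0)

# Forward sampling on the wet-grass network: sample each node after its parents.
# All numeric probabilities below are illustrative, not taken from the notes.
def forward_sample():
    c = rng.random() < 0.5                           # P(C = 1)
    r = rng.random() < (0.8 if c else 0.1)           # P(R = 1 | C)
    s = rng.random() < (0.1 if c else 0.5)           # P(S = 1 | C)
    w = rng.random() < (0.9 if (r or s) else 0.05)   # P(W = 1 | R, S)
    return c, r, s, w

# Estimate P(R = 1 | W = 1) by rejecting samples inconsistent with the evidence.
samples = [forward_sample() for _ in range(20000)]
consistent = [smp for smp in samples if smp[3]]
print(sum(smp[1] for smp in consistent) / len(consistent))
```

Rejection is wasteful when the evidence is unlikely, which is what motivates the likelihood-weighting and MCMC schemes listed above.<br />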
<br />
==Bayes Ball== <br />
The Bayes Ball algorithm can be used to determine whether two random variables represented in a graph are independent. The algorithm can show either that two nodes in a graph are independent OR that they are not necessarily independent; it cannot show that two nodes are dependent. In other words, it provides rules that enable us to perform this task using the graph alone, without the need to use the probability distributions. The algorithm will be discussed further in later parts of this section. <br />
<br />
===Canonical Graphs===<br />
In order to understand the Bayes Ball algorithm we need to first introduce 3 canonical graphs. Since our graphs are acyclic, we can represent them using these 3 canonical graphs. <br />
<br />
====Markov Chain (also called serial connection)====<br />
In the following graph (Figure 8), <math>X</math> is independent of <math>Z</math> given <math>Y</math>. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Markov.png|thumb|right|Fig.8 Markov chain.]]<br />
<br />
We can prove this independence: <br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\ <br />
& = & \frac{P(X)P(Y|X)P(Z|Y)}{P(X)P(Y|X)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
Where<br />
<br />
<center><math>\begin{matrix}<br />
P(X,Y) & = & \displaystyle \sum_Z P(X,Y,Z) \\<br />
& = & \displaystyle \sum_Z P(X)P(Y|X)P(Z|Y) \\<br />
& = & P(X)P(Y | X) \displaystyle \sum_Z P(Z|Y) \\<br />
& = & P(X)P(Y | X)\\<br />
\end{matrix}</math></center><br />
<br />
Markov chains are an important class of distributions with applications in communications, information theory and image processing. They are suitable for modelling memory in a phenomenon. For example, suppose we want to study the frequency of appearance of English letters in a text. Most likely, when "q" appears, the next letter will be "u"; this shows a dependency between these letters. Markov chains are suitable for modelling this kind of relation. <br />
[[File:Markovexample.png|thumb|right|Fig.8a Example of a Markov chain.]]<br />
Markov chains play a significant role in biological applications. They are widely used in the study of carcinogenesis (the initiation of cancer formation). A gene has to undergo several mutations before it becomes cancerous; this process can be modelled with a Markov chain. An example is given in Figure 8a, which shows only two gene mutations.<br />
<br />
====Hidden Cause (diverging connection)====<br />
In the Hidden Cause case we can say that X is independent of Z given Y. In this case Y is the hidden cause and if it is known then Z and X are considered independent. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Hidden.png|thumb|right|Fig.9 Hidden cause graph.]]<br />
<br />
The proof of the independence: <br />
<br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\<br />
& = & \frac{P(Y)P(X|Y)P(Z|Y)}{P(Y)P(X|Y)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
The Hidden Cause case is best illustrated with an example: <br /><br />
<br />
[[File:plot44.png|thumb|right|Fig.10 Hidden cause example.]]<br />
<br />
In Figure 10 it can be seen that both "Shoe size" and "Grey hair" depend on the age of a person. If "Age" is not in the picture, "Shoe size" and "Grey hair" are dependent in some sense: without the age information, we must conclude that those with a large shoe size also have a greater chance of having grey hair. However, when "Age" is observed, there is no dependence between "Shoe size" and "Grey hair", because each can be deduced from the "Age" variable alone.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Finally, we look at the third type of canonical graph:<br />
''Explaining-Away Graphs''. This type of graph arises when a<br />
phenomenon has multiple explanations. Here, the conditional<br />
independence statement is actually a statement of marginal<br />
independence: <math>X \perp Z</math>. This type of graph is also called a "V-structure" or "V-shape" because of its illustration (Fig. 11). <br />
<br />
[[File:ExplainingAway.png|thumb|right|Fig.11 The missing edge between node X and node Z implies that<br />
there is a marginal independence between the two: <math>X \perp Z</math>.]]<br />
<br />
In these types of scenarios, variables X and Z are independent.<br />
However, once the third variable Y is observed, X and Z become<br />
dependent (Fig. 11).<br />
<br />
To clarify these concepts, suppose Bob and Mary are supposed to<br />
meet for a noontime lunch. Consider the following events:<br />
<br />
<center><math><br />
late =\begin{cases}<br />
1, & \hbox{if Mary is late}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
aliens =\begin{cases}<br />
1, & \hbox{if aliens kidnapped Mary}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
watch =\begin{cases}<br />
1, & \hbox{if Bob's watch is incorrect}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
If Mary is late, then she could have been kidnapped by aliens.<br />
Alternatively, Bob may have forgotten to adjust his watch for<br />
daylight saving time, making him early. Clearly, these two<br />
explanations are marginally independent of each other. Now, consider the following<br />
probabilities:<br />
<br />
<center><math>\begin{matrix}<br />
P( aliens = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1, watch = 0 )<br />
\end{matrix}</math></center><br />
<br />
We expect <math>P( aliens = 1 ) < P( aliens = 1 ~|~ late = 1 )</math>, since learning that Mary is late raises the probability that she was kidnapped by aliens.<br />
Similarly, we expect <math>P( aliens = 1 ~|~ late = 1 ) < P( aliens = 1 ~|~ late = 1, watch = 0 )</math>: if Bob's watch is correct, then the watch cannot explain Mary's apparent lateness, so the alien explanation becomes more likely. Since<br />
<math>P( aliens = 1 ~|~ late = 1 ) \neq P( aliens = 1 ~|~ late = 1, watch = 0 )</math>, ''aliens'' and<br />
''watch'' are not independent given ''late''. To summarize,<br />
* If we do not observe ''late'', then ''aliens'' <math>~\perp~ watch</math> (<math>X~\perp~ Z</math>)<br />
* If we do observe ''late'', then ''aliens'' <math> ~\cancel{\perp}~ watch ~|~ late</math> (<math>X ~\cancel{\perp}~ Z ~|~ Y</math>)<br />
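The explaining-away effect can be demonstrated numerically. The sketch below uses illustrative (made-up) probabilities: ''aliens'' and ''watch'' are marginally independent by construction, but once ''late'' is observed, learning that the watch was correct raises the probability of the alien explanation:<br />

```python
# Explaining-away demo with illustrative numbers.
p_a = {0: 0.999, 1: 0.001}            # P(aliens)
p_w = {0: 0.9, 1: 0.1}                # P(watch is incorrect)
# P(late = 1 | aliens, watch): either cause makes Mary appear late
p_l1 = {(0, 0): 0.01, (0, 1): 0.9, (1, 0): 0.99, (1, 1): 0.99}

def joint(a, w, l):
    pl = p_l1[(a, w)]
    return p_a[a] * p_w[w] * (pl if l == 1 else 1 - pl)

# Marginally, P(aliens = 1) is unaffected by watch (independence by design)
p_a1 = sum(joint(1, w, l) for w in (0, 1) for l in (0, 1))
assert abs(p_a1 - p_a[1]) < 1e-12

def p_a1_given(late, watch=None):
    # P(aliens = 1 | late [, watch]) by summing the joint
    num = sum(joint(1, w, late) for w in (0, 1)
              if watch is None or w == watch)
    den = sum(joint(a, w, late) for a in (0, 1) for w in (0, 1)
              if watch is None or w == watch)
    return num / den

# Observing lateness raises P(aliens); additionally learning the watch
# was correct (watch = 0) raises it further: the causes compete.
assert p_a1_given(1, watch=0) > p_a1_given(1) > p_a[1]
```
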
<br />
===Bayes Ball Algorithm===<br />
<br />
'''Goal:''' We wish to determine whether a given conditional<br />
statement such as <math>X_{A} ~\perp~ X_{B} ~|~ X_{C}</math> is true given a directed graph.<br />
<br />
The algorithm is as follows:<br />
<br />
# Shade nodes, <math>~X_{C}~</math>, that are conditioned on, i.e. they have been observed.<br />
# Assuming that the initial position of the ball is <math>~X_{A}~</math>: <br />
# If the ball cannot reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> must be conditionally independent.<br />
# If the ball can reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> are not necessarily independent.<br />
<br />
The biggest challenge in the ''Bayes Ball Algorithm'' is to<br />
determine what happens to a ball going from node X to node Z as it<br />
passes through node Y. The ball could continue its route to Z or<br />
it could be blocked. It is important to note that the balls are<br />
allowed to travel in any direction, independent of the direction<br />
of the edges in the graph.<br />
<br />
We use the canonical graphs previously studied to determine the<br />
route of a ball traveling through a graph. Using these three<br />
graphs, we establish the Bayes ball rules which can be extended for more<br />
graphical models.<br />
<br />
====Markov Chain (serial connection)====<br />
[[File:BB_Markov.png|thumb|right|Fig.12 (a) When the middle node is shaded, the ball is blocked. (b) When the middle node is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling from X to Z or from Z to X will be blocked at<br />
node Y if this node is shaded. Alternatively, if Y is unshaded,<br />
the ball will pass through.<br />
<br />
In (Fig. 12(a)), X and Z are conditionally<br />
independent ( <math>X ~\perp~ Z ~|~ Y</math> ) while in<br />
(Fig.12(b)) X and Z are not necessarily<br />
independent.<br />
<br />
====Hidden Cause (diverging connection)====<br />
[[File:BB_Hidden.png|thumb|right|Fig.13 (a) When the middle node is shaded, the ball is blocked. (b) When the middle node is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling through Y will be blocked at Y if it is shaded.<br />
If Y is unshaded, then the ball passes through.<br />
<br />
(Fig. 13(a)) demonstrates that X and Z are<br />
conditionally independent when Y is shaded.<br />
<br />
====Explaining-Away (converging connection)====<br />
<br />
Unlike the last two cases, where the Bayes ball rule was intuitive, here a ball traveling through Y is blocked when Y is ''unshaded''. If Y is<br />
shaded, then the ball passes through. Hence, X and Z are<br />
conditionally independent when Y is unshaded.<br />
<br />
[[File:BB_ExplainingAway.png|thumb|right|Fig.14 (a) When the middle node is shaded, the ball passes through Y. (b) When the middle node is unshaded, the ball is blocked.]]<br />
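These three rules can be collected into a small d-separation checker. The function below is an illustrative sketch (not part of the original notes): it enumerates undirected paths in a DAG and applies the serial, diverging, and converging rules, including the refinement that an observed descendant of a collider also unblocks it. Path enumeration makes it practical only for small graphs:<br />

```python
def d_separated(parents, x, y, observed):
    """True iff x and y are d-separated given `observed` in the DAG
    described by `parents` (node -> list of its parents)."""
    observed = set(observed)
    children = {}
    for node, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(node)

    def descendants(n):
        out, stack = set(), [n]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def neighbours(n):
        return set(parents.get(n, ())) | children.get(n, set())

    def blocked(a, b, c):
        if b in children.get(a, set()) and b in children.get(c, set()):
            # explaining-away: a collider blocks unless it (or one of
            # its descendants) is observed
            return not ({b} | descendants(b)) & observed
        # serial / diverging connection blocks iff the middle node is shaded
        return b in observed

    stack = [[x]]                      # depth-first search over simple paths
    while stack:
        path = stack.pop()
        if path[-1] == y:
            if all(not blocked(path[i], path[i + 1], path[i + 2])
                   for i in range(len(path) - 2)):
                return False           # an unblocked path was found
            continue
        for n in neighbours(path[-1]) - set(path):
            stack.append(path + [n])
    return True

# the alarm network: burglary -> alarm <- earthquake, alarm -> report
parents = {'alarm': ['burglary', 'earthquake'], 'report': ['alarm']}
assert d_separated(parents, 'burglary', 'earthquake', set())
assert not d_separated(parents, 'burglary', 'earthquake', {'report'})
```

The two assertions at the end reproduce the alarm example discussed below: the collider blocks the path when nothing is observed, and observing the descendant ''report'' opens it.<br />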
<br />
===Bayes Ball Examples===<br />
====Example 1==== <br />
In this first example, we wish to identify the behaviour of leaves in graphical models using two-node graphs. Let a ball go<br />
from X to Y in each two-node graph. To employ the Bayes ball method described above, we implicitly add one extra node to the two-node structure, since the Bayes ball rules were introduced for three-node configurations. We add the third node exactly symmetric to node X with respect to node Y. For example, in (Fig. 15(a)) we can think of a hidden node to the right of node Y, with a hidden arrow from the hidden node to Y. Applying the Bayes ball method, a ball thrown from X cannot pass through Y, and thus it is blocked. On the contrary, following the same rule in (Fig. 15(b)), if there were a hidden node to the right of Y, a ball could pass from X to that hidden node according to the explaining-away structure. Of course, there is no real node, and in this case we conventionally say that the ball is bounced back to node X. <br />
<br />
[[File:TwoNodesExample.png|thumb|right|Fig.15 (a)The ball is blocked at Y. (b)The ball passes through Y. (c)The ball passes through Y. (d) The ball is blocked at Y.]]<br />
<br />
Finally, for the last two graphs, we used the rules of the ''Hidden Cause Canonical Graph'' (Fig. 13). In (c), the ball passes through<br />
Y while in (d), the ball is blocked at Y.<br />
<br />
====Example 2====<br />
Suppose your home is equipped with an alarm system. There are two<br />
possible causes for the alarm to ring:<br />
* Your house is being burglarized<br />
* There is an earthquake<br />
<br />
Hence, we define the following events:<br />
<br />
<center><math><br />
burglary =\begin{cases}<br />
1, & \hbox{if your house is being burglarized}, \\<br />
0, & \hbox{if your house is not being burglarized}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
earthquake =\begin{cases}<br />
1, & \hbox{if there is an earthquake}, \\<br />
0, & \hbox{if there is no earthquake}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
alarm =\begin{cases}<br />
1, & \hbox{if your alarm is ringing}, \\<br />
0, & \hbox{if your alarm is off}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
report =\begin{cases}<br />
1, & \hbox{if a police report has been written}, \\<br />
0, & \hbox{if no police report has been written}.<br />
\end{cases}<br />
</math></center><br />
<br />
<br />
The ''burglary'' and ''earthquake'' events are independent<br />
if the alarm does not ring. However, if the alarm does ring, then<br />
the ''burglary'' and the ''earthquake'' events are not<br />
necessarily independent. Also, if the alarm rings, then it is<br />
more likely that a police report will be issued.<br />
<br />
We can use the ''Bayes Ball Algorithm'' to deduce conditional<br />
independence properties from the graph. Firstly, consider figure<br />
(16(a)) and assume we are trying to determine<br />
whether there is conditional independence between the<br />
''burglary'' and ''earthquake'' events. In figure<br />
(16(a)), a ball starting at the ''burglary''<br />
event is blocked at the ''alarm'' node.<br />
<br />
[[File:AlarmExample1.PNG|thumb|right|Fig.16 If we only consider the events ''burglary'', ''earthquake'', and ''alarm'', we find that a ball traveling from ''burglary'' to ''earthquake'' would be blocked at the ''alarm'' node. However, if we also consider the ''report''<br />
node, we can find a path between ''burglary'' and ''earthquake''.]]<br />
<br />
Nonetheless, this does not prove that the ''burglary'' and<br />
''earthquake'' events are independent. Indeed,<br />
(Fig. 16(b)) disproves this as we have found an<br />
alternate path from ''burglary'' to ''earthquake'' passing<br />
through ''report''. It follows that <math>burglary<br />
~\cancel{\amalg}~ earthquake ~|~ report</math><br />
<br />
====Example 3====<br />
<br />
Referring to figure (Fig. 17), we wish to determine<br />
whether the following conditional probabilities are true:<br />
<br />
<center><math>\begin{matrix}<br />
X_{1} ~\perp~ X_{3} ~|~ X_{2} \\<br />
X_{1} ~\perp~ X_{5} ~|~ \{X_{3},X_{4}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:LineExample1.png|thumb|right|Fig.17 Simple Markov Chain graph.]]<br />
<br />
To determine whether the first statement is true, we shade node <math>X_{2}</math>. This blocks balls traveling from<br />
<math>X_{1}</math> to <math>X_{3}</math> and proves that the first statement is valid.<br />
<br />
After shading nodes <math>X_{3}</math> and <math>X_{4}</math> and applying the ''Bayes Ball Algorithm'', we find that a ball travelling from <math>X_{1}</math> to <math>X_{5}</math> is blocked at <math>X_{3}</math>. Similarly, a ball going from <math>X_{5}</math> to <math>X_{1}</math> is blocked at <math>X_{4}</math>. This proves that the second statement also holds.<br />
<br />
====Example 4====<br />
[[File:ClassicExample1.png|thumb|right|Fig.18 Directed graph.]]<br />
<br />
Consider figure (Fig. 18). Using the ''Bayes Ball Algorithm'' we wish to determine if each of the following<br />
statements are valid:<br />
<br />
<center><math>\begin{matrix}<br />
X_{4} ~\perp~ \{X_{1},X_{3}\} ~|~ X_{2} \\<br />
X_{1} ~\perp~ X_{6} ~|~ \{X_{2},X_{3}\} \\<br />
X_{2} ~\perp~ X_{3} ~|~ \{X_{1},X_{6}\}<br />
\end{matrix}</math></center><br />
<br />
[[File:ClassicExample2.PNG|thumb|right|Fig.19 (a) A ball cannot pass through <math>X_{2}</math> or <math>X_{6}</math>. (b) A ball cannot pass through <math>X_{2}</math> or <math>X_{3}</math>. (c) A ball can pass from <math>X_{2}</math> to <math>X_{3}</math>.]]<br />
<br />
To disprove the first statement, we must find a path from <math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math> when <math>X_{2}</math> is shaded (refer to Fig. 19(a)). Since there is no route from<br />
<math>X_{4}</math> to <math>X_{1}</math> or <math>X_{3}</math>, we conclude that the first statement is<br />
true.<br />
<br />
Similarly, we can show that there does not exist a path between<br />
<math>X_{1}</math> and <math>X_{6}</math> when <math>X_{2}</math> and <math>X_{3}</math> are shaded (Refer to<br />
Fig.19(b)). Hence, the second statement is true.<br />
<br />
Finally, (Fig. 19(c)) shows that there is a<br />
route from <math>X_{2}</math> to <math>X_{3}</math> when <math>X_{1}</math> and <math>X_{6}</math> are shaded.<br />
This proves that the third statement is false.<br />
<br />
'''Theorem 2.'''<br /><br />
Define <math>p(x_{v}) = \prod_{i=1}^{n}{p(x_{i} ~|~ x_{\pi_{i}})}</math> to be the factorization of the joint distribution of a directed graph into a product of local conditional probabilities.<br /><br />
Let <math>D_{1} = \{ p(x_{v}) = \prod_{i=1}^{n}{p(x_{i} ~|~ x_{\pi_{i}})}\}</math> <br /><br />
Let <math>D_{2} = \{ p(x_{v}): p(x_{v})</math> satisfies all conditional independence statements associated with the graph <math>\}</math>.<br /><br />
Then <math>D_{1} = D_{2}</math>.<br />
<br />
====Example 5====<br />
<br />
Given the Bayesian network in (Fig. 18), determine whether the following statements are true or false.<br />
<br />
a.) <math>x_4 \perp \{x_1,x_3\} ~|~ x_2</math> <br />
<br />
Ans. True<br />
<br />
b.) <math>x_1 \perp x_6 ~|~ \{x_2,x_3\}</math> <br />
<br />
Ans. True<br />
<br />
c.) <math>x_2 \perp x_3 ~|~ \{x_1,x_6\}</math> <br />
<br />
Ans. False<br />
<br />
== Undirected Graphical Model ==<br />
<br />
Generally, graphical models are divided into two major classes: directed graphs and undirected graphs. Directed graphs and their characteristics were described previously. In this section we discuss undirected graphical models, which are also known as Markov random fields. In some applications there are relations between variables, but these relations are bilateral and do not involve causality. For example, consider a natural image. In natural images the value of a pixel is correlated with neighbouring pixel values, but this is a bilateral relation rather than a causal one. Markov random fields are suitable for modelling such processes and have found applications in fields such as vision and image processing. We can define an undirected graphical model with a graph <math> G = (V, E)</math> where <math> V </math> is a set of vertices corresponding to a set of random variables and <math> E </math> is a set of undirected edges, as shown in (Fig.20).<br />
<br />
==== Conditional independence ====<br />
<br />
For directed graphs, the Bayes ball method was defined to determine the conditional independence properties of a given graph. We can also employ the Bayes ball algorithm to examine the conditional independence properties of undirected graphs. Here the Bayes ball rule is simpler and more intuitive. Considering (Fig.21), a ball can be thrown either from x to z or from z to x if y is not observed. In other words, if y is not observed, a ball thrown from x can reach z and vice versa. On the contrary, given a shaded y, the node blocks the ball and makes x and z conditionally independent. With this definition one can state that in an undirected graph, a node is conditionally independent of its non-neighbours given its neighbours. Technically speaking, <math>X_A</math> is independent of <math>X_C</math> given <math>X_B</math> if the set of nodes <math>X_B</math> separates the nodes <math>X_A</math> from the nodes <math>X_C</math>. Hence, if every path from a node in <math>X_A</math> to a node in <math>X_C</math> includes at least one node in <math>X_B</math>, then we claim that <math> X_A \perp X_C | X_B </math>.<br />
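This separation criterion is easy to check mechanically. The sketch below (a hypothetical helper, not from the notes) runs a breadth-first search that refuses to enter observed nodes; two nodes are separated exactly when the search cannot reach one from the other:<br />

```python
# Graph-separation check for undirected models: a and c are conditionally
# independent given the observed set b iff b blocks every path between them.
from collections import deque

def separated(adj, a, c, b):
    """BFS from a that never enters a node in b; returns True iff c
    is unreachable, i.e. b separates a from c in the undirected graph."""
    blocked, seen, queue = set(b), {a}, deque([a])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            if m == c:
                return False          # reached c along an unblocked path
            if m not in seen and m not in blocked:
                seen.add(m)
                queue.append(m)
    return True

# chain x - y - z: observing y separates x from z
adj = {'x': {'y'}, 'y': {'x', 'z'}, 'z': {'y'}}
assert separated(adj, 'x', 'z', {'y'})
assert not separated(adj, 'x', 'z', set())
```
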
<br />
==== Question ====<br />
<br />
Is it possible to convert undirected models to directed models or vice versa?<br />
<br />
In order to answer this question, consider (Fig.22 ) which illustrates an undirected graph with four nodes - <math>X</math>, <math>Y</math>,<math>Z</math> and <math>W</math>. We can define two facts using Bayes ball method:<br />
<br />
<center><math>\begin{matrix}<br />
X \perp Y | \{W,Z\} & & \\<br />
W \perp Z | \{X,Y\} \\<br />
\end{matrix}</math></center><br />
<br />
[[File:UnDirGraphUnconvert.png|thumb|right|Fig.22 There is no directed equivalent to this graph.]]<br />
<br />
It is simple to see that there is no directed graph satisfying both conditional independence properties. Recalling that directed graphs are acyclic, converting an undirected graph to a directed graph results in at least one node whose arrows are inward-pointing (a v-structure). Without loss of generality we can assume that node <math>Z</math> has two inward-pointing arrows. By the conditional independence semantics of directed graphs, we then have <math> X \perp Y|W</math>, yet the <math>X \perp Y|\{W,Z\}</math> property does not hold. On the other hand, (Fig.23) depicts a directed graph which is characterized by the singleton independence statement <math>X \perp Y </math>. There is no undirected graph on three nodes which can be characterized by this singleton statement. Basically, if we consider the set of all distributions over <math>n</math> random variables, a subset of them can be represented by directed graphical models, while another subset can be modelled by undirected graphs. There is a narrow intersection region between these two subsets, in which probabilistic graphical models may be represented by either directed or undirected graphs.<br />
<br />
[[File:DirGraphUnconvert.png|thumb|right|Fig.23 There is no undirected equivalent to this graph.]]<br />
<br />
==== Parameterization ====<br />
<br />
For undirected graphical models, we would like to obtain a "local" parameterization like the one we used for directed graphical models. For directed graphical models, "local" referred to a node and its parents, <math> \{i, \pi_i\} </math>. The joint probability and the marginals were defined as a product of such local probabilities, inspired by the chain rule of probability theory.<br />
In undirected GMs, "local" functions cannot be represented using conditional probabilities, and we must abandon conditional probabilities altogether. Therefore, the factors no longer have a probabilistic interpretation, and we may choose the "local" functions arbitrarily. However, any "local" function for undirected graphical models should satisfy the following condition:<br />
- Consider <math> X_i </math> and <math> X_j </math> that are not linked; they are conditionally independent given all other nodes. As a result, the "local" functions should factorize the joint probability such that <math> X_i </math> and <math> X_j </math> are placed in different factors.<br />
<br />
It can be shown that defining local functions based only on a node and its corresponding edges (similar to directed graphical models) is not tractable, and we need to follow a different approach. Before defining the "local" functions, we have to introduce a new term from graph theory: the clique. A clique is <br />
a subset of fully connected nodes in a graph G: every node in the clique C is directly connected to every other node in C. In addition, a maximal clique is a clique such that if any other node from the graph G is added to it, the new set is no longer a clique. Consider the undirected graph shown in (Fig. 24); we can list all the cliques as follows:<br />
[[File:graph.png|thumb|right|Fig.24 Undirected graph]]<br />
[[File:graph.png|thumb|right|Fig.24 Undirected graph]]<br />
<br />
* <math> \{X_1, X_3\} </math><br />
* <math> \{X_1, X_2\} </math><br />
* <math> \{X_3, X_5\} </math><br />
* <math> \{X_2, X_4\} </math><br />
* <math> \{X_5, X_6\} </math><br />
* <math> \{X_2, X_5\} </math><br />
* <math> \{X_2, X_5, X_6\} </math><br />
<br />
According to the definition, <math> \{X_2,X_5\} </math> is not a maximal clique, since we can add one more node, <math> X_6 </math>, and still have a clique. Let C be the set of all maximal cliques in <math> G(V, E) </math>: <br />
<br />
<center><math><br />
C = \{c_1, c_2,..., c_n\}<br />
</math></center><br />
<br />
where in aforementioned example <math> c_1 </math> would be <math> \{X_1, X_3\} </math>, and so on. We define the joint probability over all nodes as:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \psi_{c_i} (x_{c_i})<br />
</math></center><br />
<br />
where <math> \psi_{c_i} (x_{c_i})</math> is an arbitrary function with some restrictions. This function is not necessarily a probability, and is defined over each clique. There are only two restrictions on this function: it must be non-negative and real-valued. Usually <math> \psi_{c_i} (x_{c_i})</math> is called a potential function. The <math> Z </math> is a normalization factor determined by:<br />
<br />
<center><math><br />
Z = \sum_{X_V} { \prod_{c_i \in C} \psi_{c_i} (x_{c_i})}<br />
</math></center> <br />
<br />
As a matter of fact, the normalization factor <math> Z </math> is often unimportant, since most of the time it cancels out during computation. For instance, to calculate the conditional probability <math> P(X_A | X_B) </math>, <math> Z </math> cancels between the numerator <math> P(X_A, X_B) </math> and the denominator <math> P(X_B) </math>.<br />
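As a sketch of this parameterization, the following code builds the joint distribution of the model in Fig. 24 from its maximal-clique potentials. The potential tables are arbitrary positive numbers (assumptions for illustration only), and <math>Z</math> is computed by brute-force summation over all configurations:<br />

```python
# Joint distribution of the Fig. 24 undirected model from clique potentials.
# The potential values are arbitrary positive numbers, as the theory allows.
import itertools
import random

random.seed(0)
cliques = [(1, 3), (1, 2), (3, 5), (2, 4), (2, 5, 6)]   # maximal cliques
# psi[c] maps a binary assignment of the clique's variables to a positive value
psi = {c: {vals: random.uniform(0.5, 2.0)
           for vals in itertools.product((0, 1), repeat=len(c))}
       for c in cliques}

def unnormalized(x):
    """x maps variable index -> value; returns the product of potentials."""
    prod = 1.0
    for c in cliques:
        prod *= psi[c][tuple(x[i] for i in c)]
    return prod

# Z sums the product of potentials over all 2^6 configurations
configs = [dict(zip(range(1, 7), v))
           for v in itertools.product((0, 1), repeat=6)]
Z = sum(unnormalized(x) for x in configs)

def p(x):
    return unnormalized(x) / Z

# dividing by Z makes the distribution sum to one
assert abs(sum(p(x) for x in configs) - 1.0) < 1e-9
```

The brute-force computation of <math>Z</math> is exponential in the number of nodes; the elimination algorithm introduced later avoids exactly this cost.<br />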
<br />
As was mentioned above, the product of the potential functions determines the joint probability over all nodes. Because the potential functions may be defined arbitrarily, choosing exponential functions for <math> \psi_{c_i} (x_{c_i})</math> simplifies and reduces the computations. Let the potential function be:<br />
<br />
<br />
<center><math><br />
\psi_{c_i} (x_{c_i}) = \exp (- H_{c_i}(x_{c_i}))<br />
</math></center> <br />
<br />
the joint probability is given by:<br />
<br />
<center><math><br />
P(x_{V}) = \frac{1}{Z} \prod_{c_i \in C} \exp(-H_{c_i}(x_{c_i})) = \frac{1}{Z} \exp \left(- \sum_{c_i \in C} {H_{c_i} (x_{c_i})}\right)<br />
</math></center> <br />
<br />
There is a lot of information contained in the joint probability distribution <math> P(x_{V}) </math>. We define six tasks, listed below, that we would like to accomplish with various algorithms for a given distribution <math> P(x_{V}) </math>.<br />
<br />
===Tasks:===<br />
<br />
* Marginalization <br /><br />
Given <math> P(x_{V}) </math> find <math> P(x_{A}) </math> where A &sub; V<br /><br />
Given <math> P(x_1, x_2, ... , x_6) </math> find <math> P(x_2, x_6) </math> <br />
* Conditioning <br /><br />
Given <math> P(x_V) </math> find <math>P(x_A|x_B) = \frac{P(x_A, x_B)}{P(x_B)}</math> if A &sub; V and B &sub; V .<br />
* Evaluation <br /><br />
Evaluate the probability for a certain configuration. <br />
* Completion <br /><br />
Compute the most probable configuration, i.e. the <math>x_A</math> for which <math> P(x_A|x_B) </math> is largest for a given <math> A </math> and <math> B </math>.<br />
* Simulation <br /><br />
Generate a random configuration from <math> P(x_V) </math>.<br />
* Learning <br /><br />
We would like to find parameters for <math> P(x_V) </math> .<br />
<br />
===Exact Algorithms:===<br />
<br />
To compute a probabilistic inference, i.e. the conditional probability of a variable <math>X</math>, we need to marginalize over all the other random variables <math>X_i</math> and the values they might take, which can require a long running time. To reduce the computational complexity of performing such marginalization, the next sections present exact algorithms that find exact solutions in polynomial time (fast), which are:<br />
* Elimination<br />
* Sum-Product<br />
* Max-Product<br />
* Junction Tree<br />
<br />
= Elimination Algorithm=<br />
In this section we will see how to perform probabilistic inference on graphical models. In other words, we discuss the problem of computing conditional and marginal probabilities in graphical models.<br />
<br />
== Elimination Algorithm on Directed Graphs==<br />
First we assume that E and F are disjoint subsets of the node indices of a graphical model, i.e. <math> X_E </math> and <math> X_F </math> are disjoint subsets of the random variables. Given a graph <math>G =(V,E)</math>, we aim to calculate <math> p(x_F | x_E) </math>, where <math> X_E </math> and <math> X_F </math> represent the evidence and query nodes, respectively. In this section <math> X_F </math> is restricted to a single node; later on, a more powerful inference method will be introduced which is able to make inferences on multiple variables. In order to compute <math> p(x_F | x_E) </math>, we first marginalize the joint probability over the nodes which are neither in <math> X_F </math> nor in <math> X_E </math>, denoted by <math> R = V \setminus ( E \cup F)</math>. <br />
<br />
<center><math><br />
p(x_E, x_F) = \sum_{x_R} {p(x_E, x_F, x_R)}<br />
</math></center><br />
<br />
which can be further marginalized to yield <math> p(x_E) </math>:<br />
<br />
<center><math><br />
p(x_E) = \sum_{x_F} {p(x_E, x_F)}<br />
</math></center><br />
<br />
and then the desired conditional probability is given by:<br />
<br />
<center><math><br />
p(x_F|x_E) = \frac{p(x_E, x_F)}{p(x_E)} <br />
</math></center><br />
<br />
== Example ==<br />
<br />
Let us assume that we are interested in <math> p(x_1 | \bar{x_6}) </math> in (Fig. 21), where <math> \bar{x_6} </math> is an observed value of <math> X_6 </math> and thus may be treated as a constant. According to the rule mentioned above, we have to marginalize the joint probability over the non-evidence, non-query nodes:<br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &\sum_{x_2} \sum_{x_3} \sum_{x_4} \sum_{x_5} p(x_1)p(x_2|x_1)p(x_3|x_1)p(x_4|x_2)p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) \sum_{x_5} p(x_5|x_3)p(\bar{x_6}|x_2,x_5)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1) \sum_{x_4} p(x_4|x_2) m_5(x_2, x_3)<br />
\end{matrix}</math></center><br />
<br />
where, to simplify the notation, we define <math> m_5(x_2, x_3) </math> as the result of the last summation. The last summation is over <math> x_5 </math>, and thus the result depends only on <math> x_2 </math> and <math> x_3</math>. In particular, let <math> m_i(x_{S_i}) </math> denote the expression that arises from performing the summation <math> \sum_{x_i} </math>, where <math> x_{S_i} </math> are the variables, other than <math> x_i </math>, that appear in the summand. Continuing the derivation we have: <br />
<br />
<center><math>\begin{matrix}<br />
p(x_1, \bar{x_6})& = &p(x_1) \sum_{x_2} p(x_2|x_1) \sum_{x_3} p(x_3|x_1)m_5(x_2,x_3)\sum_{x_4} p(x_4|x_2)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)\sum_{x_3}p(x_3|x_1)m_5(x_2,x_3)\\ <br />
& = & p(x_1) \sum_{x_2} p(x_2|x_1)m_4(x_2)m_3(x_1,x_2)\\<br />
& = & p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
<br />
Therefore, the conditional probability is given by:<br />
<center><math><br />
p(x_1|\bar{x_6}) = \frac{p(x_1)m_2(x_1)}{\sum_{x_1} p(x_1)m_2(x_1)}<br />
</math></center><br />
<br />
At the beginning of our computation we assumed that <math> X_6 </math> is observed, and thus the notation <math> \bar{x_6} </math> was used to express this fact. Let <math> X_i </math> be an evidence node whose observed value is <math> \bar{x_i} </math>. We define an evidence potential function, <math> \delta(x_i, \bar{x_i}) </math>, whose value is one if <math> x_i = \bar{x_i} </math> and zero elsewhere. <br />
This function allows us to use summation over <math> x_6 </math> yielding:<br />
<br />
<center><math><br />
m_6(x_2, x_5) = \sum_{x_6} p(x_6|x_2, x_5) \delta(x_6, \bar{x_6})<br />
</math></center><br />
<br />
We can define an algorithm to make inference on directed graphs using elimination techniques. <br />
Let E and F be an evidence set and a query node, respectively. We first choose an elimination ordering I such that F appears last in this ordering. The following figure shows the steps required to perform the elimination algorithm for probabilistic inference on directed graphs:<br />
<br />
<br />
<code><br />
ELIMINATE (G,E,F)<br/><br />
INITIALIZE (G,F)<br/><br />
EVIDENCE(E)<br/><br />
UPDATE(G)<br/><br />
<br />
NORMALIZE(F)<br/><br />
<br />
INITIALIZE(G,F)<br/><br />
Choose an ordering <math>I</math> such that <math>F</math> appears last <br/><br />
:'''For''' each node <math>X_i</math> in <math>V</math> <br/><br />
::Place <math>p(x_i|x_{\pi_i})</math> on the active list <br/><br />
<br />
:'''End'''<br/><br />
<br />
EVIDENCE(E)<br/><br />
:'''For''' each <math>i</math> in <math>E</math> <br/><br />
::Place <math>\delta(x_i|\overline{x_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
UPDATE(G)<br/><br />
:''' For''' each <math>i</math> in <math>I</math> <br/><br />
::Find all potentials from the active list that reference <math>x_i</math> and remove them from the active list <br/><br />
::Let <math>\phi_i(x_{T_i})</math> denote the product of these potentials <br/><br />
::Let <math>m_i(x_{S_i})=\sum_{x_i}\phi_i(x_{T_i})</math> <br/><br />
::Place <math>m_i(x_{S_i})</math> on the active list <br/><br />
:'''End''' <br/><br />
<br />
NORMALIZE(F) <br/><br />
:<math> p(x_F|\overline{x_E})</math> &larr; <math>\phi_F(x_F)/\sum_{x_F}\phi_F(x_F)</math><br/><br />
<br />
</code><br />
<br />
'''Example:''' <br /><br />
For the graph in figure 21, <math>G =(V,E)</math>, consider once again that node <math>x_1</math> is the query node and <math>x_6</math> is the evidence node. <br /><br />
<math>I = \left\{6,5,4,3,2,1\right\}</math> (1 should be the last node, ordering is crucial)<br /><br />
[[File:ClassicExample1.png|thumb|right|Fig.21 Six node example.]]<br />
We must now create an active list. There are two rules that must be followed in order to create this list. <br />
<br />
# For i<math>\in{V}</math> place <math>p(x_i|x_{\pi_i})</math> in active list. <br />
# For i<math>\in</math>{E} place <math>\delta(x_i|\overline{x_i})</math> in active list. <br />
<br />
Here, our active list is:<br />
<math> p(x_1), p(x_2|x_1), p(x_3|x_1), p(x_4|x_2), p(x_5|x_3),\underbrace{p(x_6|x_2, x_5)\delta{(\overline{x_6},x_6)}}_{\phi_6(x_2,x_5, x_6), \sum_{x_6}{\phi_6}=m_{6}(x_2,x_5) }</math><br />
<br />
We first eliminate node <math>X_6</math>. We place <math>m_{6}(x_2,x_5)</math> on the active list, having removed <math>X_6</math>. We now eliminate <math>X_5</math>. <br />
<br />
<center><math> \underbrace{p(x_5|x_3)*m_6(x_2,x_5)}_{m_5(x_2,x_3)} </math></center><br />
<br />
Likewise, we can also eliminate <math>X_4, X_3,</math> and <math>X_2</math> (which yields the unnormalized conditional probability <math>p(x_1|\overline{x_6})</math>), and finally <math>X_1</math>, which yields <math>m_1 = \sum_{x_1}{\phi_1(x_1)}</math>, the normalization factor <math>p(\overline{x_6})</math>.<br />
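The elimination steps above can be sketched in code. The example below uses hypothetical random conditional probability tables for the six-node network (the real example gives no numbers), forms the intermediate factors <math>m_5, m_4, m_3, m_2</math> exactly as in the derivation, and checks the result against brute-force marginalization:<br />

```python
# Elimination for the six-node example: compute p(x1 | x6 = 1) by summing
# out x5, x4, x3, x2 in turn, then compare with brute-force marginalization.
# All CPTs are random illustrative tables over binary variables.
import itertools
import random

random.seed(1)
def cpt(n_parents):
    # random conditional table: parents' assignment -> P(child = 1)
    return {v: random.random()
            for v in itertools.product((0, 1), repeat=n_parents)}

p1 = random.random()                    # P(x1 = 1)
t2, t3 = cpt(1), cpt(1)                 # p(x2|x1), p(x3|x1)
t4, t5 = cpt(1), cpt(1)                 # p(x4|x2), p(x5|x3)
t6 = cpt(2)                             # p(x6|x2,x5)

def b(p, v):                            # Bernoulli lookup: P(value = v)
    return p if v == 1 else 1 - p

def joint(x1, x2, x3, x4, x5, x6):
    return (b(p1, x1) * b(t2[(x1,)], x2) * b(t3[(x1,)], x3)
            * b(t4[(x2,)], x4) * b(t5[(x3,)], x5) * b(t6[(x2, x5)], x6))

X6 = 1                                  # the observed value of x6
# m5(x2,x3) = sum_{x5} p(x5|x3) p(x6|x2,x5)
m5 = {(x2, x3): sum(b(t5[(x3,)], x5) * b(t6[(x2, x5)], X6) for x5 in (0, 1))
      for x2 in (0, 1) for x3 in (0, 1)}
m4 = {(x2,): sum(b(t4[(x2,)], x4) for x4 in (0, 1)) for x2 in (0, 1)}  # = 1
m3 = {(x1, x2): sum(b(t3[(x1,)], x3) * m5[(x2, x3)] for x3 in (0, 1))
      for x1 in (0, 1) for x2 in (0, 1)}
m2 = {(x1,): sum(b(t2[(x1,)], x2) * m4[(x2,)] * m3[(x1, x2)]
                 for x2 in (0, 1))
      for x1 in (0, 1)}

unnorm = {x1: b(p1, x1) * m2[(x1,)] for x1 in (0, 1)}   # p(x1, x6=1)
Z = sum(unnorm.values())                                # p(x6=1)
p_elim = {x1: unnorm[x1] / Z for x1 in (0, 1)}          # p(x1 | x6=1)

# brute-force check: sum the joint over x2..x5 directly
bf = {x1: sum(joint(x1, *rest, X6)
              for rest in itertools.product((0, 1), repeat=4))
      for x1 in (0, 1)}
bfZ = sum(bf.values())
assert all(abs(p_elim[x1] - bf[x1] / bfZ) < 1e-12 for x1 in (0, 1))
```

Note that <math>m_4(x_2) = 1</math> here, reflecting the fact that summing a barren leaf out of a directed model contributes nothing.<br />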
<br />
==Elimination Algorithm on Undirected Graphs==<br />
<br />
[[File:graph.png|thumb|right|Fig.22 Undirected graph G']]<br />
<br />
The first task is to find the maximal cliques and their associated potential functions. <br /><br />
maximal clique: <math>\left\{x_1, x_2\right\}</math>, <math>\left\{x_1, x_3\right\}</math>, <math>\left\{x_2, x_4\right\}</math>, <math>\left\{x_3, x_5\right\}</math>, <math>\left\{x_2,x_5,x_6\right\}</math> <br /><br />
potential functions: <math>\varphi{(x_1,x_2)},\varphi{(x_1,x_3)},\varphi{(x_2,x_4)}, \varphi{(x_3,x_5)}</math> and <math>\varphi{(x_2,x_5,x_6)}</math> <br />
<br />
<math> p(x_1|\overline{x_6})=p(x_1,\overline{x_6})/p(\overline{x_6})\cdots\cdots\cdots\cdots\cdots(*) </math><br />
<br />
<math>p(x_1,\overline{x_6})=\frac{1}{Z}\sum_{x_2,x_3,x_4,x_5,x_6}\varphi{(x_1,x_2)}\varphi{(x_1,x_3)}\varphi{(x_2,x_4)}\varphi{(x_3,x_5)}\varphi{(x_2,x_5,x_6)}\delta{(x_6,\overline{x_6})}<br />
</math><br />
<br />
The <math>\frac{1}{Z}</math> looks crucial, but in fact it has no effect: in (*) both the numerator and the denominator carry the same <math>\frac{1}{Z}</math> term, so it cancels. <br /><br />
The general rule for elimination in an undirected graph is that we can remove a node as long as we first connect all of the remaining neighbours of that node together. Effectively, we form a clique out of the neighbours of that node.<br />
The algorithm used to eliminate nodes in an undirected graph is:<br />
<br />
<br />
<code><br />
<br/><br />
<br />
UndirectedGraphElimination(G,l)<br />
:For each node <math>X_i</math> in <math>I</math><br />
::Connect all of the remaining neighbours of <math>X_i</math><br />
::Remove <math>X_i</math> from the graph <br />
:End <br />
<br />
<br/><br />
</code><br />
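A minimal Python sketch of this loop follows; the adjacency-dict representation and the function name are illustrative choices of ours, not from the notes. It also records the "fill" edges created when neighbours are connected, since those determine the clique sizes discussed below.

```python
# Sketch of UndirectedGraphElimination(G, I): for each node in the ordering,
# connect its remaining neighbours, then remove it from the graph.
from itertools import combinations

def undirected_graph_elimination(adj, order):
    """Eliminate nodes in the given order; return the fill edges added."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    fill = []
    for v in order:
        nbrs = adj[v]
        # connect all remaining neighbours of v into a clique
        for a, b in combinations(sorted(nbrs), 2):
            if b not in adj[a]:
                adj[a].add(b)
                adj[b].add(a)
                fill.append((a, b))
        # remove v from the graph
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return fill

# Star graph: eliminating the centre first connects all of the leaves,
# while eliminating the leaves first adds no edges at all.
star = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1}}
print(undirected_graph_elimination(star, [1, 2, 3, 4]))  # [(2, 3), (2, 4), (3, 4)]
print(undirected_graph_elimination(star, [2, 3, 4, 1]))  # []
```

The star graph makes the ordering effect concrete: a bad ordering creates a large clique, a good one creates none.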
<br />
<br />
'''Example: ''' <br /><br />
For the graph G in figure 24 <br /><br />
when we remove x1, G becomes as in figure 25 <br /><br />
while if we remove x2, G becomes as in figure 26<br />
<br />
[[File:ex.png|thumb|right|Fig.24 ]]<br />
[[File:ex2.png|thumb|right|Fig.25 ]]<br />
[[File:ex3.png|thumb|right|Fig.26 ]]<br />
<br />
An interesting point is that the order of elimination matters a great deal. Consider the two results: removing one node leaves the graph complexity only slightly reduced, while removing another node increases it significantly. We care about the complexity of the graph because it determines the number of calculations required to answer questions about that graph. For a huge graph with thousands of nodes, the node removal order is key to the complexity of the algorithm. In this example, if we remove a leaf first, the largest clique created has size two and the computational complexity is of order <math>N^2</math>, while removing the centre node first creates a clique of size five and complexity of order <math>N^5</math> (where <math>N</math> is the number of states per node). Unfortunately, finding the node removal order that minimizes the largest clique is an NP-hard problem, so no efficient algorithm is known for producing the optimal ordering.<br />
<br />
==Moralization==<br />
So far we have shown how to use elimination to successively remove nodes from an undirected graph. We know that this is useful in the process of marginalization. We can now turn to the question of what will happen when we have a directed graph. It would be nice if we could somehow reduce the directed graph to an undirected form and then apply the previous elimination algorithm. This reduction is called moralization and the graph that is produced is called a moral graph. <br />
<br />
To moralize a graph we first need to connect the parents of each node together. This makes sense intuitively because the parents of a node need to be considered together in the undirected graph and this is only done if they form a type of clique. By connecting them together we create this clique. <br />
<br />
After the parents are connected together we can just drop the orientation on the edges in the directed graph. By removing the directions we force the graph to become undirected. <br />
<br />
The previous elimination algorithm can now be applied to the new moral graph. We do this by taking the conditional probabilities of the directed graph, <math> P(x_i|\pi_{x_i}) </math>, as the potential functions <math> \psi_{c_i}(x_{c_i}) </math> on the corresponding cliques of the moral graph.<br />
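The two moralization steps (marry the parents, then drop orientations) can be sketched as follows; the parent-dict representation and function name are our own illustrative choices.

```python
# Sketch of moralization: the DAG is given as a dict mapping each node
# to the set of its parents.
from itertools import combinations

def moralize(parents):
    """Return the undirected (moral) graph as (nodes, set of edges)."""
    nodes = set(parents)
    edges = set()
    for child, pa in parents.items():
        for p in pa:                       # drop orientation on parent->child edges
            edges.add(frozenset((p, child)))
        for a, b in combinations(pa, 2):   # "marry" co-parents of the same child
            edges.add(frozenset((a, b)))
    return nodes, edges

# In the six-node example, x6 has parents x2 and x5,
# so moralization adds the edge {x2, x5}.
parents = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: {3}, 6: {2, 5}}
_, edges = moralize(parents)
print(frozenset((2, 5)) in edges)  # True: the marrying edge
```
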
<br />
'''Example:'''<br /><br />
I = <math>\left\{x_6,x_5,x_4,x_3,x_2,x_1\right\}</math><br /><br />
When we moralize the directed graph in figure 27, we obtain the<br />
undirected graph in figure 28.<br />
<br />
[[File:moral.png|thumb|right|Fig.27 Original Directed Graph]]<br />
[[File:moral3.png|thumb|right|Fig.28 Moral Undirected Graph]]<br />
<br />
=Elimination Algorithm on Trees=<br />
<br />
<br />
'''Definition of a tree:'''<br /><br />
A tree is an undirected graph in which any two vertices are connected by exactly one simple path. In other words, any connected graph without cycles is a tree.<br />
<br />
If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree.<br />
<br />
==Belief Propagation Algorithm (Sum Product Algorithm)==<br />
<br />
One of the main disadvantages of the elimination algorithm is that the ordering of the nodes determines the number of calculations required to produce a result. The optimal ordering is difficult to find, and without a decent ordering the algorithm may become very slow. In response to this we can introduce the sum-product algorithm. It has one major advantage over the elimination algorithm: it is faster. The sum-product algorithm has the same complexity when it computes the probability of one node as when it computes the probabilities of all the nodes in the graph. Unfortunately, the sum-product algorithm also has one disadvantage: unlike the elimination algorithm, it cannot be used on every graph. The sum-product algorithm works only on trees. <br />
<br />
For undirected graphs if there is only one path between any two pair of nodes then that graph is a tree (Fig.29). If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree (Fig.30). <br />
<br />
[[File:UnDirTree.png|thumb|right|Fig.29 Undirected tree]]<br />
[[File:Dir_Tree.png|thumb|right|Fig.30 Directed tree]]<br />
<br />
For the undirected graph <math>G(V, E)</math> (Fig.29) we can write the joint probability distribution function in the following way.<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\prod_{i \in V}\psi(x_i)\prod_{(i,j) \in E}\psi(x_i, x_j)</math></center><br />
<br />
We know that in general we can not convert a directed graph into an undirected graph. There is however an exception to this rule when it comes to trees. In the case of a directed tree there is an algorithm that allows us to convert it to an undirected tree with the same properties. <br /><br />
Take the above example (Fig.30) of a directed tree. We can write the joint probability distribution function as: <br />
<center><math> P(x_v) = P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
If we want to convert this graph to the undirected form shown in (Fig.29) then we can use the following set of rules.<br />
* If <math>\gamma</math> is the root then: <math> \psi(x_\gamma) = P(x_\gamma) </math>.<br />
* If <math>\gamma</math> is NOT the root then: <math> \psi(x_\gamma) = 1 </math>.<br />
* If <math>\left\lbrace i \right\rbrace</math> = <math>\pi_j</math> then: <math> \psi(x_i, x_j) = P(x_j | x_i) </math>.<br />
So now we can rewrite the above equation for (Fig.30) as:<br />
<center><math> P(x_v) = \frac{1}{Z(\psi)}\psi(x_1)...\psi(x_5)\psi(x_1, x_2)\psi(x_1, x_3)\psi(x_2, x_4)\psi(x_2, x_5) </math></center><br />
<center><math> = \frac{1}{Z(\psi)}P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2) </math></center><br />
<br />
==Elimination Algorithm on a Tree==<br />
<br />
[[File:fig1.png|thumb|right|Fig.31 Message-passing in Elimination Algorithm]]<br />
<br />
We will derive the Sum-Product algorithm from the point of view<br />
of the Eliminate algorithm. To marginalize <math>x_1</math> in<br />
Fig.31,<br />
<center><math>\begin{matrix}<br />
p(x_1)&=&\sum_{x_2}\sum_{x_3}\sum_{x_4}\sum_{x_5}p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2)p(x_5|x_3) \\<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\sum_{x_3}p(x_3|x_2)\sum_{x_4}p(x_4|x_2)\underbrace{\sum_{x_5}p(x_5|x_3)} \\<br />
<br />
&=&p(x_1)\sum_{x_2}p(x_2|x_1)\underbrace{\sum_{x_3}p(x_3|x_2)m_5(x_3)}\underbrace{\sum_{x_4}p(x_4|x_2)} \\<br />
<br />
&=&p(x_1)\underbrace{\sum_{x_2}m_3(x_2)m_4(x_2)} \\<br />
<br />
&=&p(x_1)m_2(x_1)<br />
\end{matrix}</math></center><br />
where,<br />
<center><math>\begin{matrix}<br />
m_5(x_3)=\sum_{x_5}p(x_5|x_3)=\sum_{x_5}\psi(x_5)\psi(x_5,x_3)=\mathbf{m_{53}(x_3)} \\<br />
m_4(x_2)=\sum_{x_4}p(x_4|x_2)=\sum_{x_4}\psi(x_4)\psi(x_4,x_2)=\mathbf{m_{42}(x_2)} \\<br />
m_3(x_2)=\sum_{x_3}p(x_3|x_2)m_5(x_3)=\sum_{x_3}\psi(x_3)\psi(x_3,x_2)m_5(x_3)=\mathbf{m_{32}(x_2)}, \end{matrix}</math></center><br />
which is essentially (potential of the node)<math>\times</math>(potential of<br />
the edge)<math>\times</math>(message from the child).<br />
<br />
The term "<math>m_{ji}(x_i)</math>" represents the intermediate factor between the eliminated variable, ''j'', and the remaining neighbor of the variable, ''i''. Thus, in the above case, we will use <math>m_{53}(x_3)</math> to denote <math>m_5(x_3)</math>, <math>m_{42}(x_2)</math> to denote<br />
<math>m_4(x_2)</math>, and <math>m_{32}(x_2)</math> to denote <math>m_3(x_2)</math>. We refer to the<br />
intermediate factor <math>m_{ji}(x_i)</math> as a "message" that ''j''<br />
sends to ''i''. (Fig.31)<br />
<br />
In general,<center><math>\begin{matrix}<br />
m_{ji}(x_i)=\sum_{x_j}(<br />
\psi(x_j)\psi(x_j,x_i)\prod_{k\in{\mathcal{N}(j)\backslash i}}m_{kj}(x_j))<br />
\end{matrix}</math></center><br />
<br />
Note: It is important to know that the BP algorithm gives the exact solution only if the graph is a tree; however, experiments have shown that BP gives acceptable approximate answers even when the graph has some loops.<br />
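The message recursion above can be sketched numerically on the tree of Fig.31. The binary states and the random potentials below are made-up illustrations; the marginal obtained from the messages can be checked against brute-force enumeration.

```python
# Sum-product messages m_ji(x_i) = sum_{x_j} psi(x_j) psi(x_j, x_i)
# prod_{k in N(j)\i} m_kj(x_j) on the tree of Fig.31.
import numpy as np

rng = np.random.default_rng(0)
edges = [(1, 2), (2, 3), (2, 4), (3, 5)]          # the tree of Fig.31
nbrs = {i: set() for i in range(1, 6)}
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)

psi_node = {i: rng.random(2) + 0.1 for i in range(1, 6)}   # psi(x_i), binary states
psi_edge = {e: rng.random((2, 2)) + 0.1 for e in edges}    # psi(x_i, x_j)

def edge_pot(j, i):
    """Potential of edge {j, i}, oriented so rows index x_j and columns x_i."""
    return psi_edge[(j, i)] if (j, i) in psi_edge else psi_edge[(i, j)].T

def message(j, i):
    """m_ji(x_i), computed recursively from the leaves inward."""
    prod = psi_node[j].copy()
    for k in nbrs[j] - {i}:
        prod *= message(k, j)
    return edge_pot(j, i).T @ prod          # the sum over x_j

# Marginal of x_1: its node potential times the single incoming message.
p1 = psi_node[1] * message(2, 1)
p1 /= p1.sum()
print(p1)
```
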
<br />
==Elimination To Sum Product Algorithm==<br />
<br />
[[File:fig2.png|thumb|right|Fig.32 All of the messages needed to compute all singleton<br />
marginals]]<br />
<br />
The Sum-Product algorithm allows us to compute all<br />
marginals in the tree by passing messages inward from the leaves of<br />
the tree to an (arbitrary) root, and then passing it outward from the<br />
root to the leaves, again using the above equation at each step. The net effect is<br />
that a single message will flow in both directions along each edge.<br />
(See Fig.32) Once all such messages have been computed using the above equation,<br />
we can compute desired marginals. One of the major advantages of this algorithm is that<br />
messages can be reused which reduces the computational cost heavily.<br />
<br />
As shown in Fig.32, to compute the marginal of <math>X_1</math> using<br />
elimination, we eliminate <math>X_5</math>, which involves computing a message<br />
<math>m_{53}(x_3)</math>, then eliminate <math>X_4</math> and <math>X_3</math> which involves<br />
messages <math>m_{32}(x_2)</math> and <math>m_{42}(x_2)</math>. We subsequently eliminate<br />
<math>X_2</math>, which creates a message <math>m_{21}(x_1)</math>.<br />
<br />
Suppose that we want to compute the marginal of <math>X_2</math>. As shown in<br />
Fig.33, we first eliminate <math>X_5</math>, which creates <math>m_{53}(x_3)</math>, and<br />
then eliminate <math>X_3</math>, <math>X_4</math>, and <math>X_1</math>, passing messages<br />
<math>m_{32}(x_2)</math>, <math>m_{42}(x_2)</math> and <math>m_{12}(x_2)</math> to <math>X_2</math>.<br />
<br />
[[File:fig3.png|thumb|right|Fig.33 The messages formed when computing the marginal of <math>X_2</math>]]<br />
<br />
Since the messages can be "reused", marginals over all possible<br />
elimination orderings can be computed by computing all possible<br />
messages which is small in numbers compared to the number of<br />
possible elimination orderings.<br />
<br />
The Sum-Product algorithm is not only based on the above equation, but also ''Message-Passing Protocol''.<br />
'''Message-Passing Protocol''' tells us that a node can<br />
send a message to a neighboring node when (and only when) it has<br />
received messages from all of its other neighbors.<br />
<br />
<br />
<br />
===For Directed Graph===<br />
Previously we stated that:<br />
<center><math><br />
p(x_F,\bar{x}_E)=\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E),<br />
</math></center><br />
<br />
Using the above equation, we find the marginal of <math>\bar{x}_E</math>.<br />
<center><math>\begin{matrix}<br />
p(\bar{x}_E)&=&\sum_{x_F}\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E) \\<br />
&=&\sum_{x_v}p(x_F,x_E)\delta (x_E,\bar{x}_E)<br />
\end{matrix}</math></center><br />
<br />
Now we denote:<br />
<center><math><br />
p^E(x_v) = p(x_v) \delta (x_E,\bar{x}_E)<br />
</math></center><br />
<br />
Since the sets, ''F'' and ''E'', add up to <math>\mathcal{V}</math>,<br />
<math>p(x_v)</math> is equal to <math>p(x_F,x_E)</math>. Thus we can substitute the<br />
definition of <math>p^E(x_v)</math> into the two equations above, and they become:<br />
<center><math>\begin{matrix}<br />
p(x_F,\bar{x}_E) = \sum_{x_E} p^E(x_v), \\<br />
p(\bar{x}_E) = \sum_{x_v}p^E(x_v)<br />
\end{matrix}</math></center><br />
<br />
We are interested in finding the conditional probability. We<br />
substitute the previous results into the conditional<br />
probability equation.<br />
<br />
<center><math>\begin{matrix}<br />
p(x_F|\bar{x}_E)&=&\frac{p(x_F,\bar{x}_E)}{p(\bar{x}_E)} \\<br />
&=&\frac{\sum_{x_E}p^E(x_v)}{\sum_{x_v}p^E(x_v)}<br />
\end{matrix}</math></center><br />
<math>p^E(x_v)</math> is an unnormalized version of conditional probability,<br />
<math>p(x_F|\bar{x}_E)</math>. <br />
<br />
===For Undirected Graphs===<br />
<br />
We denote <math>\psi^E</math> to be:<br />
<center><math>\begin{matrix}<br />
\psi^E(x_i) = \psi(x_i)\delta(x_i,\bar{x}_i),& & \mbox{if } i\in{E} \\<br />
\psi^E(x_i) = \psi(x_i),& & \mbox{otherwise}<br />
\end{matrix}</math></center><br />
<br />
==Max-Product==<br />
Because multiplication distributes over max as well as sum (for nonnegative factors):<br />
<br />
<center><math>\begin{matrix}<br />
\max(ab,ac) = a\max(b,c), & a \geq 0<br />
\end{matrix}</math></center><br />
<br />
Formally, both the sum-product and max-product are commutative semirings.<br />
<br />
We would like to find the Maximum probability that can be achieved by some set of random variables given a set of configurations. The algorithm is similar to the sum product except we replace the sum with max. <br /><br />
<br />
[[File:suks.png|thumb|right|Fig.33 Max Product Example]]<br />
<br />
<center><math>\begin{matrix}<br />
\max_{x_v}{P(x_v)} & = & \max_{x_1}\max_{x_2}\max_{x_3}\max_{x_4}\max_{x_5}{P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_2)P(x_5|x_3)} \\<br />
& = & \max_{x_1}{P(x_1)}\max_{x_2}{P(x_2|x_1)}\max_{x_4}{P(x_4|x_2)}\max_{x_3}{P(x_3|x_2)}\max_{x_5}{P(x_5|x_3)}<br />
\end{matrix}</math></center><br />
<br />
To maximize the conditional <math>p(x_F|\bar{x}_E)</math>, we replace the sum in the message-passing equation with a max:<br />
<br />
<center><math>m_{ji}(x_i)=\sum_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}}</math></center><br />
<center><math>m^{max}_{ji}(x_i)=\max_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}}</math></center><br />
<br />
<br />
'''Example:''' <br />
Consider the graph in Figure.33. <br />
<center><math> m^{max}_{53}(x_3)=\max_{x_5}{\psi^{E}{(x_5)}\psi{(x_3,x_5)}} </math></center><br />
<center><math> m^{max}_{32}(x_2)=\max_{x_3}{\psi^{E}{(x_3)}\psi{(x_2,x_3)}m^{max}_{53}(x_3)} </math></center><br />
<br />
==Maximum configuration==<br />
We would also like to find the value of the <math>x_i</math>s which produces the largest value for the given expression. To do this we replace the max from the previous section with argmax. <br /><br />
<math>m_{53}(x_3)= \arg\max_{x_5}\psi{(x_5)}\psi{(x_5,x_3)}</math><br /><br />
<math>\log{m^{max}_{ji}(x_i)}=\max_{x_j}{\log{\psi^{E}{(x_j)}}}+\log{\psi{(x_i,x_j)}}+\sum_{k\in{N(j)\backslash{i}}}\log{m^{max}_{kj}{(x_j)}}</math><br /><br />
In many cases we want to use the log of this expression because the numbers tend to be very high. Also, it is important to note that this also works in the continuous case where we replace the summation sign with an integral.<br />
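The max-product recursion with argmax bookkeeping (a Viterbi-style backtrace) can be sketched on a small chain <math>x_1 - x_2 - x_3</math>; the chain and the binary potentials below are made-up for illustration, not from the notes.

```python
# Max-product messages passed inward from the leaf x3 toward the root x1,
# recording the maximizing value of each eliminated variable.
import numpy as np

psi_node = {1: np.array([0.6, 0.4]),
            2: np.array([0.3, 0.7]),
            3: np.array([0.5, 0.5])}
psi_edge = {(1, 2): np.array([[0.9, 0.1], [0.2, 0.8]]),   # indexed [x1, x2]
            (2, 3): np.array([[0.7, 0.3], [0.4, 0.6]])}   # indexed [x2, x3]

# m^max_{32}(x_2) = max_{x_3} psi(x_3) psi(x_2, x_3), argmax recorded
t3 = psi_node[3] * psi_edge[(2, 3)]
m32, arg32 = t3.max(axis=1), t3.argmax(axis=1)

# m^max_{21}(x_1) = max_{x_2} psi(x_2) psi(x_1, x_2) m^max_{32}(x_2)
t2 = psi_node[2] * psi_edge[(1, 2)] * m32
m21, arg21 = t2.max(axis=1), t2.argmax(axis=1)

# Backtrace: pick the best x1, then follow the recorded argmaxes outward.
x1 = int((psi_node[1] * m21).argmax())
x2 = int(arg21[x1])
x3 = int(arg32[x2])
print((x1, x2, x3))   # (1, 1, 1)
```

The recorded argmax tables are exactly the "replace max with argmax" step of this section: the max-messages give the value of the best configuration, and the backtrace recovers the configuration itself.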
<br />
=Parameter Learning=<br />
<br />
The goal of graphical models is to build a useful representation of the input data in order to understand and design learning algorithms. A graphical model provides a representation of the joint probability distribution over its nodes (random variables). One of the most important features of a graphical model is that it represents the conditional independencies between the nodes. This is achieved using local functions which are combined into a factorization of the joint probability distribution, and that factorization in turn encodes the conditional independencies in the distribution. That does not mean, however, that the graphical model represents all the independence assumptions that hold. <br />
<br />
==Basic Statistical Problems==<br />
In statistics there are a number of different 'standard' problems that always appear in one form or another. They are as follows: <br />
<br />
* Regression<br />
* Classification<br />
* Clustering<br />
* Density Estimation<br />
<br />
<br />
<br />
===Regression===<br />
In regression we have a set of data points <math> (x_i, y_i) </math> for <math> i = 1...n </math> and we would like to determine the way that the variables x and y are related. In certain cases such as (Fig.34) we try to fit a line (or other type of function) through the points in such a way that it describes the relationship between the two variables. <br />
<br />
[[File:regression.png|thumb|right|Fig.34 Regression]]<br />
<br />
Once the relationship has been determined we can give a functional value to the following expression. In this way we can determine the value (or distribution) of y if we have the value for x. <br />
<math>P(y|x)=\frac{P(y,x)}{P(x)} = \frac{P(y,x)}{\int_{y}{P(y,x)dy}}</math><br />
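As a minimal numerical illustration of fitting a line through points <math>(x_i, y_i)</math>, here is a least-squares sketch; the data points are made-up and chosen to lie exactly on a line.

```python
# Fit y = slope * x + intercept to data points by least squares.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])          # lies exactly on y = 2x + 1

A = np.column_stack([x, np.ones_like(x)])   # design matrix [x, 1]
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
print(slope, intercept)                     # ~2.0, ~1.0
```
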
<br />
===Classification===<br />
In classification we also have a set of data points, each containing a set of features <math> (x_1, x_2,\ldots ,x_i) </math> for <math> i = 1...n </math>, and we would like to assign each data point to one of a given number of classes y. Consider the example in (Fig.35) where two sets of features have been divided into the sets + and - by a line. The purpose of classification is to find this line and then place any new points into one group or the other. <br />
<br />
[[File:Classification.png|thumb|right|Fig.35 Classify Points into Two Sets]]<br />
<br />
We would like to obtain the probability distribution of the following equation, where c is the class and x and y are the data point's coordinates. In simple terms, we would like to find the probability that a point belongs to class c given that its coordinates are x and y. <br />
<center><math> P(c|x,y)=\frac{P(c,x,y)}{P(x,y)} = \frac{P(c,x,y)}{\sum_{c}{P(c,x,y)}} </math></center><br />
<br />
===Clustering===<br />
Clustering is an unsupervised learning method that assigns data points to groups, or clusters, based on the similarity between the data points. Clustering is similar to classification except that we do not know the groups before we gather and examine the data. We would like to find the probability distribution of the following equation without knowing the value of c. <br />
<center><math> P(c|x)=\frac{P(c,x)}{P(x)}\ \ c\ unknown </math></center><br />
<br />
===Density Estimation===<br />
Density Estimation is the problem of modeling a probability density function p(x), given a finite number of data points<br />
drawn from that density function. <br />
<center><math> P(y|x)=\frac{P(y,x)}{P(x)} \ \ y\ unknown </math></center><br />
<br />
We can use graphs to represent the four types of statistical problems that have been introduced so far. The first graph (Fig.36(a)) can be used to represent either the Regression or the Classification problem because both the X and the Y variables are known. In the second graph (Fig.36(b)) the value of the Y variable is unknown, so we can tell that this graph represents the Clustering and Density Estimation situations. <br />
<br />
[[File:RegClass.png|thumb|right|Fig.36(a) Regression or classification (b) Clustering or Density Estimation]]<br />
<br />
<br />
==Likelihood Function==<br />
Recall that the probability model <math>p(x|\theta)</math> has the intuitive interpretation of assigning probability to X for each fixed value of <math>\theta</math>. In the Bayesian approach this intuition is formalized by treating <math>p(x|\theta)</math> as a conditional probability distribution. In the Frequentist approach, however, we treat <math>p(x|\theta)</math> as a function of <math>\theta</math> for fixed x, and refer to <math>p(x|\theta)</math> as the likelihood function.<br />
<center><math><br />
L(\theta;x)= p(x|\theta)</math></center><br />
where <math>p(x|\theta)</math> is the likelihood <math>L(\theta; x)</math><br />
<center><math><br />
l(\theta;x)=log(p(x|\theta))<br />
</math></center><br />
where <math>log(p(x|\theta))</math> is the log likelihood <math>l(\theta; x)</math><br />
<br />
Since <math>p(x)</math> in the denominator of Bayes Rule is independent of <math>\theta</math> we can consider it as a constant and we can draw the conclusion that:<br />
<br />
<center><math><br />
p(\theta|x) \propto p(x|\theta)p(\theta)<br />
</math></center><br />
<br />
Symbolically, we can interpret this as follows:<br />
<center><math><br />
Posterior \propto likelihood \times prior<br />
</math></center><br />
<br />
where we see that in the Bayesian approach the likelihood can be<br />
viewed as a data-dependent operator that transforms between the<br />
prior probability and the posterior probability.<br />
<br />
<br />
===Maximum likelihood===<br />
Maximum likelihood estimation finds the optimal values of the parameters by maximizing a likelihood function formed from the training data. Suppose in particular that we force the Bayesian to choose a<br />
particular value of <math>\theta</math>; that is, to reduce the posterior<br />
distribution <math>p(\theta|x)</math> to a point estimate. Various<br />
possibilities present themselves; in particular one could choose the<br />
mean of the posterior distribution or perhaps the mode.<br />
<br />
<br />
(i) the mean of the posterior (expectation):<br />
<center><math><br />
\hat{\theta}_{Bayes}=\int \theta p(\theta|x)\,d\theta<br />
</math></center><br />
<br />
is called ''Bayes estimate''.<br />
<br />
OR<br />
<br />
(ii) the mode of posterior:<br />
<center><math>\begin{matrix}<br />
\hat{\theta}_{MAP}&=&argmax_{\theta} p(\theta|x) \\<br />
&=&argmax_{\theta}p(x|\theta)p(\theta)<br />
\end{matrix}</math></center><br />
<br />
Note that MAP is '''Maximum a posterior'''.<br />
<br />
When the prior probability <math>p(\theta)</math> is taken to be uniform on <math>\theta</math>, the MAP estimate reduces to the maximum likelihood estimate: <math>\hat{\theta}_{MAP} = \hat{\theta}_{ML}</math>.<br />
<br />
<center><math> MAP = argmax_{\theta} p(x|\theta) p(\theta) </math></center><br />
<br />
When the prior is not taken to be uniform, the MAP estimate maximizes the product <math>p(x|\theta)p(\theta)</math>; since the logarithm is a monotonic function, maximizing the log of this product does not alter the optimizing value.<br />
<br />
Thus, one has:<br />
<center><math><br />
\hat{\theta}_{MAP}=argmax_{\theta} \{ log p(x|\theta) + log<br />
p(\theta) \}<br />
</math></center><br />
as an alternative expression for the MAP estimate.<br />
<br />
Here, <math>log (p(x|\theta))</math> is log likelihood and the "penalty" is the<br />
additive term <math>log(p(\theta))</math>. Penalized log likelihoods are widely<br />
used in Frequentist statistics to improve on maximum likelihood<br />
estimates in small sample settings.<br />
<br />
===Example : Bernoulli trials===<br />
<br />
Consider the simple experiment where a biased coin is tossed four times. Suppose now that we also have some data <math>D</math>: <br />e.g. <math>D = \left\lbrace h,h,h,t\right\rbrace </math>. We want to use this data to estimate <math>\theta</math>. The probability of observing head is <math> p(H)= \theta</math> and the probability of observing a tail is <math> p(T)= 1-\theta</math>.<br />
where the conditional probability for a single toss is <center><math> P(x_i|\theta) = \theta^{x_i}(1-\theta)^{(1-x_i)} </math></center><br />
<br />
We would now like to use the ML technique. Since all of the variables are iid, there are no dependencies between them, and so the graphical model has no edges from one node to another.<br />
<br />
How do we find the joint probability distribution function for these variables? Well since they are all independent we can just multiply the marginal probabilities and we get the joint probability. <br />
<center><math>L(\theta;x) = \prod_{i=1}^n P(x_i|\theta)</math></center><br />
This is in fact the likelihood that we want to work with. Now let us try to maximise it: <br />
<center><math>\begin{matrix}<br />
l(\theta;x) & = & log(\prod_{i=1}^n P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(P(x_i|\theta)) \\<br />
& = & \sum_{i=1}^n log(\theta^{x_i}(1-\theta)^{1-x_i}) \\<br />
& = & \sum_{i=1}^n x_ilog(\theta) + \sum_{i=1}^n (1-x_i)log(1-\theta) \\<br />
\end{matrix}</math></center><br />
Take the derivative and set it to zero: <br />
<br />
<center><math> \frac{\partial l}{\partial\theta} = 0 </math></center><br />
<center><math> \frac{\partial l}{\partial\theta} = \sum_{i=1}^{n}\frac{x_i}{\theta} - \sum_{i=1}^{n}\frac{1-x_i}{1-\theta} = 0 </math></center><br />
<center><math> \Rightarrow \frac{\sum_{i=1}^{n}x_i}{\theta} = \frac{\sum_{i=1}^{n}(1-x_i)}{1-\theta} </math></center><br />
<center><math> \frac{NH}{\theta} = \frac{NT}{1-\theta} </math></center> <br />
Where: <br />
NH = the number of observed heads <br /><br />
NT = the number of observed tails <br /><br />
Hence, <math>NT + NH = n</math> <br /><br />
<br />
And now we can solve for <math>\theta</math>: <br />
<br />
<center><math>\begin{matrix}<br />
\theta & = & \frac{(1-\theta)NH}{NT} \\<br />
\theta + \theta\frac{NH}{NT} & = & \frac{NH}{NT} \\<br />
\theta(\frac{NT+NH}{NT}) & = & \frac{NH}{NT} \\<br />
\theta & = & \frac{\frac{NH}{NT}}{\frac{n}{NT}} = \frac{NH}{n}<br />
\end{matrix}</math></center><br />
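The closed form <math>\hat{\theta} = NH/n</math> can be checked numerically. This sketch uses the <math>D = \left\lbrace h,h,h,t\right\rbrace</math> data above, and confirms the result with a grid search over the log likelihood.

```python
# ML estimate for Bernoulli trials: theta_hat = NH / n.
import numpy as np

data = np.array([1, 1, 1, 0])               # h, h, h, t
theta_ml = data.mean()                      # NH / n = 3/4
print(theta_ml)                             # 0.75

# A grid search confirms that 0.75 maximizes the log likelihood.
grid = np.linspace(0.01, 0.99, 981)
loglik = data.sum() * np.log(grid) + (len(data) - data.sum()) * np.log(1 - grid)
print(grid[loglik.argmax()])                # ~0.75
```
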
<br />
===Example : Multinomial trials===<br />
Recall from the previous example that a Bernoulli trial has only two outcomes (e.g. Head/Tail, Failure/Success,…). A Multinomial trial is a multivariate generalization of the Bernoulli trial with K number of possible outcomes, where K > 2. Let <math> p(k) = \theta_k </math> be the probability of outcome k. All the <math>\theta_k</math> parameters must be:<br />
<br />
<math> 0 \leq \theta_k \leq 1</math><br />
<br />
and<br />
<br />
<math> \sum_k \theta_k = 1</math><br />
<br />
Consider the example of rolling a die M times and recording the number of times each of the six die's faces observed. Let <math> N_k </math> be the number of times that face k was observed.<br />
<br />
Let <math>[x^m = k]</math> be a binary indicator that equals one if <math>x^m = k</math> and zero otherwise. The likelihood function for the Multinomial distribution is:<br />
<br />
<math>l(\theta; D) = log( p(D|\theta) )</math><br />
<br />
<math>= log(\prod_m \theta_{x^m})</math><br />
<br />
<math>= log(\prod_m \theta_{1}^{[x^m = 1]} ... \theta_{k}^{[x^m = k]})</math><br />
<br />
<math>= \sum_k log(\theta_k) \sum_m [x^m = k]</math><br />
<br />
<math>= \sum_k N_k log(\theta_k)</math><br />
<br />
Take the derivatives, subject to the constraint <math>\sum_k \theta_k = 1</math> (handled with a Lagrange multiplier, which turns out to equal M), and set them to zero:<br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = 0</math><br />
<br />
<math>\frac{\partial l}{\partial\theta_k} = \frac{N_k}{\theta_k} - M = 0</math><br />
<br />
<math>\Rightarrow \theta_k = \frac{N_k}{M}</math><br />
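The estimate <math>\theta_k = N_k/M</math> is just the empirical frequency of each face; here is a quick numerical sketch with made-up die rolls.

```python
# ML estimate for Multinomial trials: theta_k = N_k / M.
import numpy as np

rolls = np.array([1, 3, 3, 6, 2, 3, 5, 1, 6, 6])   # M = 10 made-up die rolls
M = len(rolls)
counts = np.array([(rolls == k).sum() for k in range(1, 7)])   # N_k per face
theta_hat = counts / M                                         # N_k / M
print(theta_hat)
```
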
<br />
<br />
===Example: Univariate Normal===<br />
Now let us assume that the observed values come from normal distribution. <br /><br />
Our new model looks like:<br />
<center><math>P(x_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}} </math></center><br />
Now to find the likelihood we once again multiply the independent marginal probabilities to obtain the joint probability and the likelihood function. <br />
<center><math> L(\theta;x) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}}</math></center> <br />
<center><math> \max_{\theta}l(\theta;x) = \max_{\theta}\sum_{i=1}^{n}\left(-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^{2}+log\frac{1}{\sqrt{2\pi}\sigma}\right) </math></center><br />
Now, since our parameter theta is in fact a set of two parameters, <br />
<center><math>\theta = (\mu, \sigma)</math></center><br />
we must estimate each of the parameters separately. <br />
<center><math>\frac{\partial l}{\partial \mu} = \sum_{i=1}^{n} \left( \frac{x_i - \mu}{\sigma^2} \right) = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i</math></center><br />
<center><math>\frac{\partial l}{\partial \sigma ^{2}} = \frac{1}{2\sigma ^4} \sum _{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2} \frac{1}{\sigma ^2} = 0</math></center><br />
<center><math> \Rightarrow \hat{\sigma} ^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 </math></center><br />
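A quick numerical check of the two estimates; note that <math>\hat{\sigma}^2</math> uses the <math>1/n</math> factor (the biased ML variance), not <math>1/(n-1)</math>. The data are made-up.

```python
# ML estimates for a univariate normal: sample mean and (biased) variance.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()     # same as x.var(ddof=0)
print(mu_hat, sigma2_hat)                   # 5.0 4.0
```
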
<br />
==Discriminative vs Generative Models==<br />
[[File:GenerativeModel.png|thumb|right|Fig.36i Generative Model represented in a graph.]]<br />
(beginning of Oct. 18)<br />
<br />
If we call the evidence/features variable <math>X\,\!</math> and the output variable <math>Y\,\!</math>, one way to model a classifier is to base the definition of the joint distribution on <math>p(X|Y)\,\!</math> and another is to do it based on <math>p(Y|X)\,\!</math>. The first of these two approaches is called generative, while the second is called discriminative. The philosophy behind this naming becomes clear by looking at the way each conditional probability function tries to present a model. In practice, using generative models (e.g. the Bayes Classifier) often requires assumptions which may not be valid for the nature of the problem, and hence can make a model depart from the primary intentions of a design. This may not be the case for discriminative models (e.g. Logistic Regression), as they do not depend on many assumptions besides the given data.<br />
<br />
[[File:DiscriminativeModel.png|thumb|right|Fig.36ii Discriminative Model represented in a graph.]]<br />
<br />
Given <math>N</math> variables, we have a full joint distribution in a generative model. In this model we can identify the conditional independencies between various random variables. This joint distribution can be factorized into various conditional distributions. One can also define the prior distributions that affect the variables.<br />
Here is an example that represents a generative model for classification in terms of a directed graphical model, shown in Figure 36i. To fit the model we have to estimate the class-conditional probability <math>P(X|Y)</math> and the prior probability <math>P(Y)</math>. Examples that use generative approaches are Hidden Markov models, Markov random fields, etc. <br />
<br />
The discriminative approach used in classification is displayed in terms of a graph in Figure 36ii. In discriminative models the dependencies between the various random variables are not explicitly defined; instead we directly estimate the conditional probability <math>P(Y|X)</math>. Examples that use the discriminative approach are neural networks, logistic regression, etc.<br />
<br />
Sometimes it becomes very hard to compute <math>P(X|Y)</math> if <math>X</math> is high-dimensional (like data from images). Hence, we tend to omit the intermediate step and model <math>P(Y|X)</math> directly. In higher dimensions we often assume the features are independent so that the model does not overfit.<br />
<br />
==Markov Models==<br />
Markov models, introduced by Andrey (Andrei) Andreyevich Markov as a way of modeling Russian poetry, are known as a good way of modeling those processes which progress over time or space. Basically, a Markov model can be formulated as follows:<br />
<br />
<center><math><br />
y_t=f(y_{t-1},y_{t-2},\ldots,y_{t-k})<br />
</math></center><br />
<br />
which says that the current state of a variable depends on its last <math>k</math> states. (Fig. XX)<br />
<br />
The Maximum Entropy Markov model is a type of Markov model which makes the current state of a variable depend on some global variables, in addition to the local dependencies. As an example, we can treat the sequence of words in a text as a local variable, since the appearance of each word depends mostly on the words that come before it (n-grams). However, the role of POS (part-of-speech) tags cannot be denied, as they clearly affect the sequence of words. In this example, the POS tags are global dependencies, whereas the preceding words are local ones.<br />
===Markov Chain===<br />
The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time. In this context, the Markov property states that the distribution of this variable depends only on the state at the previous time step.<br />
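A minimal Python sketch of sampling from such a chain; the state names and transition probabilities below are made up for illustration:<br />

```python
import random

random.seed(0)

# First-order Markov chain: the distribution of the next state depends
# only on the current state. A[s] gives P(next state | current state s).
states = ["sunny", "rainy"]
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def sample_chain(start, steps):
    seq = [start]
    for _ in range(steps):
        cur = seq[-1]
        r = random.random()
        cum = 0.0
        for s, p in A[cur].items():   # inverse-CDF sampling over next states
            cum += p
            if r < cum:
                seq.append(s)
                break
    return seq

print(sample_chain("sunny", 5))
```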
<br />
==Hidden Markov Models (HMM)==<br />
Markov models fail to address scenarios in which the states themselves cannot be observed, and only probabilistic functions of those hidden states are visible. Hidden Markov models extend Markov models to these scenarios: each observation is a probabilistic function of a hidden state. An example of an HMM is the formation of a DNA sequence, where a hidden process generates amino acids according to some probabilities to determine an exact sequence. The main questions that can be answered with an HMM are the following:<br />
<br />
* How can one estimate the probability of occurrence of an observation sequence?<br />
* How can we choose the state sequence such that the joint probability of the observation sequence is maximized?<br />
* How can we describe an observation sequence through the model parameters?<br />
<br />
[[File:HMMorder1.png|thumb|right|Fig.37 Hidden Markov model of order 1.]]<br />
<br />
An example of an HMM of order 1 is displayed in Figure 37. The most common examples are in the study of gene analysis or gene sequencing, and the joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4,y_5) = P(y_1)P(y_2|y_1)P(y_3|y_2)P(y_4|y_3)P(y_5|y_4). </math></center><br />
<br />
[[File:HMMorder2.png|thumb|right|Fig.38 Hidden Markov model of order 2.]]<br />
<br />
An HMM of order 2 is displayed in Figure 38. The joint probability is given by<br />
<center><math> P(y_1,y_2,y_3,y_4) = P(y_1,y_2)P(y_3|y_1,y_2)P(y_4|y_2,y_3). </math></center><br />
<br />
In a Hidden Markov Model (HMM) we consider that we have two levels of random variables. The first level is called the hidden layer because the random variables in that level cannot be observed. The second layer is the observed or output layer. We can sample from the output layer but not the hidden layer. The only information we know about the hidden layer is that it affects the output layer. The HMM model can be graphed as shown in Figure 39. <br />
<br />
P.S. The latent variables in Figure 39 are discrete, since we are discussing HMMs. When such variables are continuous, the analogous model is factor analysis rather than an HMM.<br />
[[File:HMM.png|thumb|right|Fig.39 Hidden Markov Model]]<br />
<br />
In the model the <math>q_i</math>s are the hidden layer and the <math>y_i</math>s are the output layer. The <math>y_i</math>s are shaded because they have been observed. The parameters that need to be estimated are <math> \theta = (\pi, A, \eta)</math>, where <math>\pi</math> is the initial state distribution: <math>\pi_i</math> is the probability that <math>q_0</math> is in state <math>i</math>. The matrix <math>A</math> is the transition matrix for the states <math>q_t</math> and <math>q_{t+1}</math>, and gives the probability of changing states as we move from one step to the next. Finally, <math>\eta</math> holds the emission probabilities: it determines the probability that <math>y_t</math> takes the value <math>y^*</math> given that <math>q_t</math> is in state <math>q^*</math>. <br /><br />
For the HMM our data comes from the output layer: <br />
<center><math> Data = (y_{0i}, y_{1i}, y_{2i}, ... , y_{Ti}) \text{ for } i = 1...n </math></center><br />
We can now write the joint pdf as:<br />
<center><math> P(q, y) = p(q_0)\prod_{t=0}^{T-1}P(q_{t+1}|q_t)\prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
We can use <math>a_{ij}</math> to represent the i,j entry in the matrix A. We can then define:<br />
<center><math> P(q_{t+1}|q_t) = \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} </math></center><br />
We can also define:<br />
<center><math> p(q_0) = \prod_{i=1}^M (\pi_i)^{q_0^i} </math></center><br />
Now, if we take Y to be multinomial we get:<br />
<center><math> P(y_t|q_t) = \prod_{i,j=1}^M (\eta_{ij})^{y_t^i q_t^j} </math></center><br />
The random variable Y does not have to be multinomial, this is just an example. We can combine the first two of these definitions back into the joint pdf to produce:<br />
<center><math> P(q, y) = \prod_{i=1}^M (\pi_i)^{q_0^i}\prod_{t=0}^{T-1} \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T}P(y_t|q_t) </math></center><br />
We can go on to the E-Step with this new joint pdf. In the E-Step we need to find the expectation of the missing data given the observed data and the initial values of the parameters. Suppose that we only sample once so <math>n=1</math>. Take the log of our pdf and we get:<br />
<center><math> l_c(\theta, q, y) = \sum_{i=1}^M {q_0^i}log(\pi_i)+\sum_{t=0}^{T-1} \sum_{i,j=1}^M {q_t^i q_{t+1}^j} log(a_{ij}) + \sum_{t=0}^{T}log(P(y_t|q_t)) </math></center><br />
Then we take the expectation for the E-Step:<br />
<center><math> E[l_c(\theta, q, y)] = \sum_{i=1}^M E[q_0^i]log(\pi_i)+\sum_{t=0}^{T-1} \sum_{i,j=1}^M E[q_t^i q_{t+1}^j] log(a_{ij}) + \sum_{t=0}^{T}E[log(P(y_t|q_t))] </math></center><br />
If we continue with our multinomial example then we would get:<br />
<center><math> \sum_{t=0}^{T}E[log(P(y_t|q_t))] = \sum_{t=0}^{T}\sum_{i,j=1}^M E[q_t^j] y_t^i log(\eta_{ij}) </math></center><br />
So now we need to calculate <math>E[q_0^i]</math> and <math> E[q_t^i q_{t+1}^j] </math> in order to find the expectation of the log likelihood. Let's define some variables to represent each of these quantities. <br />
Let <math> \gamma_0^i = E[q_0^i] = P(q_0^i=1|y, \theta^{(t)}) </math>. <br /><br />
Let <math> \xi_{t,t+1}^{ij} = E[q_t^i q_{t+1}^j] = P(q_t^i=1, q_{t+1}^j=1|y, \theta^{(t)}) </math> .<br />
We can use the sum-product algorithm (in this chain-structured case, the forward-backward algorithm) to calculate the above quantities.<br />
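As an illustration, here is a minimal Python sketch of the forward-backward recursions (the sum-product algorithm specialized to a chain) that produce the required posteriors; the parameter values and observation sequence are invented toy numbers, not from the text:<br />

```python
# Forward-backward on a 2-state HMM, yielding gamma_t^i = P(q_t = i | y).
pi = [0.6, 0.4]                       # initial state distribution
A = [[0.7, 0.3], [0.2, 0.8]]          # A[i][j] = P(q_{t+1}=j | q_t=i)
eta = [[0.9, 0.1], [0.3, 0.7]]        # eta[i][k] = P(y_t=k | q_t=i)
y = [0, 0, 1, 1]                      # observed symbols

M, T = len(pi), len(y)

# Forward pass: alpha[t][i] = P(y_0..y_t, q_t = i)
alpha = [[0.0] * M for _ in range(T)]
for i in range(M):
    alpha[0][i] = pi[i] * eta[i][y[0]]
for t in range(1, T):
    for j in range(M):
        alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(M)) * eta[j][y[t]]

# Backward pass: beta[t][i] = P(y_{t+1}..y_{T-1} | q_t = i)
beta = [[1.0] * M for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(M):
        beta[t][i] = sum(A[i][j] * eta[j][y[t+1]] * beta[t+1][j] for j in range(M))

evidence = sum(alpha[T-1][i] for i in range(M))      # P(y)
gamma = [[alpha[t][i] * beta[t][i] / evidence for i in range(M)]
         for t in range(T)]
print(gamma[0])  # posterior over the first hidden state
```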
<br />
==Graph Structure==<br />
Up to this point we have covered many topics about graphical models, assuming that the graph structure is given. However, finding an optimal structure for a graphical model is a challenging problem in itself. In this section, we assume that the graphical model we are looking for is expressible in the form of a tree. To remind ourselves of the concept: an undirected graph is a tree if there is one and only one path between each pair of nodes. For directed graphs, on top of the above condition, we also need to check that every node has at most one parent - in other words, that there are no explaining-away structures.<br />
<br />
First, let us show that it does not affect the joint distribution whether a graph is directed or undirected, as long as it is a tree. Here is how one can write down the joint distribution of the graph of Fig. XX.<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2).\,\!<br />
</math></center><br />
<br />
Now, if we change the direction of the connecting edge between <math>x_1</math> and <math>x_2</math>, we will have the graph of Fig. XX and the corresponding joint distribution function will change as follows:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_2)p(x_1|x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which can be simply re-written as:<br />
<br />
<center><math><br />
p(x_1,x_2,x_3,x_4)=p(x_1,x_2)p(x_3|x_2)p(x_4|x_2),\,\!<br />
</math></center><br />
<br />
which is the same as the first function. We will rely on this simple observation and leave the proof to the enthusiastic reader.<br />
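This observation can be checked numerically. The sketch below builds the joint of the four-node tree from both factorizations (all probability tables are made-up binary distributions) and confirms they agree:<br />

```python
from itertools import product

p1 = [0.3, 0.7]                      # p(x1)
p2g1 = [[0.6, 0.4], [0.2, 0.8]]      # p(x2|x1)
p3g2 = [[0.5, 0.5], [0.9, 0.1]]      # p(x3|x2)
p4g2 = [[0.7, 0.3], [0.4, 0.6]]      # p(x4|x2)

# Marginal p(x2) and reversed conditional p(x1|x2) via Bayes' rule
p2 = [sum(p1[a] * p2g1[a][b] for a in range(2)) for b in range(2)]
p1g2 = [[p1[a] * p2g1[a][b] / p2[b] for a in range(2)] for b in range(2)]

for a, b, c, d in product(range(2), repeat=4):
    forward = p1[a] * p2g1[a][b] * p3g2[b][c] * p4g2[b][d]
    reversed_ = p2[b] * p1g2[b][a] * p3g2[b][c] * p4g2[b][d]
    assert abs(forward - reversed_) < 1e-12
print("both factorizations agree")  # prints: both factorizations agree
```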
<br />
===Maximum Likelihood Tree===<br />
We want to compute the tree that maximizes the likelihood of a given set of data. Optimality of a tree structure can be discussed in terms of the likelihood of the set of variables. We can define a fully connected weighted graph, set each edge weight to a measure of the dependence between the connecting nodes/random variables, and then run a maximum-weight spanning tree algorithm. Here is how it works.<br />
<br />
We have defined the joint distribution as follows: <br />
<center><math><br />
p(x)=\prod_{i\in V}p(x_i)\prod_{i,j\in E}\frac{p(x_i,x_j)}{p(x_i)p(x_j)}<br />
</math></center><br />
where <math>V</math> and <math>E</math> are respectively the sets of vertices and edges of the corresponding graph. This factorization holds as long as the graphical model is a tree; the direction of the dependence between <math>x_i</math> and <math>x_j</math> can be chosen arbitrarily, which is not the case for non-tree graphical models.<br />
<br />
Maximizing the joint probability distribution over the given set of data samples <math>X</math> for parameter estimation (MLE), we have:<br />
<center><math><br />
L(\theta|X):p(X|\theta)=\prod_{i\in V}p(x_i|\theta)\prod_{i,j\in E}\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
And by taking the logarithm of <math>L(\theta|X)</math> (log-likelihood), we will get:<br />
<br />
<center><math><br />
l=\sum_{i\in V}\log p(x_i|\theta)+\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
The first term in the above equation does not convey anything about the topology or structure of the tree, as it is defined over single nodes. As far as optimization of the tree structure is concerned, the probabilities of the single nodes play no role, so we can define the cost function for our optimization problem as:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\log\frac{p(x_i,x_j|\theta)}{p(x_i|\theta)p(x_j|\theta)}<br />
</math></center><br />
<br />
where the subscript r stands for reduced. Replacing the probability functions with the frequencies of occurrence of each state, we get:<br />
<br />
<center><math><br />
l_r=\sum_{i,j\in E}\sum_{s,t}N_{ijst}\log\frac{N\,N_{ijst}}{N_{is}N_{jt}}<br />
</math></center><br />
<br />
where we have used <math>p(x_i=s,x_j=t)=\frac{N_{ijst}}{N}</math>, <math>p(x_i=s)=\frac{N_{is}}{N}</math>, and <math>p(x_j=t)=\frac{N_{jt}}{N}</math>. Up to the constant factor <math>N</math>, the inner sum is the empirical mutual information of the two random variables <math>x_i</math> and <math>x_j</math>, where the former is in state <math>s</math> and the latter in state <math>t</math>.<br />
<br />
This tells us how to define the weights for the edges of a fully connected graph: use the mutual information between the endpoint variables. It then remains to run a maximum-weight spanning tree algorithm on the resulting graph to find the optimal tree structure (this is the Chow-Liu algorithm).<br />
It is important to note that the spanning tree problem had been solved in graph theory before graphical models were developed. Our problem here was entirely probabilistic, but using graphical models we could map it to an equivalent graph theory problem. This shows how graphical models let us apply powerful graph-theoretic tools to probabilistic problems.<br />
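The whole procedure can be sketched in a few lines of Python: compute empirical mutual information for every pair of variables, then run a maximum-weight spanning tree (here Kruskal's algorithm with a simple union-find). The binary data set below is made up for illustration:<br />

```python
import math
from collections import Counter
from itertools import combinations

# Toy data: rows are joint samples of three binary variables.
# Variables 0 and 1 are identical, so the tree should link them.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
n_vars = 3
N = len(data)

def mutual_information(i, j):
    joint = Counter((row[i], row[j]) for row in data)
    pi_ = Counter(row[i] for row in data)
    pj_ = Counter(row[j] for row in data)
    mi = 0.0
    for (s, t), n in joint.items():
        mi += (n / N) * math.log((n / N) / ((pi_[s] / N) * (pj_[t] / N)))
    return mi

# Kruskal's algorithm on edges sorted by decreasing mutual information.
edges = sorted(((mutual_information(i, j), i, j)
                for i, j in combinations(range(n_vars), 2)), reverse=True)
parent = list(range(n_vars))
def find(x):
    while parent[x] != x:
        x = parent[x]
    return x
tree = []
for w, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:            # adding this edge creates no cycle
        parent[ri] = rj
        tree.append((i, j))
print(tree)
```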
<br />
==Latent Variable Models==<br />
(beginning of Oct. 20) Assuming that we have thoroughly observed, or even identified, all of the random variables of a model can be a very naive assumption, as one can think of many contrary instances. To make a model as rich as possible - there is always a trade-off between richness and complexity, so we do not want to inject unnecessary complexity into our model either - the concept of latent variables has been introduced to graphical models.<br />
<br />
First let's define latent variables. Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.<br />
<br />
Depending on the position of an unobserved variable <math>z</math>, we take different actions. If no variable is conditioned on <math>z</math>, we can integrate/sum it out and it will never be noticed, as it is neither an evidence variable nor a query. However, we will need to model an unobserved variable like <math>z</math> if other variables depend on it.<br />
<br />
The use of latent variables makes a model harder to analyze and to learn. Taking the log-likelihood normally makes the objective easier to work with, as the log of a product becomes a sum of logs; but this is no longer the case once latent variables are introduced, because the resulting likelihood contains a sum inside the log, which blocks the log from distributing over the product.<br />
<br />
<center><math><br />
l(\theta,D) = \log\sum_{z}p(x,z|\theta).\,<br />
</math></center><br />
<br />
As an example of latent variables, one can think of a mixture density model. Different component models come together to build the final model, but it takes one more random variable to say which of those components generated each new sample point. This affects both the learning and recall phases.<br />
<br />
== EM Algorithm ==<br />
Oct. 25th<br />
=== Introduction ===<br />
In the last section, graphical models with latent variables were discussed. It was mentioned that, for example, if fitting typical distributions to a data set is too complex, one may model the data set using a mixture of well-known distributions such as Gaussians. A hidden variable is then needed to determine the weight of each Gaussian component. Parameter learning in graphical models with latent variables is more complicated than in fully observed models.<br />
<br />
Consider Fig. 40, which depicts a simple graphical model with two nodes. By convention, the unobserved variable <math> Z </math> is unshaded. To compare the complexity of fully observed models with that of models with hidden variables, let us first suppose variables <math> Z </math> and <math> X </math> are both observed. We may interpret this problem as a classification problem where <math> Z </math> is the class label and <math> X </math> is the data. In addition, we assume the distribution over the members of each group is Gaussian. Thus, the learning process is to determine the label <math> Z </math> from the training set by maximizing the posterior: <br />
<br />
[[File:GMwithLatent.png|thumb|right|Fig.40 A simple graphical model with a latent variable.]]<br />
<br />
<center><math><br />
P(z|x) = \frac{P(x|z)P(z)}{P(x)},<br />
</math></center> <br />
<br />
For simplicity, we assume there are two classes generating the data set <math> X</math>, <math> Z = 1 </math> and <math> Z = 0 </math>. One can then easily find the posterior <math> P(z=1|x) </math> using:<br />
<br />
<center><math><br />
P(z = 1|x) = \frac{N(x; \mu_1, \sigma_1)\pi_1}{N(x; \mu_1, \sigma_1)\pi_1 + N(x; \mu_0, \sigma_0)\pi_0},<br />
</math></center> <br />
<br />
On the contrary, if <math> Z </math> is unknown we cannot easily write down the posterior, and consequently parameter estimation is more difficult. For graphical models with latent variables, we first assume the latent variable is somehow known, so that writing the posterior becomes easy; then we iteratively make the estimate of <math> Z </math> more accurate. For instance, if the task is to fit a set of data derived from unknown sources with a mixture of Gaussian distributions, we may assume the data is derived from two sources whose distributions are Gaussian. The first estimate might not be accurate, but we introduce an algorithm by which the estimate becomes more accurate through an iterative approach. In this section we see how parameter learning for these graphical models is performed using the EM algorithm.<br />
<br />
=== EM Method ===<br />
<br />
The EM (Expectation-Maximization) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. Consider a probabilistic model in which we collectively denote all of the observed variables by <math>X</math> and all of the hidden variables by <math>Z</math>, resulting in a simple graphical model with two nodes (Fig. 40). The joint distribution<br />
<math> p(X,Z|\theta) </math> is governed by a set of parameters <math>\theta</math>. The task is to maximize the likelihood function that is given by:<br />
<br />
<center><math><br />
l_c(\theta; x,z) = log P(x,z | \theta)<br />
</math></center> <br />
<br />
<br />
which is called the "complete log likelihood". In the above equation the <math>x</math> values represent the observed data as before, and the <math>z</math> values represent the missing (sometimes called latent) data. Now the question is how we can estimate the parameters <math>\theta</math> if we do not have all the data we need. We can use the Expectation-Maximization (EM) algorithm to estimate the parameters of the model even though we do not have a complete data set. <br />
To simplify the problem we define the following type of likelihood:<br />
<br />
<center><math><br />
l(\theta; x) = log(P(x | \theta))<br />
</math></center> <br />
<br />
which is called the "incomplete log likelihood". We can rewrite the incomplete log likelihood in terms of the complete log likelihood. The equation below is for the discrete case; to obtain the continuous case, simply replace the summation with an integral. <br />
<center><math> l(\theta; x) = log(P(x | \theta)) = log(\sum_zP(x, z|\theta)) </math></center><br />
Since <math>z</math> has not been observed, <math>l_c</math> is in fact a random quantity. In that case we can define the expectation of <math>l_c</math> in terms of some arbitrary density function <math>q(z|x)</math>. <br />
<br />
<center><math> l(\theta;x) = log P(x|\theta) = log \sum_z P(x,z|\theta) = log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} </math></center><br />
<br />
====Jensen's Inequality====<br />
In order to properly derive the formula for the EM algorithm we need to first introduce the following theorem. <br />
<br />
For any '''concave''' function f: <br />
<center><math> f(\alpha x_1 + (1-\alpha)x_2) \geqslant \alpha f(x_1) + (1-\alpha)f(x_2) </math></center> <br />
This can be shown intuitively through a graph. In Fig. 41, point A is the point on the function f and point B is the value represented by the right side of the inequality. On the graph one can see why point A is larger than point B for a concave function: the chord between two points on the curve lies below the curve. <br />
<br />
[[File:inequality.png|thumb|right|Fig.41 Jensen's Inequality]]<br />
<br />
For us it is important that the log function is '''concave''' , and thus:<br />
<br />
<center><math><br />
log \sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)} \geqslant \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} = F(\theta, q) <br />
</math></center><br />
<br />
The function <math> F (\theta, q) </math> is called the auxiliary function and it is used in the EM algorithm. As seen in above equation <math> F(\theta, q) </math> is the lower bound of the incomplete log likelihood and one way to maximize the incomplete likelihood is to increase its lower bound. For the EM algorithm we have two steps repeating one after the other to give better estimation for <math>q(z|x)</math> and <math>\theta</math>. As the steps are repeated the parameters converge to a local maximum in the likelihood function. <br />
<br />
In the first step we assume <math> \theta </math> is known, and the goal is to find <math> q </math> to maximize the lower bound. Second, suppose <math> q </math> is known and find <math> \theta </math>. In other words:<br />
<br />
'''E-Step''' <br />
<center><math> q^{t+1} = argmax_{q} F(\theta^t, q) </math></center><br />
<br />
'''M-Step'''<br />
<center><math> \theta^{t+1} = argmax_{\theta} F(\theta, q^{t+1}) </math></center><br />
<br />
==== M-Step Explanation ====<br />
<br />
<center><math>\begin{matrix}<br />
F(q;\theta) & = & \sum_z q(z|x) log \frac{P(x,z|\theta)}{q(z|x)} \\<br />
& = & \sum_z q(z|x)log(P(x,z|\theta)) - \sum_z q(z|x)log(q(z|x))\\<br />
\end{matrix}</math></center><br />
<br />
Since the second part of the equation is only a constant with respect to <math>\theta</math>, in the M-step we only need to maximize the expectation of the COMPLETE likelihood. The complete likelihood is the only part that still depends on <math>\theta</math>.<br />
<br />
==== E-Step Explanation ====<br />
<br />
In this step we are trying to find an estimate for <math>q(z|x)</math>. To do this we have to maximize <math> F(q;\theta^{(t)})</math>.<br />
<center><math><br />
F(q;\theta^{(t)}) = \sum_z q(z|x) log(\frac{P(x,z|\theta^{(t)})}{q(z|x)}) <br />
</math></center><br />
<br />
'''Claim:''' It can be shown that to maximize the auxiliary function one should set <math>q(z|x)</math> to <math> p(z|x,\theta^{(t)})</math>. Replacing <math>q(z|x)</math> with <math>P(z|x,\theta^{(t)})</math> results in:<br />
<center><math>\begin{matrix}<br />
F(q;\theta^{(t)}) & = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(x,z|\theta^{(t)})}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(z|x,\theta^{(t)})P(x|\theta^{(t)})}{P(z|x,\theta^{(t)})}) \\<br />
& = & \sum_z P(z|x,\theta^{(t)}) log(P(x|\theta^{(t)})) \\<br />
& = & log(P(x|\theta^{(t)})) \\<br />
& = & l(\theta^{(t)}; x)<br />
\end{matrix}</math></center><br />
<br />
Recall that <math>F(q;\theta^{(t)})</math> is a lower bound on <math> l(\theta; x) </math>; since the choice <math>q(z|x)=P(z|x,\theta^{(t)})</math> makes the bound tight, it is in fact the maximizer of <math>F(q;\theta^{(t)})</math>. Note that the E-step has this closed form in every iteration, but it must be recomputed with the current <math>\theta^{(t)}</math> each time, since the posterior changes as <math>\theta</math> is updated in the M-step. <br />
<br />
The EM algorithm is a two-stage iterative optimization technique for finding<br />
maximum likelihood solutions. Suppose that the current value of the parameter vector is <math> \theta^t </math>. In the E step, the<br />
lower bound <math> F(q, \theta^t) </math> is maximized with respect to <math> q(z|x) </math> while <math> \theta^t </math> is fixed.<br />
As was mentioned above, the solution to this maximization problem is to set <math> q(z|x) </math> to <math> p(z|x,\theta^t) </math>: the incomplete log likelihood <math> log\, p(X|\theta^t) </math> does not depend on <math> q(z|x) </math>, and this choice makes the lower bound reach it, which is the largest value <math> F(q, \theta^t) </math> can attain. In this case the lower bound equals the incomplete log likelihood.<br />
<br />
=== Alternative steps for the EM algorithms ===<br />
From the above results we can find an alternative representation of the EM algorithm, reducing it to: <br />
<br />
'''E-Step''' <br /><br />
Compute <math> E[l_c(\theta; x, z)]_{P(z|x, \theta^{(t)})} </math>. <br /><br />
'''M-Step''' <br /><br />
Maximise <math> E[l_c(\theta; x, z)]_{P(z|x, \theta)} </math> with respect to <math>\theta</math>. <br />
<br />
The EM Algorithm is probably best understood through examples.<br />
<br />
====EM Algorithm Example====<br />
<br />
Suppose we have the two independent and identically distributed random variables:<br />
<center><math> Y_1, Y_2 \sim P(y|\theta) = \theta e^{-\theta y} </math></center><br />
In our case <math>y_1 = 5</math> has been observed but <math>y_2 = ?</math> has not. Our task is to find an estimate for <math>\theta</math>. We will first try to solve the problem without the EM algorithm; luckily this problem is simple enough to be solvable directly. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta} \\<br />
l(\theta; Data) & = & log(\theta)- 5\theta<br />
\end{matrix}</math></center><br />
We take our derivative:<br />
<center><math>\begin{matrix}<br />
& \frac{dl}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{1}{\theta}-5 & = 0 \\<br />
\Rightarrow & \theta & = 0.2<br />
\end{matrix}</math></center><br />
And now we can try the same problem with the EM Algorithm. <br />
<center><math>\begin{matrix}<br />
L(\theta; Data) & = & \theta e^{-5\theta}\theta e^{-y_2\theta} \\<br />
l(\theta; Data) & = & 2log(\theta) - 5\theta - y_2\theta<br />
\end{matrix}</math></center><br />
E-Step <br />
<center><math> E[l_c(\theta; Data)]_{P(y_2|y_1, \theta)} = 2log(\theta) - 5\theta - \frac{\theta}{\theta^{(t)}}</math></center><br />
M-Step<br />
<center><math>\begin{matrix}<br />
& \frac{dl_c}{d\theta} & = 0 \\<br />
\Rightarrow & \frac{2}{\theta}-5 - \frac{1}{\theta^{(t)}} & = 0 \\<br />
\Rightarrow & \theta^{(t+1)} & = \frac{2\theta^{(t)}}{5\theta^{(t)}+1}<br />
\end{matrix}</math></center><br />
Now we pick an initial value for <math>\theta</math>. Usually we want to pick something reasonable. In this case it does not matter that much and we can pick <math>\theta = 10</math>. Now we repeat the M-Step until the value converges.<br />
<center><math>\begin{matrix}<br />
\theta^{(1)} & = & 10 \\<br />
\theta^{(2)} & = & 0.392 \\<br />
\theta^{(3)} & = & 0.2648 \\<br />
... & & \\<br />
\theta^{(k)} & \simeq & 0.2 <br />
\end{matrix}</math></center><br />
And as we can see after a number of steps the value converges to the correct answer of 0.2. In the next section we will discuss a more complex model where it would be difficult to solve the problem without the EM Algorithm.<br />
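The iteration above can be reproduced in a few lines of Python; the update rule and starting value are exactly those derived in the text:<br />

```python
# M-step update theta^{(t+1)} = 2*theta^{(t)} / (5*theta^{(t)} + 1),
# iterated from the initial guess theta = 10.
theta = 10.0
for _ in range(50):
    theta = 2 * theta / (5 * theta + 1)
print(round(theta, 4))  # prints 0.2, matching the direct MLE
```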
<br />
===Mixture Models===<br />
In this section we discuss what will happen if the random variables are not identically distributed. The data will now sometimes be sampled from one distribution and sometimes from another. <br />
<br />
====Mixture of Gaussian ====<br />
<br />
Given <math>P(x|\theta) = \alpha N(x;\mu_1,\sigma_1) + (1-\alpha)N(x;\mu_2,\sigma_2)</math>. We sample the data, <math>Data = \{x_1,x_2...x_n\} </math> and we know that <math>x_1,x_2...x_n</math> are iid. from <math>P(x|\theta)</math>.<br /><br />
We would like to find:<br />
<center><math>\theta = \{\alpha,\mu_1,\sigma_1,\mu_2,\sigma_2\} </math></center><br />
<br />
We have no missing data here so we can try to find the parameter estimates using the ML method. <br />
<center><math> L(\theta; Data) = \prod_{i=1}^n \left(\alpha N(x_i; \mu_1, \sigma_1) + (1 - \alpha) N(x_i; \mu_2, \sigma_2)\right) </math></center><br />
We then take the log to find <math>l(\theta; Data)</math>, take the derivative with respect to each parameter, and set each derivative equal to zero. That is a lot of work, because the sum inside the log is not pleasant to differentiate and we have 5 parameters. <br />
It is actually easier to apply the EM algorithm. The only thing is that the EM algorithm works with missing data and here we have all of our data. The solution is to introduce a latent variable z. We are basically introducing missing data to make the calculation easier to compute. <br />
<center><math> z_i = 1 \text{ with prob. } \alpha </math></center><br />
<center><math> z_i = 0 \text{ with prob. } (1-\alpha) </math></center><br />
Now we have a data set that includes our latent variable <math>z_i</math>:<br />
<center><math> Data = \{(x_1,z_1),(x_2,z_2)...(x_n,z_n)\} </math></center><br />
We can calculate the joint pdf by: <br />
<center><math> P(x_i,z_i|\theta)=P(x_i|z_i,\theta)P(z_i|\theta) </math></center><br />
Let<br />
<center><math> P(x_i|z_i,\theta)=\begin{cases} \phi_1(x_i)=N(x;\mu_1,\sigma_1) & \text{if } z_i = 1 \\ \phi_2(x_i)=N(x;\mu_2,\sigma_2) & \text{if } z_i = 0 \end{cases} </math></center><br />
Now we can write <br />
<center><math> P(x_i|z_i,\theta)=\phi_1(x_i)^{z_i} \phi_2(x_i)^{1-z_i} </math></center><br />
and <br />
<center><math> P(z_i)=\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
We can write the joint pdf as:<br />
<center><math> P(x_i,z_i|\theta)=\phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
From the joint pdf we can get the likelihood function as: <br />
<center><math> L(\theta;D)=\prod_{i=1}^n \phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i} </math></center><br />
Then take the log and find the log likelihood:<br />
<center><math> l_c(\theta;D)=\sum_{i=1}^n z_i log\phi_1(x_i) + (1-z_i)log\phi_2(x_i) + z_ilog\alpha + (1-z_i)log(1-\alpha) </math></center><br />
In the E-step we need to find the expectation of <math>l_c</math><br />
<center><math> E[l_c(\theta;D)] = \sum_{i=1}^n E[z_i]log\phi_1(x_i)+(1-E[z_i])log\phi_2(x_i)+E[z_i]log\alpha+(1-E[z_i])log(1-\alpha) </math></center><br />
For now we can assume that <math><z_i></math> (the expectation of <math>z_i</math>) is known and assign it a value: let <math> <z_i>=w_i</math>.<br />
In the M-step, we update the parameters while holding the expectations fixed:<br />
<center><math> \theta^{(t+1)} \leftarrow argmax_{\theta}\, E[l_c(\theta;D)] </math></center><br />
Taking partial derivatives of the expected complete log likelihood with respect to the parameters and setting them equal to zero, we get our estimated parameters at step (t+1).<br />
<center><math>\begin{matrix}<br />
\frac{d}{d\alpha} = 0 \Rightarrow & \sum_{i=1}^n \frac{w_i}{\alpha}-\frac{1-w_i}{1-\alpha} = 0 & \Rightarrow \alpha=\frac{\sum_{i=1}^n w_i}{n} \\<br />
\frac{d}{d\mu_1} = 0 \Rightarrow & \sum_{i=1}^n w_i(x_i-\mu_1)=0 & \Rightarrow \mu_1=\frac{\sum_{i=1}^n w_ix_i}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\mu_2}=0 \Rightarrow & \sum_{i=1}^n (1-w_i)(x_i-\mu_2)=0 & \Rightarrow \mu_2=\frac{\sum_{i=1}^n (1-w_i)x_i}{\sum_{i=1}^n (1-w_i)} \\<br />
\frac{d}{d\sigma_1^2} = 0 \Rightarrow & \sum_{i=1}^n w_i(-\frac{1}{2\sigma_1^{2}}+\frac{(x_i-\mu_1)^2}{2\sigma_1^4})=0 & \Rightarrow \sigma_1^2=\frac{\sum_{i=1}^n w_i(x_i-\mu_1)^2}{\sum_{i=1}^n w_i} \\<br />
\frac{d}{d\sigma_2^2} = 0 \Rightarrow & \sum_{i=1}^n (1-w_i)(-\frac{1}{2\sigma_2^{2}}+\frac{(x_i-\mu_2)^2}{2\sigma_2^4})=0 & \Rightarrow \sigma_2^2=\frac{\sum_{i=1}^n (1-w_i)(x_i-\mu_2)^2}{\sum_{i=1}^n (1-w_i)}<br />
\end{matrix}</math></center><br />
We can verify that the results of the estimated parameters all make sense by considering what we know about the ML estimates from the standard Gaussian. But we are not done yet. We still need to compute <math><z_i>=w_i</math> in the E-step. <br />
<center><math>\begin{matrix}<br />
<z_i> & = & E_{z_i|x_i,\theta^{(t)}}(z_i) \\<br />
& = & \sum_z z_i P(z_i|x_i,\theta^{(t)}) \\<br />
& = & 1\times P(z_i=1|x_i,\theta^{(t)}) + 0\times P(z_i=0|x_i,\theta^{(t)}) \\<br />
& = & P(z_i=1|x_i,\theta^{(t)}) \\<br />
P(z_i=1|x_i,\theta^{(t)}) & = & \frac{P(z_i=1,x_i|\theta^{(t)})}{P(x_i|\theta^{(t)})} \\<br />
& = & \frac {P(z_i=1,x_i|\theta^{(t)})}{P(z_i=1,x_i|\theta^{(t)}) + P(z_i=0,x_i|\theta^{(t)})} \\<br />
& = & \frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})}<br />
\end{matrix}</math></center><br />
We can now combine the two steps and we get the expectation <br />
<center><math>E[z_i] =\frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})} </math></center><br />
Using the above M-step formulas together with the E-step, we can evaluate the parameters at iterations <math>(t+1), (t+2), \dots</math> until they converge, and we obtain our estimated value for each of the parameters.<br />
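The alternating E- and M-steps above can be sketched in code. The following is a minimal illustration (not part of the original notes) for a two-component one-dimensional Gaussian mixture; the synthetic data and all initial values are arbitrary choices, and <code>s1</code>, <code>s2</code> denote the variances <math>\sigma_1^2</math>, <math>\sigma_2^2</math> solved for in the M-step.<br />

```python
import math
import random

def em_two_gaussians(x, alpha, mu1, mu2, s1, s2, iters=200):
    """EM for the mixture alpha*N(mu1, s1) + (1-alpha)*N(mu2, s2), s = variance."""
    n = len(x)
    for _ in range(iters):
        # E-step: w_i = E[z_i] = posterior probability that x_i came from component 1
        w = []
        for xi in x:
            p1 = alpha * math.exp(-(xi - mu1) ** 2 / (2 * s1)) / math.sqrt(2 * math.pi * s1)
            p2 = (1 - alpha) * math.exp(-(xi - mu2) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
            w.append(p1 / (p1 + p2))
        # M-step: the closed-form updates derived above
        sw = sum(w)
        alpha = sw / n
        mu1 = sum(wi * xi for wi, xi in zip(w, x)) / sw
        mu2 = sum((1 - wi) * xi for wi, xi in zip(w, x)) / (n - sw)
        s1 = sum(wi * (xi - mu1) ** 2 for wi, xi in zip(w, x)) / sw
        s2 = sum((1 - wi) * (xi - mu2) ** 2 for wi, xi in zip(w, x)) / (n - sw)
    return alpha, mu1, mu2, s1, s2

# Synthetic data: half from N(0,1), half from N(5,1)
random.seed(0)
data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(5, 1) for _ in range(300)]
alpha, mu1, mu2, s1, s2 = em_two_gaussians(data, 0.5, -1.0, 6.0, 1.0, 1.0)
```

Run on this data, the estimates should settle near the true values <math>\alpha=0.5</math>, <math>\mu_1=0</math>, <math>\mu_2=5</math>, <math>\sigma_1^2=\sigma_2^2=1</math>.<br />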
<br />
<br />
The mixture model can be summarized as:<br />
<br />
* In each step, a state will be selected according to <math>p(z)</math>. <br />
* Given a state, a data vector is drawn from <math>p(x|z)</math>.<br />
* The value of each state is independent of the previous state.<br />
<br />
A good example of a mixture model can be seen in this example with two coins. Assume that there are two different coins that are not fair. Suppose that the probabilities for each coin are as shown in the table. <br /><br />
<center><math>\begin{matrix}<br />
& H & T \\<br />
\hbox{coin 1} & 0.3 & 0.7 \\<br />
\hbox{coin 2} & 0.1 & 0.9<br />
\end{matrix}</math></center><br /><br />
We can choose one coin at random and toss it in the air to see the outcome. Then we place the coin back in the pocket with the other one and once again select one coin at random to toss. The resulting sequence of outcomes, e.g. HHTH...HTTHT, follows a mixture model. In this model the probability of heads depends on which coin was used for the toss and on the probability with which we select each coin. For example, if we were to select coin 1 most of the time then we would see more heads than if we were to choose coin 2 most of the time.<br />
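The coin story can be simulated directly. This short sketch (not in the original notes) uses the table above; the equal selection probability for the two coins is an assumption made for illustration.<br />

```python
import random

def sample_coin_mixture(n, p_coin1=0.5, seed=42):
    """Draw n tosses: select a coin according to p(z), toss it according to p(x|z), replace it."""
    rng = random.Random(seed)
    p_heads = {1: 0.3, 2: 0.1}  # P(H | coin), from the table above
    tosses = []
    for _ in range(n):
        coin = 1 if rng.random() < p_coin1 else 2  # latent state z
        tosses.append('H' if rng.random() < p_heads[coin] else 'T')
    return tosses

tosses = sample_coin_mixture(100000)
freq_h = tosses.count('H') / len(tosses)
# With equal selection probabilities, P(H) = 0.5*0.3 + 0.5*0.1 = 0.2,
# so freq_h should be close to 0.2; selecting coin 1 more often would raise it.
```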
<br />
[[File:dired.png|thumb|right|Fig.1 A directed graph.]]<br />
<br />
=Appendix: Graph Drawing Tools=<br />
===Graphviz===<br />
[http://www.graphviz.org/ Website]<br />
<br />
"Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. It has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and in visual interfaces for other technical domains."<br />
<ref>http://www.graphviz.org/</ref><br />
<br />
There is a wiki extension developed, called Wikitex, which makes it possible to make use of this package in wiki pages. [http://wikisophia.org/wiki/Wikitex#Graph Here] is an example.<br />
<br />
===AISee===<br />
[http://www.aisee.com/ Website]<br />
<br />
AISee is a commercial graph visualization software. The free trial version has almost all the features of the full version except that it should not be used for commercial purposes.<br />
<br />
===TikZ===<br />
[http://www.texample.net/tikz/ Website]<br />
<br />
"TikZ and PGF are TeX packages for creating graphics programmatically. TikZ is build on top of PGF and allows you to create sophisticated graphics in a rather intuitive and easy manner." <ref><br />
http://www.texample.net/tikz/<br />
</ref><br />
<br />
===Xfig===<br />
"Xfig" is an open source drawing software used to create objects of various geometry. It can be installed on both windows and unix based machines. <br />
[http://www.xfig.org/ Website]</div>
<hr />
<div>==[[f11stat946EditorSignUp| Editor Sign Up]]==<br />
==[[f11Stat946presentation| Sign up for your presentation]]==<br />
==[[f11Stat946ass| Assignments]]==<br />
==Introduction==<br />
===Motivation===<br />
Graphical probabilistic models provide a concise representation of various probabilistic distributions that are found in many<br />
real world applications. Some interesting areas include medical diagnosis, computer vision, language, analyzing gene expression <br />
data, etc. A problem related to medical diagnosis is "detecting and quantifying the causes of a disease". This question can<br />
be addressed through the graphical representation of relationships between various random variables (both observed and hidden).<br />
This is an efficient way of representing a joint probability distribution.<br />
<br />
Graphical models are excellent tools for reducing the computational load of probabilistic models. Suppose we want to model a binary image. If we have a 256 by 256 image, then our distribution function has <math>2^{256*256}=2^{65536}</math> outcomes. Even very simple tasks such as marginalizing such a probability distribution over some variables can be computationally intractable, and the load grows exponentially with the number of variables. In practice and in real world applications we generally have some kind of dependency or relation between the variables, and using such information can help us to simplify the calculations. For example, for the same problem, if all the image pixels can be assumed to be independent, marginalization can be done easily. A good tool to depict such relations is a graph. Using some rules we can represent a probability distribution uniquely by a graph, and then it is easier to study the graph instead of the probability distribution function (PDF). We can take advantage of graph theory tools to design algorithms. Though it may seem simple, this approach simplifies the computations and, as mentioned, helps us to solve problems in many different research areas.<br />
<br />
===Notation===<br />
<br />
We will begin with short section about the notation used in these notes.<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
===Example===<br />
Let <math>A = \{1,4\}</math>, so <math>X_A = \{X_1, X_4\}</math>; <math>A</math> is the set of indices for<br />
the r.v. <math>X_A</math>.<br /><br />
Also let <math>B = \{2\},\ X_B = \{X_2\}</math> so we can write<br />
<center><math>P( X_A | X_B ) = P( X_1 = x_1, X_4 = x_4 | X_2 = x_2 ).\,\!</math></center><br />
<br />
===Graphical Models===<br />
Graphical models provide a compact representation of the joint distribution, where the vertices (nodes) <math>V</math> represent random variables and the edges <math>E</math> represent dependencies between the variables. There are two forms of graphical models (directed and undirected graphical models). Directed graphical models (Figure 1) consist of arcs and nodes, where an arc indicates that the parent is an explanatory variable for the child. Undirected graphical models (Figure 2) are based on the assumption that two nodes or two sets of nodes are conditionally independent given their neighbours[http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html 1].<br />
<br />
Similar types of analysis predate the area of probabilistic graphical models and its terminology. "Bayesian network" and "belief network" are earlier terms used to describe a directed acyclic graphical model; similarly, "Markov random field" (MRF) and "Markov network" are earlier terms used to describe an undirected graphical model. Probabilistic graphical models unify some of the theory from these older frameworks and allow for more general distributions than were possible with the previous methods.<br />
<br />
[[File:directed.png|thumb|right|Fig.1 A directed graph.]]<br />
[[File:undirected.png|thumb|right|Fig.2 An undirected graph.]]<br />
<br />
We will use graphs in this course to represent the relationship between different random variables. <br />
<br />
====Directed graphical models (Bayesian networks)====<br />
<br />
In the case of directed graphs, the direction of the arrow indicates "causation". This assumption makes these networks useful for cases where we want to model causality, so these models are well suited to applications such as computational biology and bioinformatics, where we study the effect of some variables on another variable. For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>A\,\!</math> "causes" <math>B\,\!</math>.<br />
<br />
In this case we must assume that our directed graphs are ''acyclic''. An example of an acyclic graphical model from medicine is shown in Figure 2a.<br />
[[File:acyclicgraph.png|thumb|right|Fig.2a Sample acyclic directed graph.]]<br />
<br />
Exposure to ionizing radiation (such as CT scans, X-rays, etc.) and to environmental hazards might lead to gene mutations that eventually give rise to cancer. Figure 2a can be called a causation graph.<br />
<br />
If our causation graph contains a cycle then it would mean that for example:<br />
<br />
* <math>A</math> causes <math>B</math><br />
* <math>B</math> causes <math>C</math><br />
* <math>C</math> causes <math>A</math>, again. <br />
<br />
Clearly, this would confuse the order of the events. An example of a graph with a cycle can be seen in Figure 3. Such a graph could not be used to represent causation. The graph in Figure 4 does not have cycle and we can say that the node <math>X_1</math> causes, or affects, <math>X_2</math> and <math>X_3</math> while they in turn cause <math>X_4</math>.<br />
<br />
[[File:cyclic.png|thumb|right|Fig.3 A cyclic graph.]]<br />
[[File:acyclic.png|thumb|right|Fig.4 An acyclic graph.]]<br />
<br />
In directed acyclic graphical models each vertex represents a random variable; a random variable associated with one vertex is distinct from the random variables associated with other vertices. Consider the following example that uses boolean random variables. It is important to note that the variables need not be boolean and can indeed be discrete over a range or even continuous.<br />
<br />
Speaking about random variables, we can now refer to the relationship between random variables in terms of dependence. Therefore, the direction of the arrow indicates "conditional dependence". For example:<br />
<br /><br />
<math>A \longrightarrow B</math>: <math>B\,\!</math> "is dependent on" <math>A\,\!</math>.<br />
<br />
Note that if we do not have any conditional independence, the corresponding graph will be complete, i.e., all possible edges will be present, whereas if we have full independence our graph will have no edges. Between these two extreme cases there exists a large class of graphs. Graphical models are most useful when the graph is sparse, i.e., only a small number of edges exist. The topology of this graph is important, and later we will see examples where we can use graph theory tools to solve probabilistic problems. On the other hand, this representation makes it easier to model causality between variables in real world phenomena.<br />
<br />
====Example====<br />
<br />
In this example we will consider the possible causes for wet grass. <br />
<br />
The wet grass could be caused by rain, or a sprinkler. Rain can be caused by clouds. On the other hand one can not say that clouds cause the use of a sprinkler. However, the causation exists because the presence of clouds does affect whether or not a sprinkler will be used. If there are more clouds there is a smaller probability that one will rely on a sprinkler to water the grass. As we can see from this example the relationship between two variables can also act like a negative correlation. The corresponding graphical model is shown in Figure 5.<br />
<br />
[[File:wetgrass.png|thumb|right|Fig.5 The wet grass example.]]<br />
<br />
This directed graph shows the relation between the 4 random variables. If we have<br />
the joint probability <math>P(C,R,S,W)</math>, then we can answer many queries about this<br />
system.<br />
<br />
This all seems very simple at first, but then we must consider the fact that in the discrete case the joint probability function grows exponentially with the number of variables. If we consider the wet grass example once more, we can see that we need to define <math>2^4 = 16</math> different probabilities for this simple example. The table below, which contains all of the probabilities and the corresponding boolean values for each random variable, is called an ''interaction table''.<br />
<br />
'''Example:'''<br />
<center><math>\begin{matrix}<br />
P(C,R,S,W):\\<br />
p_1\\<br />
p_2\\<br />
p_3\\<br />
.\\<br />
.\\<br />
.\\<br />
p_{16} \\ \\<br />
\end{matrix}</math></center><br />
<br /><br /><br />
<center><math>\begin{matrix}<br />
~~~ & C & R & S & W \\<br />
& 0 & 0 & 0 & 0 \\<br />
& 0 & 0 & 0 & 1 \\<br />
& 0 & 0 & 1 & 0 \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& . & . & . & . \\<br />
& 1 & 1 & 1 & 1 \\<br />
\end{matrix}</math></center><br />
<br />
Now consider an example where there are not 4 such random variables but 400. The interaction table would become too large to manage. In fact, it would require <math>2^{400}</math> rows! The purpose of the graph is to help avoid this intractability by considering only the variables that are directly related. In the wet grass example Sprinkler (S) and Rain (R) are not directly related. <br />
<br />
To solve the intractability problem we need to consider the way those relationships are represented in the graph. Let us define the following parameters. For each vertex <math>i \in V</math>,<br />
<br />
* <math>\pi_i</math>: is the set of parents of <math>i</math> <br />
** e.g. <math>\pi_R = \{C\}</math> (the parent of <math>R</math> is <math>C</math>) <br />
* <math>f_i(x_i, x_{\pi_i})</math>: is a function of <math>x_i</math> and its parents <math>x_{\pi_i}</math> for which it is true that:<br />
** <math>f_i</math> is nonnegative for all <math>i</math><br />
** <math>\displaystyle\sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math><br />
<br />
'''Claim''': There is a family of probability functions <math> P(X_V) = \prod_{i=1}^n f_i(x_i, x_{\pi_i})</math> where this function is nonnegative, and<br />
<center><math><br />
\sum_{x_1}\sum_{x_2}\cdots\sum_{x_n} P(X_V) = 1<br />
</math></center><br />
<br />
To show the power of this claim we can verify it for our wet grass example:<br />
<center><math>\begin{matrix}<br />
P(X_V) &=& P(C,R,S,W) \\<br />
&=& f(C) f(R,C) f(S,C) f(W,S,R)<br />
\end{matrix}</math></center><br />
<br />
We want to show that<br />
<center><math>\begin{matrix}<br />
\sum_C\sum_R\sum_S\sum_W P(C,R,S,W) & = &\\<br />
\sum_C\sum_R\sum_S\sum_W f(C) f(R,C)<br />
f(S,C) f(W,S,R) <br />
& = & 1.<br />
\end{matrix}</math></center><br />
<br />
Consider factors <math>f(C)</math>, <math>f(R,C)</math>, <math>f(S,C)</math>: they do not depend on <math>W</math>, so we<br />
can write this all as<br />
<center><math>\begin{matrix}<br />
& & \sum_C\sum_R\sum_S f(C) f(R,C) f(S,C) \cancelto{1}{\sum_W f(W,S,R)} \\<br />
& = & \sum_C\sum_R f(C) f(R,C) \cancelto{1}{\sum_S f(S,C)} \\<br />
& = & \cancelto{1}{\sum_C f(C)} \cancelto{1}{\sum_R f(R,C)} \\<br />
& = & 1<br />
\end{matrix}</math></center><br />
<br />
since we had already set <math>\displaystyle \sum_{x_i} f_i(x_i, x_{\pi_i}) = 1</math>.<br />
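This cancellation can also be checked numerically. The sketch below (not from the notes) fills in hypothetical conditional probability tables for the wet-grass network and verifies that the product of factors sums to 1 over all <math>2^4</math> configurations; the specific probability values are invented for illustration.<br />

```python
from itertools import product

# Hypothetical conditional probability tables for the wet-grass network
# (values chosen for illustration only; any valid CPTs would work).
f_C = {0: 0.5, 1: 0.5}                                         # f(C) = P(C)
f_RC = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.2, (1, 1): 0.8}    # f(R,C) = P(R|C), keyed (r, c)
f_SC = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.9, (1, 1): 0.1}    # f(S,C) = P(S|C), keyed (s, c)

def f_WSR(w, s, r):
    """f(W,S,R) = P(W|S,R): grass is likely wet if it rains or the sprinkler runs."""
    p_wet = 0.95 if (s or r) else 0.1
    return p_wet if w else 1 - p_wet

# Sum the product of factors over all 16 configurations of (C, R, S, W)
total = sum(f_C[c] * f_RC[(r, c)] * f_SC[(s, c)] * f_WSR(w, s, r)
            for c, r, s, w in product((0, 1), repeat=4))
```

Because each factor is normalized over its first argument, the sums telescope exactly as in the derivation above and <code>total</code> equals 1.<br />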
<br />
Let us consider another example with a different directed graph. <br /><br />
'''Example:'''<br /><br />
Consider the simple directed graph in Figure 6.<br />
<br />
[[File:1234.png|thumb|right|Fig.6 Simple 4 node graph.]]<br />
<br />
Assume that we would like to calculate the following: <math> p(x_3|x_2) </math>. We know that we can write the joint probability as:<br />
<center><math> p(x_1,x_2,x_3,x_4) = f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \,\!</math></center><br />
<br />
We can also make use of Bayes' Rule here: <br />
<br />
<center><math>p(x_3|x_2) = \frac{p(x_2,x_3)}{ p(x_2)}</math></center><br />
<br />
<center><math>\begin{matrix}<br />
p(x_2,x_3) & = & \sum_{x_1} \sum_{x_4} p(x_1,x_2,x_3,x_4) ~~~~ \hbox{(marginalization)} \\<br />
& = & \sum_{x_1} \sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2) f(x_4,x_3) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1) f(x_3,x_2) \cancelto{1}{\sum_{x_4}f(x_4,x_3)} \\<br />
& = & f(x_3,x_2) \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
We also need<br />
<center><math>\begin{matrix}<br />
p(x_2) & = & \sum_{x_1}\sum_{x_3}\sum_{x_4} f(x_1) f(x_2,x_1) f(x_3,x_2)<br />
f(x_4,x_3) \\<br />
& = & \sum_{x_1}\sum_{x_3} f(x_1) f(x_2,x_1) f(x_3,x_2) \\<br />
& = & \sum_{x_1} f(x_1) f(x_2,x_1).<br />
\end{matrix}</math></center><br />
<br />
Thus,<br />
<center><math>\begin{matrix}<br />
p(x_3|x_2) & = & \frac{ f(x_3,x_2) \sum_{x_1} f(x_1)<br />
f(x_2,x_1)}{ \sum_{x_1} f(x_1) f(x_2,x_1)} \\<br />
& = & f(x_3,x_2).<br />
\end{matrix}</math></center><br />
<br />
'''Theorem 1.'''<br />
<center><math>f_i(x_i,x_{\pi_i}) = p(x_i|x_{\pi_i}).\,\!</math></center><br />
<center><math> \therefore \ P(X_V) = \prod_{i=1}^n p(x_i|x_{\pi_i}).\,\!</math></center><br />
<br />
In our simple graph, the joint probability can be written as <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1)p(x_2|x_1) p(x_3|x_2) p(x_4|x_3).\,\!</math></center><br />
<br />
Instead, had we used the chain rule we would have obtained a far more complex equation: <br />
<center><math>p(x_1,x_2,x_3,x_4) = p(x_1) p(x_2|x_1)p(x_3|x_2,x_1) p(x_4|x_3,x_2,x_1).\,\!</math></center><br />
<br />
The ''Markov Property'', or ''Memoryless Property'', holds when a variable <math>X_i</math> is affected only by <math>X_j</math>, so that <math>X_i</math> given <math>X_j</math> is independent of every other random variable. In our example the history of <math>x_4</math> is completely determined by <math>x_3</math>. <br /><br />
By simply applying the Markov Property to the chain-rule formula we would also have obtained the same result.<br />
<br />
Now let us consider the joint probability of the following six-node example found in Figure 7.<br />
<br />
[[File:ClassicExample1.png|thumb|right|Fig.7 Six node example.]]<br />
<br />
If we use Theorem 1 it can be seen that the joint probability density function for Figure 7 can be written as follows: <br />
<center><math> P(X_1,X_2,X_3,X_4,X_5,X_6) = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) \,\!</math></center><br />
<br />
Once again, we can apply the Chain Rule and then the Markov Property and arrive at the same result.<br />
<br />
<center><math>\begin{matrix}<br />
&& P(X_1,X_2,X_3,X_4,X_5,X_6) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_2,X_1)P(X_4|X_3,X_2,X_1)P(X_5|X_4,X_3,X_2,X_1)P(X_6|X_5,X_4,X_3,X_2,X_1) \\<br />
&& = P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2) <br />
\end{matrix}</math></center><br />
<br />
===Independence=== <br />
<br />
====Marginal independence====<br />
We can say that <math>X_A</math> is marginally independent of <math>X_B</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B : & & \\<br />
P(X_A,X_B) & = & P(X_A)P(X_B) \\<br />
P(X_A|X_B) & = & P(X_A) <br />
\end{matrix}</math></center><br />
<br />
====Conditional independence====<br />
We can say that <math>X_A</math> is conditionally independent of <math>X_B</math> given <math>X_C</math> if:<br />
<center><math>\begin{matrix}<br />
X_A \perp X_B | X_C : & & \\<br />
P(X_A,X_B | X_C) & = & P(X_A|X_C)P(X_B|X_C) \\<br />
P(X_A|X_B,X_C) & = & P(X_A|X_C) <br />
\end{matrix}</math></center><br />
Note: Both equations are equivalent.<br />
'''Aside:''' Before we move on, we first define the following terms:<br />
# <math>I</math> is defined as an ordering of the nodes in the graph.<br />
# For each <math>i \in V</math>, <math>V_i</math> is defined as the set of all nodes that appear earlier than <math>i</math> in <math>I</math>, excluding its parents <math>\pi_i</math>.<br />
<br />
Let us consider the example of the six node figure given above (Figure 7). We can define <math>I</math> as follows: <br />
<center><math>I = \{1,2,3,4,5,6\} \,\!</math></center><br />
We can then easily compute <math>V_i</math> for say <math>i=3,6</math>. <br /><br />
<center><math> V_3 = \{2\}, V_6 = \{1,3,4\}\,\!</math></center> <br />
while <math>\pi_i</math> for <math> i=3,6</math> will be. <br /><br />
<center><math> \pi_3 = \{1\}, \pi_6 = \{2,5\}\,\!</math></center> <br />
<br />
We would be interested in finding the conditional independence between random variables in this graph. We know <math>X_i \perp X_{v_i} | X_{\pi_i}</math> for each <math>i</math>. In other words, given its parents the node is independent of all earlier nodes. So:<br /><br />
<math>X_1 \perp \phi | \phi</math>, <br /><br />
<math>X_2 \perp \phi | X_1</math>, <br /><br />
<math>X_3 \perp X_2 | X_1</math>, <br /><br />
<math>X_4 \perp \{X_1,X_3\} | X_2</math>, <br /><br />
<math>X_5 \perp \{X_1,X_2,X_4\} | X_3</math>, <br /><br />
<math>X_6 \perp \{X_1,X_3,X_4\} | \{X_2,X_5\}</math> <br /><br />
To illustrate why this is true we can take a simple example. Show that:<br />
<center><math>P(X_4|X_1,X_2,X_3) = P(X_4|X_2)\,\!</math></center><br />
<br />
Proof: first, we know <br />
<math>P(X_1,X_2,X_3,X_4,X_5,X_6)<br />
= P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)P(X_5|X_3)P(X_6|X_5,X_2)\,\!</math><br />
<br />
then<br />
<center><math>\begin{matrix}<br />
P(X_4|X_1,X_2,X_3) & = & \frac{P(X_1,X_2,X_3,X_4)}{P(X_1,X_2,X_3)}\\<br />
& = & \frac{ \sum_{X_5} \sum_{X_6} P(X_1,X_2,X_3,X_4,X_5,X_6)}{ \sum_{X_4} \sum_{X_5} \sum_{X_6}P(X_1,X_2,X_3,X_4,X_5,X_6)}\\<br />
& = & \frac{P(X_1)P(X_2|X_1)P(X_3|X_1)P(X_4|X_2)}{P(X_1)P(X_2|X_1)P(X_3|X_1)}\\<br />
& = & P(X_4|X_2)<br />
\end{matrix}</math></center><br />
<br />
The other conditional independences can be proven through a similar process.<br />
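These conditional independences can also be verified numerically by brute-force marginalization. The sketch below (illustrative only, not from the notes) builds the six-node network of Figure 7 with random binary conditional probability tables and checks <math>P(X_4|X_1,X_2,X_3) = P(X_4|X_2)</math> for one assignment.<br />

```python
import random
from itertools import product

random.seed(1)
parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (5, 2)}  # Fig. 7

# Random binary CPTs: cpt[i][parent_values] = P(X_i = 1 | parents)
cpt = {i: {pv: random.random() for pv in product((0, 1), repeat=len(ps))}
       for i, ps in parents.items()}

def joint(x):
    """P(x_1,...,x_6) = product of p(x_i | x_{pi_i}) (Theorem 1); x is a 0/1 tuple."""
    p = 1.0
    for i, ps in parents.items():
        p1 = cpt[i][tuple(x[j - 1] for j in ps)]
        p *= p1 if x[i - 1] == 1 else 1 - p1
    return p

def cond(target, given):
    """P(X_target = 1 | given), given = dict {var: value}, by brute-force marginalization."""
    num = den = 0.0
    for x in product((0, 1), repeat=6):
        if any(x[v - 1] != val for v, val in given.items()):
            continue
        p = joint(x)
        den += p
        if x[target - 1] == 1:
            num += p
    return num / den

lhs = cond(4, {1: 1, 2: 1, 3: 0})   # P(X4=1 | X1=1, X2=1, X3=0)
rhs = cond(4, {2: 1})               # P(X4=1 | X2=1)
```

Whatever random CPT values are drawn, <code>lhs</code> and <code>rhs</code> agree, as the algebraic proof above predicts.<br />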
<br />
====Sampling====<br />
Even though graphical models greatly facilitate obtaining the joint probability, exact inference is not always feasible. Exact inference is feasible only in small to medium-sized networks; in large networks it takes far too long. Therefore, we resort to approximate inference techniques, which are much faster and usually give quite good results.<br />
<br />
In sampling, random samples are generated and values of interest are computed from those samples rather than from the original distribution.<br />
<br />
As input you have a Bayesian network with a set of nodes <math>X\,\!</math>. A sample may include all variables (except the evidence <math>E</math>) or a subset of them. A sampling schema dictates how to generate the samples (tuples). Ideally, samples are distributed according to <math>P(X|E)\,\!</math>.<br />
<br />
Some sampling algorithms:<br />
* Forward Sampling<br />
* Likelihood weighting<br />
* Gibbs Sampling (MCMC)<br />
** Blocking<br />
** Rao-Blackwellised<br />
* Importance Sampling<br />
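As an illustration of the simplest scheme, forward (ancestral) sampling draws each node given its already-sampled parents, in topological order. The sketch below (not from the notes; the CPT values are hypothetical) applies it to the wet-grass network and estimates <math>P(R=1|W=1)</math> by keeping only the samples consistent with the evidence.<br />

```python
import random

def forward_sample(rng):
    """Draw one joint sample (C, R, S, W) by sampling each node given its parents,
    in topological order. The CPT values are hypothetical."""
    c = rng.random() < 0.5                          # P(C=1)
    r = rng.random() < (0.8 if c else 0.2)          # P(R=1 | C)
    s = rng.random() < (0.1 if c else 0.5)          # P(S=1 | C)
    w = rng.random() < (0.95 if (r or s) else 0.1)  # P(W=1 | R, S)
    return c, r, s, w

rng = random.Random(0)
samples = [forward_sample(rng) for _ in range(100000)]
# Monte Carlo estimate of P(R=1 | W=1): discard samples with W=0.
# This rejection-style use of forward samples is wasteful when the evidence is
# unlikely; likelihood weighting avoids discarding samples.
wet = [smp for smp in samples if smp[3]]
p_rain_given_wet = sum(smp[1] for smp in wet) / len(wet)
```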
<br />
==Bayes Ball== <br />
The Bayes Ball algorithm can be used to determine whether two random variables represented in a graph are independent. The algorithm can show either that two nodes in a graph are independent OR that they are not necessarily independent; it cannot show that two nodes are dependent. In other words, it provides rules that enable us to perform this task using the graph alone, without the need to use the probability distributions. The algorithm will be discussed further in later parts of this section. <br />
<br />
===Canonical Graphs===<br />
In order to understand the Bayes Ball algorithm we need to first introduce 3 canonical graphs. Since our graphs are acyclic, we can represent them using these 3 canonical graphs. <br />
<br />
====Markov Chain (also called serial connection)====<br />
In the following graph (Figure 8), X is independent of Z given Y. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Markov.png|thumb|right|Fig.8 Markov chain.]]<br />
<br />
We can prove this independence: <br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\ <br />
& = & \frac{P(X)P(Y|X)P(Z|Y)}{P(X)P(Y|X)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
Where<br />
<br />
<center><math>\begin{matrix}<br />
P(X,Y) & = & \displaystyle \sum_Z P(X,Y,Z) \\<br />
& = & \displaystyle \sum_Z P(X)P(Y|X)P(Z|Y) \\<br />
& = & P(X)P(Y | X) \displaystyle \sum_Z P(Z|Y) \\<br />
& = & P(X)P(Y | X)\\<br />
\end{matrix}</math></center><br />
<br />
Markov chains are an important class of distributions with applications in communications, information theory and image processing. They are suitable for modelling memory in a phenomenon. For example, suppose we want to study the frequency of appearance of English letters in a text. Most likely, when "q" appears the next letter will be "u"; this shows a dependency between these letters. Markov chains are a suitable model for this kind of relation. <br />
[[File:Markovexample.png|thumb|right|Fig.8a Example of a Markov chain.]]<br />
Markov chains also play a significant role in biological applications. They are widely used in the study of carcinogenesis (the initiation of cancer formation): a gene has to undergo several mutations before it becomes cancerous, and this progression can be modelled with a Markov chain. An example is given in Figure 8a, which shows only two gene mutations.<br />
<br />
====Hidden Cause (diverging connection)====<br />
In the Hidden Cause case we can say that X is independent of Z given Y. In this case Y is the hidden cause and if it is known then Z and X are considered independent. <br />
<br />
We say that: <math>X</math> <math>\perp</math> <math>Z</math> <math>|</math> <math>Y</math><br />
<br />
[[File:Hidden.png|thumb|right|Fig.9 Hidden cause graph.]]<br />
<br />
The proof of the independence: <br />
<br />
<center><math>\begin{matrix}<br />
P(Z|X,Y) & = & \frac{P(X,Y,Z)}{P(X,Y)}\\<br />
& = & \frac{P(X)P(Y|X)P(Z|Y)}{P(X)P(Y|X)}\\<br />
& = & P(Z|Y)<br />
\end{matrix}</math></center><br />
<br />
The Hidden Cause case is best illustrated with an example: <br /><br />
<br />
[[File:plot44.png|thumb|right|Fig.10 Hidden cause example.]]<br />
<br />
In Figure 10 it can be seen that both "Shoe Size" and "Grey Hair" are dependent on the age of a person. Without "Age" in the picture, the variables "Shoe size" and "Grey hair" are dependent in some sense: we must conclude that those with a large shoe size also have a greater chance of having grey hair. However, when "Age" is observed, there is no dependence between "Shoe size" and "Grey hair", because we can deduce both based only on the "Age" variable.<br />
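With hypothetical numbers, this example can be made concrete. The sketch below (all probability values invented for illustration) shows that shoe size and grey hair are marginally dependent but conditionally independent given age.<br />

```python
from itertools import product

# Hypothetical model: Age -> ShoeSize, Age -> GreyHair (binary: 1 = old / large / grey)
p_age = {0: 0.5, 1: 0.5}
p_shoe = {0: 0.2, 1: 0.8}   # P(large shoe | age)
p_grey = {0: 0.05, 1: 0.7}  # P(grey hair | age)

def joint(a, s, g):
    """P(age, shoe, grey) = P(age) P(shoe|age) P(grey|age)."""
    return (p_age[a]
            * (p_shoe[a] if s else 1 - p_shoe[a])
            * (p_grey[a] if g else 1 - p_grey[a]))

# Marginally, shoe size carries information about grey hair...
p_g = sum(joint(a, s, 1) for a in (0, 1) for s in (0, 1))
p_g_given_s1 = (sum(joint(a, 1, 1) for a in (0, 1))
                / sum(joint(a, 1, g) for a in (0, 1) for g in (0, 1)))
# ...but once age is observed, it carries none:
p_g_given_a1 = sum(joint(1, s, 1) for s in (0, 1)) / p_age[1]
p_g_given_a1_s1 = joint(1, 1, 1) / sum(joint(1, 1, g) for g in (0, 1))
```

Here <code>p_g_given_s1</code> exceeds <code>p_g</code> (marginal dependence), while <code>p_g_given_a1</code> equals <code>p_g_given_a1_s1</code> (conditional independence given age).<br />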
<br />
====Explaining-Away (converging connection)====<br />
<br />
Finally, we look at the third type of canonical graph:<br />
''Explaining-Away Graphs''. This type of graph arises when a<br />
phenomena has multiple explanations. Here, the conditional<br />
independence statement is actually a statement of marginal<br />
independence: <math>X \perp Z</math>. This type of graph is also called a "V-structure" or "V-shape" because of its appearance (Fig. 11). <br />
<br />
[[File:ExplainingAway.png|thumb|right|Fig.11 The missing edge between node X and node Z implies that<br />
there is a marginal independence between the two: <math>X \perp Z</math>.]]<br />
<br />
In these types of scenarios, variables X and Z are independent.<br />
However, once the third variable Y is observed, X and Z become<br />
dependent (Fig. 11).<br />
<br />
To clarify these concepts, suppose Bob and Mary are supposed to<br />
meet for a noontime lunch. Consider the following events:<br />
<br />
<center><math><br />
late =\begin{cases}<br />
1, & \hbox{if Mary is late}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
aliens =\begin{cases}<br />
1, & \hbox{if aliens kidnapped Mary}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
<center><math><br />
watch =\begin{cases}<br />
1, & \hbox{if Bob's watch is incorrect}, \\<br />
0, & \hbox{otherwise}.<br />
\end{cases}<br />
</math></center><br />
<br />
If Mary is late, then she could have been kidnapped by aliens.<br />
Alternatively, Bob may have forgotten to adjust his watch for<br />
daylight savings time, making him early. Clearly, both of these<br />
events are independent. Now, consider the following<br />
probabilities:<br />
<br />
<center><math>\begin{matrix}<br />
P( aliens = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1 ) \\<br />
P( aliens = 1 ~|~ late = 1, watch = 0 )<br />
\end{matrix}</math></center><br />
<br />
We expect <math>P( aliens = 1 ) < P( aliens = 1 ~|~ late = 1 )</math>, since Mary's lateness is evidence for the kidnapping. We also expect <math>P( aliens = 1 ~|~ late = 1 ) < P( aliens = 1 ~|~ late = 1, watch = 0 )</math>: if Bob's watch is correct, the alternative explanation for the lateness is ruled out, which makes the kidnapping more plausible. Since <math>P( aliens = 1 ~|~ late = 1 ) \neq P( aliens = 1 ~|~ late = 1, watch = 0 )</math>, ''aliens'' and<br />
''watch'' are not independent given ''late''. To summarize,<br />
* If we do not observe ''late'', then ''aliens'' <math>~\perp~ watch</math> (<math>X~\perp~ Z</math>)<br />
* If we do observe ''late'', then ''aliens'' <math> ~\cancel{\perp}~ watch ~|~ late</math> (<math>X ~\cancel{\perp}~ Z ~|~ Y</math>)<br />
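The explaining-away effect can be checked with hypothetical numbers. In the sketch below (all probability values are invented for illustration), observing that Bob's watch is correct removes the alternative explanation and raises the posterior probability of the kidnapping.<br />

```python
from itertools import product

# Hypothetical parameters for the lunch story (chosen for illustration only)
p_aliens = 0.001
p_watch = 0.1                        # P(Bob's watch is incorrect)

def p_late(a, w):                    # P(Mary appears late | aliens, watch)
    if a:
        return 0.99                  # kidnapped -> almost surely late
    if w:
        return 0.9                   # wrong watch -> Bob thinks she is late
    return 0.05

def joint(a, w, l):
    pl = p_late(a, w)
    return ((p_aliens if a else 1 - p_aliens)
            * (p_watch if w else 1 - p_watch)
            * (pl if l else 1 - pl))

def cond_aliens(**given):
    """P(aliens = 1 | given), by enumerating the 8 joint configurations."""
    num = den = 0.0
    for a, w, l in product((0, 1), repeat=3):
        vals = {'a': a, 'w': w, 'l': l}
        if any(vals[k] != v for k, v in given.items()):
            continue
        p = joint(a, w, l)
        den += p
        if a:
            num += p
    return num / den

p1 = cond_aliens(l=1)        # P(aliens | late)
p2 = cond_aliens(l=1, w=0)   # P(aliens | late, watch correct)
```

With these numbers, <code>p_aliens &lt; p1 &lt; p2</code>: lateness is evidence for the kidnapping, and ruling out the watch explanation strengthens that evidence.<br />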
<br />
===Bayes Ball Algorithm===<br />
<br />
'''Goal:''' We wish to determine whether a given conditional<br />
statement such as <math>X_{A} ~\perp~ X_{B} ~|~ X_{C}</math> is true given a directed graph.<br />
<br />
The algorithm is as follows:<br />
<br />
# Shade nodes, <math>~X_{C}~</math>, that are conditioned on, i.e. they have been observed.<br />
# Assuming that the initial position of the ball is <math>~X_{A}~</math>: <br />
# If the ball cannot reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> must be conditionally independent.<br />
# If the ball can reach <math>~X_{B}~</math>, then the nodes <math>~X_{A}~</math> and <math>~X_{B}~</math> are not necessarily independent.<br />
<br />
The biggest challenge in the ''Bayes Ball Algorithm'' is to<br />
determine what happens to a ball going from node X to node Z as it<br />
passes through node Y. The ball could continue its route to Z or<br />
it could be blocked. It is important to note that the balls are<br />
allowed to travel in any direction, independent of the direction<br />
of the edges in the graph.<br />
<br />
We use the canonical graphs previously studied to determine the<br />
route of a ball traveling through a graph. Using these three<br />
graphs, we establish the Bayes ball rules which can be extended for more<br />
graphical models.<br />
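One standard formulation of these rules is Shachter's Bayes-Ball reachability procedure, sketched below (this implementation is illustrative, not from the notes): the ball passes through an unobserved node along serial and diverging connections, and passes through a converging connection only when the node or one of its descendants is observed.<br />

```python
from collections import defaultdict

def bayes_ball_reachable(parents, start, observed):
    """Nodes reachable from `start` given `observed`, via the Bayes-ball rules.
    Any node NOT returned is d-separated from `start` given `observed`."""
    children = defaultdict(list)
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)
    # Phase 1: observed nodes and their ancestors (needed for the explaining-away rule)
    anc = set()
    stack = list(observed)
    while stack:
        y = stack.pop()
        if y not in anc:
            anc.add(y)
            stack.extend(parents.get(y, ()))
    # Phase 2: traverse (node, direction) pairs; 'up' = ball arrived from a child
    visited, reachable = set(), set()
    frontier = [(start, 'up')]
    while frontier:
        y, d = frontier.pop()
        if (y, d) in visited:
            continue
        visited.add((y, d))
        if y not in observed:
            reachable.add(y)
        if d == 'up' and y not in observed:
            frontier += [(p, 'up') for p in parents.get(y, ())]
            frontier += [(c, 'down') for c in children[y]]
        elif d == 'down':
            if y not in observed:           # serial / diverging: pass through
                frontier += [(c, 'down') for c in children[y]]
            if y in anc:                    # converging: an observed descendant opens the path
                frontier += [(p, 'up') for p in parents.get(y, ())]
    return reachable

# The six-node network of Fig. 7
g = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (2, 5)}
r = bayes_ball_reachable(g, 4, {2})

# The explaining-away V-structure: observing 'l' opens the path between its parents
v = {'a': (), 'w': (), 'l': ('a', 'w')}
r_unobs = bayes_ball_reachable(v, 'a', set())
r_obs = bayes_ball_reachable(v, 'a', {'l'})
```

Here <code>r</code> excludes <math>X_1</math> and <math>X_3</math>, matching <math>X_4 \perp \{X_1,X_3\} ~|~ X_2</math> from the six-node example, while the V-structure becomes connected only once <code>l</code> is observed.<br />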
<br />
====Markov Chain (serial connection)====<br />
[[File:BB_Markov.png|thumb|right|Fig.12 (a) When the middle node is shaded, the ball is blocked. (b) When the middle node is not shaded, the ball passes through Y.]]<br />
<br />
A ball traveling from X to Z or from Z to X will be blocked at<br />
node Y if this node is shaded. Alternatively, if Y is unshaded,<br />
the ball will pass through.<br />
<br />
In (Fig. 12(a)), X and Z are conditionally<br />
independent ( <math>X ~\perp~ Z ~|~ Y</math> ) while in<br />
(Fig.12(b)) X and Z are not necessarily<br />
independent.<br />
<br />
====Hidden Cause (diverging connection)====<br />
[[File:BB_Hidden.png|thumb|right|Fig.13 (a) When the middle node is shaded,