quantifying cancer progression with conjunctive Bayesian networks

Motivation

Tumor progression is characterized by a sequence of multiple genetic mutations that arise due to activation of oncogenes and inactivation of tumor suppressor genes. It is still unknown about the temporal order of these mutations, as carcinogenic process is a slow process and takes several years. Biologically motivated mathematical models such as "Evolutionary dynamics" have been used to describe the sequence of events. Several statistical models such as oncogenetic trees, network trees, probabilistic network models, etc have been used to model disease progression, in particular cancer. Genetic events happen in no specific order, due to which a single node can have multiple parents. This lead to the use of a more generalization framework for tree models called as the conjunctive Bayesian networks. In simple words, a conjunctive Bayesian network is a directed acyclic graph that allows for multiple parent nodes. One biggest advantage of this model is that it can model multiple mutational pathways.

Introduction

A poset $P$ or a partially ordered set has a binary relation "$\lt$" which has the following properties:

• Reflexive
• Antisymmetry
• Transitive

In this model, $P$ denotes the set of mutational events and the binary relation defines the order of occurrence of the constraints. For example, $p \lt q$ denotes that mutation $q$ can occur only after the occurrence of mutation $p$. We denote $p$ as the parent of $q$ if there exists no node $r \in P$ such that $r\neq p$, $r\neq q$ and $p\lt r\lt q$. Denote $p\rightarrow q$ to say that $p$ is the parent of $q$. The set of all parents is denoted by $pa(q)$. We now construct the distributive lattice of order ideals of $P$ denoted by $J(P)$. The distributive lattice is defined as follows: all the subsets $S\subset P$ and $S\in J(P)$. We say that $S\in J(P)$ if and only if for all $q\in S$ and $p \lt q$ then $p\in S$.

A conjunctive Bayesian network (CBN) is characterized by a set of events $E$ along with a partial order "$\lt$" that is defined on the events and with parameters $\theta_e$ for each events $e\in E$. The state space of this CBN is the distributive lattice $J(E)$ of order ideals in the events set $E$. Elements of distributive lattice are called as genotypes. To summarize, a CBN is:

• a model that places partial order on the genetic mutations
• assumes total number of mutations as fixed, say $n$
• model assumes no reverse mutations at any point of time in the process
• generates a lattice of possible genotypes

Here is a simple example that illustrates a poset and construction of distributive lattice taken from <ref name="R1"> Beerenwinkel N. and Sullivant S, “Markov models for accumulating mutations,” in Biometrika, 96, 3 ,pp. 645–661, 2009.</ref>

A poset that is defined on the four element set as $[4]={1,2,3,4}$ constrained on the binary relation $1\lt 3$, $2\lt 3$ and $2\lt 4$.

Fig.1 A sample poset with 4 mutations

We notice that mutation 3 cannot happen until the occurrence of mutations 1 and 2. Mutation 4 cannot happen until the occurrence of mutation 2. Then the distributive lattice can be constructed as follows for Fig 1:

Fig.2 lattice of order ideals

Bayesian networks and detection of cancer

A tumor has to grow to a minimum size in order to detect it using CT scans or MRI scans. Firstly, a malignant tumor has to grow to a particular size and secondly the clinical diagnosis has to be correct. A model is formulated as follows:

• $T$: Waiting time for tumor to develop
• $T_s$: Waiting time for clinical diagnosis

The dynamics of CBN is a hidden process due to one of the observation variable, and hence this model is treated as a hidden CBN. Figure 3 shows a simple CBN with a hidden process.

File:CBNupdated1.png
Fig.3 Sample CBN for cancer diagnosis

Both the times defined above are random variables. In general, $T$ and $T_s$ are not known due to which we assume that both are independent. Therefore their joint distribution is given by $f(t,t_s)=f(t)f(t_s)$. Cancer is detected only when it is observed during the diagnosis time. Suppose $X\in{0,1}$ be a binary random variable that indicates presence of cancer at $T_s$. If cancer is diagnosed at $T_s$ then $X$ is set to be one. Then the probability of $X$ is given by:

$Prob(X) = \int_0^{\infty}\int_0^{\infty}Prob(X|T=t,T_s=t_s)f(t)f(t_s)dt dt_s$

where the conditional probability

$Prob(X=1|T=t,T_s=t_s)=I(t\lt t_s)$

is called as the indicator function. Model described above assumes that diagnosis is always correct. Practically, there is always a chance that diagnosis might go wrong. Let us suppose that diagnosis is overlooked or misdiagnosed with a probability $\epsilon$. This assumption leads us to saying that, clinical diagnosis is also regarded a probabilistic event $Y$ that depends on $X$. Hence, the probability of $Y$ is given by

$Prob(Y) = \sum _{X=0,1}Prob_{\epsilon}(Y|X)Prob(X)$

where $Prob(X)$ is defined as above and,

$Prob_{\epsilon}(Y|X)= \epsilon^{I(Y\neq X)} (1-\epsilon)^{I(Y=X)}$

Variables ${T,T_s,X,Y}$ form a Bayesian network and the joint density can be factorized into conditional densities.

Conjunctive Bayesian networks for multiple pathways

Model discussed in the above section is extended to examine progression of cancer. Let us assume that there are "n" mutations in the system and waiting time for one mutation is denoted by $T_i$ for $1\leq i \leq n$. Clearly, waiting time for each mutation is a random variable. Let us look at an example of defining waiting times for each mutational event in a poset $P$. Suppose that for each event in the poset we define a random variable $Z_i$$\sim$$\exp(\lambda_i)$, where $\lambda_i$ is a parameter. Then waiting times for the events defined in figure 1 are the following:

• $T_1=Z_1$
• $T_2=Z_2$
• $T_3=max(T_1,T_2)+Z_3$
• $T_4=T_2+Z_4$
Fig.3 Hidden CBN with multiple pathways

Now we defined a general form for the waiting times of "n" events in a poset $P$ as follows:

$T_i \sim \exp(\lambda_i) + max _{j\in pa(i)} T_j$

The density function $T_i$ conditioned on all its parent nodes is given by :

$f_{T_i|T_j,j\in pa(i)} (t_i|{t_j}) = \lambda_i \exp(-\lambda_i(t_i-max_{j\in pa(i)}t_j))I(t_i\gt max_{j\in pa(i)}t_j)$

Random variable $T$ is a vector of time points that satisfy binary relations defined on the poset $P$ and the relation $(t_i\gt max_{j\in pa(i)}t_j)$ for all $i \in [n]$. Assume that the waiting time $T_s$ is also independently exponentially distributed with parameter $\lambda_s$, i.e. $T_s$$\sim$$\exp(\lambda_s).$. Like in the previous section, we define $X=(X_1,X_2,...,X_n) \in {0,1}^n$ as a binary vector. Then the conditional density of $X$ is factorized as

$Prob(X|T,T_s) = \prod_{i=1}^n Prob(X_i|T_i,T_s)$

and we obtain for a particular poset $P$ as follows:

$Prob_{\lambda,P}(X) = Prob_{\lambda,P} [max_{i:X_i=1}T_i \lt T_s \lt min_{i:X_j=0}T_j ]$

where the minimum condition makes sure that the unobserved mutations have not occurred before the stopping time $T_s$. It requires that al the mutations $X_i$ have been diagnosed correctly for the models parameter estimation. Due to diagnostic errors, there might be a chance of observation errors which we denote by $Y=(Y_1,Y_2,...,Y_n)$. This observation process is modeled as probability of false observation by a parameter $\epsilon$. Due to the independency of the conditioned variables $Y_i|X_i$ for each $i \in [n]$, the conditional probability of observation $Y$ for a given $X$ is:

$Prob_{\epsilon}(Y|X) = \prod_{i=1}^n Prob_{\epsilon}(Y_i|X_i) = \epsilon ^{d(X,Y)} (1-\epsilon)^{n-d(X,Y)}$

where $d(X,Y)=\sum_{i=1}^n|X_i-Y_i|$ is the Hamming distance between the genotype $X$ and the observation genotype $Y$. In simple words, Hamming distance counts the number of mismatches between the observed genotype and the true genotype. The dynamics is a hidden process because the model is censored by a stopping event and the observation variable has errors. For parameter estimation, we use Bayes theorem to compute the posterior probability of observing the true genotype $X$ given an observation genotype $Y$ as

$Prob_{\epsilon}(X|Y) = \frac{Prob_{\lambda,P}(X)Prob_{\epsilon}(Y|X)} {\sum_{X\in J(P)}Prob_{\lambda,P}(X) Prob_{\epsilon}(Y|X)}$

Parameter estimation

Model parameters $\epsilon, \lambda$ cane be estimated using "Expectation-Maximization" algorithm. The method of simulated annealing is used to estimate the poset $P$. The joint probability of "N" independent observations $Y=(Y_1,Y_2,...,Y_n)$ can be factorized as

$Prob_{\epsilon,\lambda,P}(Y) = \prod_{l=1} ^N Prob_{\epsilon,\lambda,P}(Y^{(l)}) = \prod_{l=1} ^N \sum_{X\in J(P)} Prob_{\lambda,P}(X) Prob_{\epsilon}(Y^{(l)}|X)$

Then the log-likelihood is given by

$l_Y(\epsilon,\lambda,P) = \sum_{l=1} ^N \log(\sum_{X\in J(P)} \epsilon ^{d(X,Y^{(l)})} (1-\epsilon)^{n-d(X,Y^{(l)})} Prob_{\lambda,P}(X))$

Question: To maximize the log-likelihood for given observation $Y$. It depends on the error rate $\epsilon$ and the waiting times $\lambda$.

• EM algorithm can be used to estimate $\lambda$ if $P, X$ are known quantities.
• For fixed $P$ and a hidden $X$, this can be incorporated to a nested loop in the EM algorithm; where in the external loop estimates the parameter $\tilde \lambda$ and the internal loop estimates the error rate $\tilde\epsilon$ for a given $\tilde \lambda^{(k)}$

Note: If the variables $X, Y$ are known then the maximum likelihood for the observation error rate is average distance between the two genotypes.

Since the variable $X$ is hidden, the error rate $\tilde\epsilon$ is computed which is the E-step and use it in the M-step to compute maximum likelihood estimate as

$\tilde\epsilon^{(j+1)} = \frac{1}{nN} \sum_{l=1} ^N (\sum_{X\in J(P)} d(X,Y^{l})Prob_{\tilde\epsilon^{(j},\tilde\lambda^{(k)},P}(X|Y^{(l)}))$

The convergence of this estimate gives $\tilde\epsilon$ that locally maximizes the function $l_Y(\epsilon,\lambda,P)$, which is used to estimate $\lambda$. For "N" realization of the waiting times parameters $T_i$, maximum likelihood estimate for the parameter $\lambda_i$ is

$\tilde\lambda_i = \frac{N}{\sum_{l=1}^N T_i^{(l)}-max_{j\in pa(i)}T_j^{(l)}}$

wherein the denominator can be replaced by the following quantity

$E_{\tilde\lambda^{(k)},\tilde\epsilon^{(k)},P}( T_i^{(l)}-max_{m\in pa(i)}T_m|Y^{l}) = \sum_{X\in J(P)} E_{\tilde\lambda^{(k)},P} (T_i^{(l)}-max_{m\in pa(i)}T_m|X) Prob_{\tilde\lambda^{(k)},\tilde\epsilon^{(k)},P}(X|Y^{l})$

Expectations are computed through dynamics programming. The values estimated from the above equation are inserted into the M-step to estimate $\tilde\lambda^{(k+1)}$. This process is done till the convergence is achieved. Since EM algorithm locally maximizes the data for a particular poset "P". Maximum likelihood poset is selected as $\tilde P = arg max_{P} l_Y(\tilde\epsilon,\tilde\lambda,P)$. A heuristic way to determine this value is through simulated annealing method. In this approach, $l_Y(\tilde\epsilon,\tilde\lambda,P)$ is computed for a given data set and a poset "P", then randomly another poset $P_1$ is generated and the algorithm continues till $l_Y(\tilde\epsilon,\tilde\lambda,P_1) \gt l_Y(\tilde\epsilon,\tilde\lambda,P)$.

Results

This model analyzed various data sets for different cancer types such as renal cancer, colon cancer.

Notes

Beerenwinkel (one of the authors) previously put some assumptions and followed them when modelling the accumulative evolutionary process. Such assumptions are:

1. Substitutions do not occur independently. There are preferred evolutionary pathways in which mutations are fixed

2. The fixation mutations into the population is definite. This means that substitutions are non-reversible

3. At each time point, the virus population is dominated by a single strain and clones are independent and (sometimes erroneous) copies of this genotype

Improvements

As mentioned in the paper, an improvement on the proposed model would be to use different parameters $\varepsilon^+$ and $\varepsilon^-$ for false positives and false negatives in the error model. Beerenwinkel and Drton have developed this idea.

Let $\varepsilon^+ = (\varepsilon_1^+,...,\varepsilon_M^+) \in [0, 1]^M$ and $\varepsilon^- = (\varepsilon_1^-,...,\varepsilon_M^-) \in [0, 1]^M$ be parameter vectors that contain the mutation specific probabilities of observing a false positive and a false negative respectively. False positives (negatives) are mutations observed in clones derived from a virus population that is in mutant state at such time point. The false positive and false negative negative rates summarize differences from the population state. Then, these parameters quantify the expected genetic diversity of the virus population. Conditionally upon the hidden state $X_{jm}$, the probabilities of observing mutation $m$ in clone $k$ at time point $t_j$ are as follows:

$\begin{matrix} \theta^l(\varepsilon_m^+, \varepsilon_m^-) = \begin{matrix} & 0 & 1\\ 0 & 1-\varepsilon_m^+ & \varepsilon_m^+\\ 1 & \varepsilon_m^- & 1-\varepsilon_m^- \end{matrix} \end{matrix}$

The entries of this matrix are the conditional probabilities

$\begin{matrix} \theta^l(\varepsilon_m^+, \varepsilon_m^-)_{x_{jm},y_{jkm}} = Prob(Y_{jkm}=y_{jkm}|X_{jm}=x_{jm}) \end{matrix}$

then the model is concluded accordingly.

<references />