stat946w18/Implicit Causal Models for Genome-wide Association Studies

Introduction and Motivation

Probabilistic modeling has advanced rapidly, enabling rich generative models. These models have been combined with neural networks and implicit densities, and scalable algorithms allow Bayesian inference on very large data. However, most of these models focus on capturing statistical relationships rather than causal relationships. A causal relationship is one in which one event is the result of another, i.e., cause and effect. Causal models tell us how manipulating the generative process could change the final results.

Genome-wide association studies (GWAS) are an example of a problem built on causal relationships. A genome is the complete set of DNA in an organism and contains information about the organism's attributes. Specifically, GWAS aims to determine how genetic factors cause disease in humans. The genetic factors here are single nucleotide polymorphisms (SNPs), and having a particular disease is treated as a trait, i.e., the outcome. To understand why a disease develops and how to cure it, the causation between SNPs and diseases is investigated: first, predict which SNP or SNPs cause the disease; second, target the selected SNPs to cure the disease.

The figure below depicts an example Manhattan plot for a GWAS. Each dot represents an SNP. The x-axis is the chromosome location, and the y-axis is the negative log of the association p-value between the SNP and the disease, so points with the largest values represent strongly associated risk loci.
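As a small illustration of how the y-axis of a Manhattan plot is computed, the sketch below converts p-values into plot heights. It assumes the conventional base-10 logarithm; the SNP names and p-values are made up for illustration and do not come from the paper.

```python
import math

def manhattan_height(p_value):
    """Height of a point on a Manhattan plot: -log10 of the association p-value."""
    return -math.log10(p_value)

# Made-up p-values for three hypothetical SNPs.
p_values = {"snp_a": 0.03, "snp_b": 5e-8, "snp_c": 0.4}
heights = {name: manhattan_height(p) for name, p in p_values.items()}

# The SNP with the smallest p-value sits highest on the plot.
top = max(heights, key=heights.get)
print(top, round(heights[top], 2))
```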

This paper focuses on two challenges in combining modern probabilistic models with causality. The first is how to build rich causal models that meet the specific needs of GWAS. In general, probabilistic causal models involve a function $f$ and a noise term $n$. For simplicity, $f$ is usually assumed to be a linear model with Gaussian noise. However, problems like GWAS require models with nonlinear, learnable interactions among the inputs and the noise.

The second challenge is how to address latent population-based confounders. Latent confounders are a problem for causal models because we can neither observe them nor know their underlying structure. For example, in GWAS, both latent population structure, i.e., subgroups in the population with ancestry differences, and relatedness among sampled individuals produce spurious correlations between SNPs and the trait of interest. Existing methods cannot easily accommodate this complex latent structure.

For the first challenge, the authors develop implicit causal models, a class of causal models that leverages neural architectures with an implicit density. For GWAS, implicit causal models generalize previous methods to capture important nonlinearities, such as gene-gene and gene-population interactions. Building on this, for the second challenge, they describe an implicit causal model that adjusts for population-based confounders by sharing strength across examples (genes).

A growing body of work on causal models focuses on causal discovery and typically relies on strong assumptions, such as Gaussian processes on the noise variable or nonlinearities for the main function.

Implicit Causal Models

Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.

Probabilistic Causal Models

Probabilistic causal models represent each variable as a deterministic function of noise and other variables. Consider background noise $\epsilon$, representing unknown background quantities that are jointly independent, and a global variable $\beta$, some function of this noise, where

$\beta$ and $x$ are each functions of noise; $y$ is a function of noise and $x$.

The target is the causal mechanism $f_y$, from which the causal effect $p(y|do(X=x),\beta)$ can be calculated. Here $do(X=x)$ means that $X$ is set to the value $x$ under the fixed structure $\beta$. Following prior work, it is assumed that $p(y|do(x),\beta) = p(y|x, \beta)$.

An example of a probabilistic causal model is the additive noise model.

$f(\cdot)$ is usually a linear function, or a spline function to capture nonlinearities. $\epsilon$ is assumed to be standard normal, which makes $y$ normally distributed given $x$ and $\beta$. Thus the posterior $p(\theta | x, y, \beta)$ can be represented as

where $p(\theta)$ is the prior, which is known. Variational inference or MCMC can then be applied to calculate the posterior distribution.
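To make the additive noise model concrete, here is a minimal simulation sketch. It assumes the common form $y = f(x, \beta \mid \theta) + \epsilon$ with a linear $f$ and standard normal $\epsilon$; the function name and the coefficient values are illustrative assumptions, not taken from the paper.

```python
import random

def simulate_additive_noise(x, beta, theta, rng):
    """Draw y = f(x, beta | theta) + eps with a linear f and eps ~ Normal(0, 1)."""
    w_x, w_beta = theta          # hypothetical linear coefficients
    eps = rng.gauss(0.0, 1.0)    # additive standard normal noise
    return w_x * x + w_beta * beta + eps

rng = random.Random(0)
theta = (2.0, -1.0)

# Under the assumption p(y | do(x), beta) = p(y | x, beta), intervening on x
# amounts to evaluating the mechanism at a chosen value of x.
ys = [simulate_additive_noise(x=1.0, beta=0.5, theta=theta, rng=rng)
      for _ in range(20000)]
mean_y = sum(ys) / len(ys)
print(round(mean_y, 2))  # close to 2.0 * 1.0 - 1.0 * 0.5 = 1.5
```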

Implicit Causal Models

The difference between implicit causal models and probabilistic causal models lies in how the noise is handled. Instead of using an additive noise term, implicit causal models feed the noise $\epsilon$ directly into a function that outputs $x$ given parameters $\theta$.

$x=g(\epsilon | \theta), \quad \epsilon \sim s(\cdot)$
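A minimal sketch of this idea, assuming $g$ is a small fully connected ReLU network and $s(\cdot)$ is a standard normal, is shown below. The layer sizes, noise dimension, and helper names are illustrative assumptions; the parameters are random rather than learned.

```python
import random

def mlp(inputs, layers):
    """Fully connected network: ReLU on hidden layers, linear output layer."""
    h = list(inputs)
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h

def init_layers(sizes, rng):
    """Random parameters theta for layer sizes such as [2, 8, 1]."""
    return [
        ([[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def sample_x(theta, rng, noise_dim=2):
    """x = g(eps | theta), with eps ~ s(.) taken here to be standard normal."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return mlp(eps, theta)[0]

rng = random.Random(1)
theta = init_layers([2, 8, 1], rng)
samples = [sample_x(theta, rng) for _ in range(5)]
print(samples)
```

The density of $x$ is never written down: the model is defined only through its sampling procedure, which is what makes it implicit.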

The causal diagram has changed to:

They use fully connected neural networks with a fair number of hidden units to approximate each causal mechanism. Below is the formal description:

Implicit Causal Models with Latent Confounders

Previously, the global structure was assumed to be observed. Next, the case where it is unobserved is considered.

Causal Inference with a Latent Confounder

As before, the quantity of interest is the causal effect $p(y|do(x_m), x_{-m})$. Here, the SNPs other than $x_m$ are also taken into account. However, the effect is confounded by the unobserved confounder $z_n$. As a result, the standard inference method cannot be used in this case.
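The effect of an unobserved confounder can be illustrated with a toy simulation. In the sketch below, a hypothetical binary subpopulation label $z$ shifts both the allele frequency of a SNP and the mean of the trait, while the SNP has no causal effect on the trait at all. All numbers are made up for illustration.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)
xs, ys, zs = [], [], []
for _ in range(5000):
    z = rng.choice([0, 1])                          # latent subpopulation (assumed)
    freq = 0.1 if z == 0 else 0.6                   # allele frequency depends on z
    x = sum(rng.random() < freq for _ in range(2))  # SNP coded 0, 1, or 2
    y = 1.5 * z + rng.gauss(0.0, 1.0)               # trait depends on z only
    xs.append(x)
    ys.append(y)
    zs.append(z)

marginal = pearson(xs, ys)
within = [
    pearson([x for x, z in zip(xs, zs) if z == g],
            [y for y, z in zip(ys, zs) if z == g])
    for g in (0, 1)
]
print(round(marginal, 2), [round(c, 2) for c in within])
```

The marginal correlation between SNP and trait is clearly positive even though there is no causal arrow, while the correlation within each subpopulation is near zero. Adjusting for $z$ removes the spurious association, which is why the confounder has to enter the model.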

The paper proposes a new method that includes the latent confounders. For each subject $n=1,…,N$ and each SNP $m=1,…,M$,

The mechanism for latent confounder $z_n$ is assumed to be known. SNPs depend on the confounders and the trait depends on all the SNPs and the confounders as well.

The posterior of $\theta$ needs to be calculated in order to estimate the mechanism $g_y$ as well as the causal effect $p(y|do(x_m), x_{-m})$, which explains how changes to each SNP $X_m$ cause changes to the trait $Y$.

Note that the latent structure $p(z|x, y)$ is assumed known.

In general, causal inference with latent confounders can be dangerous: it uses the data twice, and thus it may bias the estimates of each arrow $X_m → Y$. Why is this justified? This is answered below:

Proposition 1. Assume the causal graph of Figure 2 (left) is correct and that the true distribution resides in some configuration of the parameters of the causal model (Figure 2 (right)). Then the posterior $p(θ | x, y)$ provides a consistent estimator of the causal mechanism $f_y$.

Proposition 1 rigorizes previous methods in the framework of probabilistic causal models. The intuition is that as more SNPs arrive (“M → ∞, N fixed”), the posterior concentrates at the true confounders $z_n$, and thus we can estimate the causal mechanism given each data point’s confounder $z_n$. As more data points arrive (“N → ∞, M fixed”), we can estimate the causal mechanism given any confounder $z_n$ as there is an infinity of them.

Implicit Causal Model with a Latent Confounder

This section describes the algorithm and functions for implementing an implicit causal model for GWAS.

Generative Process of Confounders $z_n$.

The confounders are given a standard normal distribution, with $z_n \in \mathbb{R}^K$, where $K$ is the dimension of $z_n$. $K$ should be chosen so that the latent space is as close as possible to the true population structure.

Generative Process of SNPs $x_{nm}$.

Given SNP is coded for,

The authors define a $Binomial(2,\pi_{nm})$ distribution on $x_{nm}$ and use logistic factor analysis to model the SNP matrix.
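The generative process described so far can be sketched as follows. The sketch assumes the usual logistic factor analysis form $\mathrm{logit}(\pi_{nm}) = z_n^\top w_m$ and, purely for illustration, draws the per-SNP factors $w_m$ from a standard normal; the function names and sizes are hypothetical.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sample_snp_matrix(n_individuals, n_snps, k, rng):
    """Sample an N x M genotype matrix from a logistic-factor-style model."""
    # z_n ~ Normal(0, I_K): latent confounder for each individual.
    z = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_individuals)]
    # w_m: per-SNP factors (standard normal here is an assumption).
    w = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_snps)]
    x = []
    for z_n in z:
        row = []
        for w_m in w:
            # logit(pi_nm) = z_n . w_m
            pi_nm = sigmoid(sum(a * b for a, b in zip(z_n, w_m)))
            # x_nm ~ Binomial(2, pi_nm), drawn as two Bernoulli trials.
            row.append(sum(rng.random() < pi_nm for _ in range(2)))
        x.append(row)
    return x

x = sample_snp_matrix(n_individuals=4, n_snps=6, k=3, rng=random.Random(0))
for row in x:
    print(row)
```

Each row is one individual and each column one SNP, with entries in {0, 1, 2}.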

A SNP matrix looks like this:

Since logistic factor analysis makes strong assumptions, this paper suggests using a neural network to relax these assumptions,

This makes the outputs a full $N \times M$ matrix owing to the variables $w_m$, which act like principal components in PCA. Here, $\phi$ has a standard normal prior distribution. The weights $w$ and biases $\phi$ are shared over the $m$ SNPs and $n$ individuals, which makes it possible to learn nonlinear interactions between $z_n$ and $w_m$.

Generative Process of Traits $y_n$.

Previously, each trait was modeled by a linear regression,

This also places very strong assumptions on the SNPs, their interactions, and the additive noise. It can likewise be replaced by a neural network that outputs only a scalar.
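As a rough sketch of the two neural mechanisms, the code below assumes that the SNP network maps the concatenation $[w_m, z_n]$ to the logit of $\pi_{nm}$ and that the trait network maps $[x_n, z_n, \epsilon]$ to a scalar $y_n$. The input layouts, layer widths, dimension of $w_m$, and random (untrained) parameters are all assumptions made for illustration, and batch normalization is omitted.

```python
import math
import random

def dense(h, weights, biases, relu):
    """One fully connected layer, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, h)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def init(sizes, rng):
    """Random parameters for a network with the given layer sizes."""
    return [([[rng.gauss(0.0, 0.5) for _ in range(i)] for _ in range(o)],
             [0.0] * o)
            for i, o in zip(sizes[:-1], sizes[1:])]

def net(inputs, params):
    """Two-hidden-layer ReLU network with a scalar linear output."""
    h = list(inputs)
    for idx, (w, b) in enumerate(params):
        h = dense(h, w, b, relu=idx < len(params) - 1)
    return h[0]

rng = random.Random(0)
K, M = 2, 5
phi = init([2 * K, 8, 8, 1], rng)        # SNP network, shared over all n and m
theta = init([M + K + 1, 8, 8, 1], rng)  # trait network

z_n = [rng.gauss(0.0, 1.0) for _ in range(K)]                     # confounder
w = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(M)]   # per-SNP factors

# SNPs: the logit of pi_nm is a nonlinear function of [w_m, z_n].
x_n = []
for w_m in w:
    pi_nm = 1.0 / (1.0 + math.exp(-net(w_m + z_n, phi)))
    x_n.append(sum(rng.random() < pi_nm for _ in range(2)))  # Binomial(2, pi_nm)

# Trait: the noise eps enters as an input rather than as an additive term.
eps = rng.gauss(0.0, 1.0)
y_n = net(x_n + z_n + [eps], theta)
print(x_n, y_n)
```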

Likelihood-free Variational Inference

Calculating the posterior of $\theta$ is the key to applying the implicit causal model with latent confounders.

can be reduced to

However, with implicit models, integrating over a nonlinear function can be intractable. The authors therefore apply likelihood-free variational inference (LFVI). LFVI posits a family of distributions over the latent variables. Here the variables $w_m$ and $z_n$ are all assumed to be Normal.
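A Normal variational family is convenient because it is easy to sample from. The sketch below draws from a fully factorized Normal using the reparameterization $\text{value} = \mu + \sigma \cdot \text{noise}$; this sampling trick and the parameter values are assumptions chosen for illustration and cover only the sampling step, not the full LFVI procedure.

```python
import math
import random

def sample_normal_q(mu, log_sigma, rng):
    """Reparameterized draw from a factorized Normal: mu + sigma * noise."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0)
            for m, s in zip(mu, log_sigma)]

rng = random.Random(0)
mu = [0.5, -1.0]                             # hypothetical variational means
log_sigma = [math.log(0.1), math.log(0.2)]   # hypothetical log standard deviations

draws = [sample_normal_q(mu, log_sigma, rng) for _ in range(20000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(2)]
print([round(m, 2) for m in means])  # close to the variational means
```

Storing the log of the standard deviation keeps $\sigma$ positive without constraining the parameter during optimization.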

For LFVI applied to GWAS, an algorithm similar to the EM algorithm is used:

Empirical Study

The authors performed simulations with 100,000 SNPs and 940 to 5,000 individuals, across 100 replications of 11 settings. Four methods were compared:

• implicit causal model (ICM);
• PCA with linear regression (PCA);
• a linear mixed model (LMM);
• logistic factor analysis with inverse regression (GCAT).

The feedforward neural networks for traits and SNPs are fully connected, with two hidden layers, ReLU activation functions, and batch normalization.

Simulation Study

Based on real genomic data, a true model is applied to generate the SNPs and traits for each configuration. There are four datasets used in this simulation study:

1. HapMap [Balding-Nichols model]
2. 1000 Genomes Project (TGP) [PCA]
3. Human Genome Diversity Project (HGDP) [PCA] and [Pritchard-Stephens-Donnelly model]
4. A latent spatial position of individuals for population structure [spatial]

The table shows the prediction accuracy. Accuracy is calculated as the number of true positives divided by the number of true positives plus false positives. True positives are SNPs correctly identified as having a causal relation with the trait. False positives, in contrast, are SNPs identified as having a causal relation with the trait when they do not. The closer the rate is to 1, the better the model, since false positives are wrong predictions.
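This rate is commonly called precision. A minimal sketch of the calculation, with made-up SNP indices and a hypothetical function name, is:

```python
def precision(selected, causal):
    """True positives divided by (true positives + false positives)."""
    selected, causal = set(selected), set(causal)
    tp = len(selected & causal)   # selected SNPs that are truly causal
    fp = len(selected - causal)   # selected SNPs that are not causal
    return tp / (tp + fp) if (tp + fp) else 0.0

# SNPs 1 and 4 are true positives; SNPs 7 and 9 are false positives.
print(precision(selected=[1, 4, 7, 9], causal=[1, 4, 8]))  # 0.5
```

Note that a causal SNP the method fails to select (SNP 8 above) does not affect this rate.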

The results above show that the implicit causal model performs best among the four models in every setting. In particular, the other models tend to do poorly on PSD and Spatial when $a$ is small, whereas the ICM achieves a significantly higher rate. The only method comparable to ICM is GCAT, and only on the simpler configurations.

Real-data Analysis

They also applied ICM to a GWAS of the Northern Finland Birth Cohorts, which measure 10 metabolic traits and contain 324,160 SNPs and 5,027 individuals. The data came from the database of Genotypes and Phenotypes (dbGaP), and the same preprocessing as Song et al. was used. Ten implicit causal models were fitted, one for each trait. For each of the 10 models, the dimension of the confounders was set to six, the same as in the paper by Song et al. The SNP network used 512 hidden units in both layers and the trait network used 32 and 256. Results are compared with those reported by Song et al. for comparable models in Table 2.

The numbers in the above table are the number of significant loci for each of the 10 traits. The numbers for the other methods, namely GCAT, LMM, PCA, and "uncorrected" (association tests without accounting for hidden relatedness of study samples), are taken from other papers. By comparison, the ICM reached the level of the best previous model for each trait.

Conclusion

This paper introduced implicit causal models to account for complex nonlinear causal relationships and applied the method to GWAS. The model can capture important interactions between genes within an individual and at the population level, and it can also adjust for latent confounders by incorporating the latent variables into the model.

In the simulation study, the authors showed that the implicit causal model outperformed other methods by 15-45.3% across a variety of datasets and parameter settings.

The authors also believe this GWAS application is only the start for implicit causal models. They suggest that the approach might also be used successfully in the design of dynamic theories in high-energy physics or for modeling discrete choices in economics.

Critique

This paper is an interesting and novel work. Its main contribution is to connect statistical genetics with machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.

The neural network used in this paper is a very simple feed-forward network with two hidden layers, but the idea of where to use the neural network is crucial and might be significant in GWAS.

It has limitations as well. The empirical example in this paper is too easy and far from a realistic situation. Although the simulation study shows competitive results, the application to the Northern Finland Birth Cohort data did not demonstrate an advantage of the implicit causal model over previous methods such as GCAT or LMM.

Another limitation concerns linkage disequilibrium, as the authors themselves state. SNPs are not completely independent of each other; alleles at nearby loci are usually correlated. The authors did not consider this more complex case and instead considered only the simplest case, in which all SNPs are assumed to be independent.

Furthermore, a single SNP may not have enough power to explain the causal relationship. Recent papers indicate that causation of a trait may involve multiple SNPs. This could be addressed in future work as well.