stat946w18/Implicit Causal Models for Genome-wide Association Studies
Presented by
1. Dongyang Yang
Introduction and Related Work
Probabilistic models have progressed to the point where rich generative models can be built. These models have been extended with neural architectures, implicit densities, and scalable algorithms for Bayesian inference. However, most of them focus on capturing statistical relationships rather than causal relationships. Causal models tell us how manipulating the generative process would change the final results.
Genome-wide association studies (GWAS) are an example of a causal problem. Specifically, GWAS aim to determine how genetic factors cause disease in humans. The genetic factors referred to here are single nucleotide polymorphisms (SNPs), and having a particular disease is treated as a trait, i.e., the outcome. To understand why a disease develops and how to cure it, we look at the causal relationship between SNPs and diseases: first, predict which SNP or SNPs cause the disease; second, target the selected SNPs to cure the disease.
This paper addresses two questions. The first is how to build rich causal models that meet the specific needs of GWAS. In general, probabilistic causal models involve a function f and a noise term n. For simplicity, f is usually assumed to be a linear model with Gaussian noise. However, evidence shows that models for GWAS need to accommodate nonlinearity and interactions between multiple genes.
The second is how to handle latent confounders. Latent confounders are a problem when applying causal models because we can neither observe them nor know their underlying structure. The paper develops implicit causal models that can adjust for confounders.
There has been growing work on causal models that focuses on causal discovery and typically makes strong assumptions, such as Gaussian processes on the noise variable or on the nonlinearities of the main function.
Implicit Causal Models
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.
Probabilistic Causal Models
Probabilistic causal models have two parts: deterministic functions of noise and other variables. Consider a global variable $\beta$ and noise $\epsilon$, where [Equation 1 - beta] Each of $\beta$ and $x$ is a function of noise, and $y$ is a function of noise and $x$. [Equation 1]
The target is the causal mechanism $f_y$, from which the causal effect $p(y|do(X=x),\beta)$ can be calculated. $do(X=x)$ means that we set $X$ to a specific value under the fixed structure $\beta$. Following prior work, it is assumed that $p(y|do(x),\beta) = p(y|x,\beta)$. [figure 1]
An example is the additive noise model. [equation 2 – function y] $f(\cdot)$ is usually a linear function, or a spline function for nonlinearities. $\epsilon$ is assumed to be standard normal, and $y$ is normal as well. The posterior $p(\theta | x, y, \beta)$ can then be represented as [equation 2] where $p(\theta)$ is a known prior. Variational inference or MCMC can then be applied to calculate the posterior distribution.
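As a rough illustration of the additive noise model and the do-operator, the sketch below simulates $y = f(x) + \epsilon$ with a linear $f$ and standard normal noise. The function name `simulate_additive_noise`, the `slope` parameter, and the choice of a standard normal for $x$ are assumptions made for this example and are not taken from the paper.

```python
import random

def simulate_additive_noise(n, slope=2.0, do_x=None, seed=0):
    """Draw n (x, y) pairs from y = slope * x + eps, with eps ~ Normal(0, 1).

    If do_x is given, x is fixed to that value for every draw (an intervention)
    instead of being generated from its own noise.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Hypothetical choice: without intervention, x is standard normal noise.
        x = rng.gauss(0.0, 1.0) if do_x is None else do_x
        eps = rng.gauss(0.0, 1.0)  # additive standard normal noise
        pairs.append((x, slope * x + eps))
    return pairs

# Under do(X = 1) the average of y should be close to slope * 1.
samples = simulate_additive_noise(20000, slope=2.0, do_x=1.0)
mean_y = sum(y for _, y in samples) / len(samples)
```

Because nothing confounds $x$ and $y$ in this toy setup, intervening on $x$ and conditioning on $x$ give the same distribution for $y$, which is the assumption stated above.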
Implicit Causal Models
The difference between implicit causal models and probabilistic causal models lies in how the noise variable is treated. Instead of adding a noise term, implicit causal models feed the noise $\epsilon$ directly into a neural network, which outputs $x$.
The causal diagram becomes: [figure 2]
They use a fully connected neural network with a sufficient number of hidden units to approximate each causal mechanism. [theorem]
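A minimal sketch of this idea, assuming a one-hidden-layer ReLU network with fixed random weights: the noise is passed in as an extra input rather than added to the output. The helper names `make_mechanism` and `sample_x`, the layer sizes, and the weight initialization are all hypothetical and only serve the illustration.

```python
import random

def make_mechanism(n_inputs, n_hidden, seed=0):
    """Build a one-hidden-layer ReLU network with fixed random weights.

    The network maps a list of n_inputs numbers (parents plus noise) to a scalar.
    """
    rng = random.Random(seed)
    w_hidden = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    w_out = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]

    def mechanism(inputs):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, inputs)))
                  for row in w_hidden]
        return sum(w * h for w, h in zip(w_out, hidden))

    return mechanism

def sample_x(mechanism, beta, rng):
    """Draw x by passing the parent beta and fresh noise through the network."""
    eps = rng.gauss(0.0, 1.0)  # noise is a network input, not an additive term
    return mechanism([beta, eps])

g_x = make_mechanism(n_inputs=2, n_hidden=8)
rng = random.Random(1)
draws = [sample_x(g_x, beta=0.3, rng=rng) for _ in range(5)]
```

The distribution of the draws is defined only through sampling, with no closed-form density, which is what makes the model implicit.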
Implicit Causal Models with Latent Confounders
Previously, the global structure was assumed to be observed. Next, the unobserved case is considered.
Causal Inference with a Latent Confounder
As before, the interest is the causal effect $p(y|do(x_m), x_{-m})$. Here, the SNPs other than $x_m$ are also taken into account. However, the effect is confounded by the unobserved confounder $z_n$. As a result, standard inference methods cannot be used in this case.
The paper proposes a new method that includes the latent confounders. For each subject $n=1,\dots,N$ and each SNP $m=1,\dots,M$, [equation 4]
The mechanism for the latent confounder $z_n$ is assumed to be known. The SNPs depend on the confounders, and the trait depends on all the SNPs as well as the confounders.
The posterior of $\theta$ must be calculated in order to estimate the mechanism $g_y$ and the causal effect $p(y|do(x_m), x_{-m})$, which explains how changes to each SNP $X_m$ cause changes to the trait $Y$. [equation 5]
Note that the latent structure $p(z|x,y)$ is assumed to be known.
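The dependency structure described in this subsection can be written out as a sketch. The mechanism names $g_z$ and $g_{x_m}$ and the noise subscripts are notation assumed here for illustration; only $g_y$, $\theta$, $z_n$, and the dependencies themselves come from the text above, and the paper's own equation may differ in detail.

```latex
\begin{aligned}
z_n    &= g_z(\epsilon_{z_n}), \\
x_{nm} &= g_{x_m}(\epsilon_{x_{nm}} \mid z_n), \\
y_n    &= g_y(\epsilon_{y_n} \mid x_{n,1:M},\, z_n,\, \theta).
\end{aligned}
```

Read top to bottom: the confounder is generated from noise alone, each SNP from noise and the confounder, and the trait from noise, all $M$ SNPs, and the confounder.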
Implicit Causal Model with a Latent Confounder
This section describes the algorithm and functions for implementing an implicit causal model for GWAS.
Generative Process of Confounders $z_n$. The distribution of the confounders is set to a standard normal. $z_n \in R^K$, where $K$ is the dimension of $z_n$, and $K$ should be chosen so that the latent space is as close as possible to the true population structure.
Generative Process of SNPs $x_{nm}$. Since each SNP is coded as 0 (no major allele), 1 (one major allele), or 2 (two major alleles), the authors place a $Binomial(2,\pi_{nm})$ distribution on $x_{nm}$ and use logistic factor analysis to model the SNP matrix. [equation logit \pi]
Since logistic factor analysis makes strong assumptions, the paper suggests using a neural network to relax them, [equation logit \pi NN] The output is a full $N \times M$ matrix thanks to the variables $w_m$, which act like principal components in PCA.
Generative Process of Traits $y_n$. Previously, each trait was modeled by a linear regression, [equation y_n] This also makes very strong assumptions about SNPs, interactions, and additive noise. It can likewise be replaced by a neural network that outputs only a scalar, [equation y_n NN]
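The sketch below simulates SNP counts under a logistic factor analysis model, assuming the usual form in which the logit of $\pi_{nm}$ is the inner product of the confounder $z_n$ and a per-SNP vector $w_m$. That form, the standard normal draws for $w_m$, and the function name `sample_snp_matrix` are assumptions for illustration, not details confirmed by this summary.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sample_snp_matrix(n_individuals, n_snps, k, seed=0):
    """Return (z, w, x) where x[n][m] is a SNP count in {0, 1, 2}."""
    rng = random.Random(seed)
    # Confounders z_n ~ Normal(0, I_K), one K-vector per individual.
    z = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_individuals)]
    # Hypothetical per-SNP factor vectors w_m, also drawn standard normal.
    w = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_snps)]
    x = []
    for z_n in z:
        row = []
        for w_m in w:
            pi_nm = sigmoid(sum(a * b for a, b in zip(z_n, w_m)))
            # Binomial(2, pi) drawn as the sum of two Bernoulli(pi) trials.
            row.append(sum(rng.random() < pi_nm for _ in range(2)))
        x.append(row)
    return z, w, x

z, w, x = sample_snp_matrix(n_individuals=4, n_snps=6, k=2)
```

Individuals with similar $z_n$ get similar allele probabilities across all SNPs, which is how the latent space stands in for population structure.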
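To contrast the two trait models, here is a small sketch assuming the linear model sums SNP and confounder effects plus noise, and the network variant takes the SNPs, the confounder, and the noise together as inputs. The function names, the `gamma` confounder coefficients, and the one-hidden-layer network are hypothetical choices for this example.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def trait_linear(x_n, z_n, beta, gamma, eps):
    """Linear trait model: additive SNP effects, confounder effects, and noise."""
    return dot(x_n, beta) + dot(z_n, gamma) + eps

def trait_network(x_n, z_n, eps, w_hidden, w_out):
    """Network trait model: SNPs, confounders, and noise are all inputs.

    One ReLU hidden layer followed by a linear layer that outputs a scalar.
    """
    inputs = list(x_n) + list(z_n) + [eps]
    hidden = [max(0.0, dot(row, inputs)) for row in w_hidden]
    return dot(w_out, hidden)

# Two SNPs and a one-dimensional confounder for a single individual.
y_lin = trait_linear([1, 2], [0.5], beta=[0.1, 0.2], gamma=[1.0], eps=0.0)
y_net = trait_network([1, 2], [0.5], eps=-1.0,
                      w_hidden=[[1, 0, 0, 0], [0, 0, 0, 1]],
                      w_out=[2.0, 3.0])
```

In the network version, SNPs can interact with each other and with the noise inside the hidden layer, which the linear model cannot express.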
Likelihood-free Variational Inference
Calculating the posterior of $\theta$ is the key to applying the implicit causal model with latent confounders. [eq 5] can be reduced to [eq pg6 4]
However, with implicit models, integrating over a nonlinear function is difficult. The authors therefore apply likelihood-free variational inference (LFVI). LFVI posits a family of distributions over the latent variables. Here the variables $w_m$ and $z_n$ are all assumed to be Normal. [eq pg 7]
For LFVI applied to GWAS, an algorithm similar to the EM algorithm is used: [EM algorithm]
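As a small illustration of a Normal variational family, the sketch below draws a latent vector as mean plus standard deviation times standard normal noise. This reparameterized form, the log-standard-deviation parameterization, and the name `sample_normal_q` are assumptions for the example; the summary only states that the family is Normal.

```python
import math
import random

def sample_normal_q(mean, log_std, rng):
    """Draw one vector from a diagonal Normal: mean + std * standard normal noise."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0)
            for m, s in zip(mean, log_std)]

rng = random.Random(0)
q_z_mean = [0.5, -1.0]      # hypothetical variational mean for one z_n (K = 2)
q_z_log_std = [-2.0, -2.0]  # hypothetical log standard deviations
z_draw = sample_normal_q(q_z_mean, q_z_log_std, rng)
```

Writing the draw this way keeps the sample a smooth function of the variational parameters, which gradient-based inference typically relies on.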
Empirical Study
The authors performed simulations with 100,000 SNPs and 940 to 5,000 individuals, across 100 replications of 11 settings. Four methods were compared: implicit causal model (ICM); PCA with linear regression (PCA); a linear mixed model (LMM); and logistic factor analysis with inverse regression (GCAT). The feedforward neural networks for traits and SNPs are fully connected with two hidden layers, using the ReLU activation function and batch normalization.
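The architecture described here can be sketched as a forward pass with two hidden layers, each followed by batch normalization and ReLU. The ordering of batch normalization before ReLU, the omission of learnable scale and shift parameters, the layer sizes, and the weights below are all assumptions for illustration; the paper's exact configuration is not given in this summary.

```python
import math

def dense(batch, weights, biases):
    """Apply a fully connected layer to every row of the batch."""
    return [[sum(w * v for w, v in zip(w_row, row)) + b
             for w_row, b in zip(weights, biases)] for row in batch]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance across the batch."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dims)] for row in batch]

def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]

def forward(batch, layers):
    """Two hidden layers (dense, batch norm, ReLU), then a scalar output layer."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    h = relu(batch_norm(dense(batch, w1, b1)))
    h = relu(batch_norm(dense(h, w2, b2)))
    return [row[0] for row in dense(h, w3, b3)]

# Hypothetical tiny network: 2 inputs, two hidden layers of width 2, 1 output.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
batch = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
outputs = forward(batch, layers)
```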
Simulation Study
Real-data Analysis
The authors also apply the model to a real-world GWAS of the Northern Finland Birth Cohorts. The model captures real causal relationships, identifying SNPs similar to those found by the previous state of the art.