stat946w18/Implicit Causal Models for Genome-wide Association Studies

From statwiki
Revision as of 16:18, 14 March 2018 by D39yang (talk | contribs)

Introduction and Motivation

Probabilistic modeling has progressed toward increasingly rich generative models. These models have been extended with neural networks and implicit densities, and with scalable algorithms that make Bayesian inference feasible on very large data sets. However, most of these models focus on capturing statistical relationships rather than causal relationships. Causal models tell us how manipulating the generative process would change the final results.

Genome-wide association studies (GWAS) are an example of a causal problem. Specifically, GWAS aims to determine how genetic factors cause disease in humans. The genetic factors here are single nucleotide polymorphisms (SNPs), and having a particular disease is treated as a trait, i.e., the outcome. To understand why a disease develops and how to cure it, we are interested in the causal relationship between SNPs and diseases: first, predict which SNP or SNPs cause the disease; second, target the selected SNPs to cure the disease.

This paper deals with two questions. The first is how to build rich causal models that meet the specific needs of GWAS. In general, probabilistic causal models involve a function f and a noise term n. For simplicity, f is usually assumed to be a linear model with Gaussian noise. However, previous work has shown that in GWAS it is necessary to accommodate nonlinearity and interactions between multiple genes in the model.

The second contribution of this paper is that it addresses the problem caused by latent confounders. Latent confounders are a problem when applying causal models because we can neither observe them nor know their underlying structure. The authors develop implicit causal models that can adjust for confounders.

There is a growing body of work on causal models that focuses on causal discovery and typically makes strong assumptions, such as Gaussian processes on the noise variable or nonlinearities for the main function.


Implicit Causal Models

Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.

Probabilistic Causal Models

A probabilistic causal model represents each variable as a deterministic function of noise and of other variables. Consider a global variable $\beta$ and noise $\epsilon$, where [Equation 1 - beta] Both $\beta$ and $x$ are functions of noise, and $y$ is a function of noise and $x$. [Equation 1]
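One way to write the structure described above is sketched here. This is an illustration rather than the paper's exact Equation 1: the subscripted noise terms and the dependence of the per-individual variables on the global variable are assumptions made for this sketch.

```latex
\begin{aligned}
\beta &= f_\beta(\epsilon_\beta), \\
x_n   &= f_x(\epsilon_{x,n}, \beta), \\
y_n   &= f_y(\epsilon_{y,n}, x_n, \beta), \qquad n = 1, \dots, N.
\end{aligned}
```

Each line is a deterministic function; all randomness enters through the noise arguments.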

The target is the causal mechanism $f_y$, from which the causal effect $p(y \mid do(X=x), \beta)$ can be calculated. $do(X=x)$ means that we set $X$ to a specific value $x$ under the fixed structure $\beta$. Following earlier work, it is assumed that $p(y \mid do(x), \beta) = p(y \mid x, \beta)$. [figure 1]

An example is the additive noise model. [equation 2 – function y] $f(\cdot)$ is usually a linear function, or a spline function to capture nonlinearities. $\epsilon$ is assumed to be standard normal, which makes $y$ normally distributed as well. Thus the posterior $p(\theta \mid x, y, \beta)$ can be represented as [equation 2] where $p(\theta)$ is the prior, which is known. Variational inference or MCMC can then be applied to calculate the posterior distribution.
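To make the additive noise model and the do-operator concrete, here is a minimal simulation sketch. It assumes a linear mechanism with a single scalar coefficient `theta`; the function name `simulate_additive_noise` and the argument `x_do` are hypothetical names chosen for this illustration and do not come from the paper.

```python
import numpy as np

def simulate_additive_noise(n, theta, rng, x_do=None):
    """Draw (x, y) from y = theta * x + eps with eps ~ N(0, 1).

    If x_do is given, x is fixed to that value for every sample,
    which mimics the intervention do(X = x_do).
    """
    if x_do is None:
        x = rng.normal(size=n)           # observational regime: x is noise-driven
    else:
        x = np.full(n, float(x_do))      # interventional regime: x is set by hand
    eps = rng.normal(size=n)             # standard normal additive noise
    y = theta * x + eps                  # linear mechanism f(x) = theta * x
    return x, y

rng = np.random.default_rng(0)

# Observational data, then a least-squares estimate of the mechanism.
x, y = simulate_additive_noise(10_000, theta=2.0, rng=rng)
theta_hat = (x @ y) / (x @ x)

# Interventional data under do(X = 1): the mean of y should be close to theta.
_, y_do = simulate_additive_noise(10_000, theta=2.0, rng=rng, x_do=1.0)
```

Because there is no confounder in this toy setting, the estimate from observational data and the mean under intervention agree, which is the situation in which conditioning and intervening coincide.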


Implicit Causal Models

The difference between implicit causal models and the probabilistic causal models above lies in how the noise variable is used. Instead of adding a noise term, implicit causal models feed the noise $\epsilon$ directly into a neural network that outputs $x$.

The causal diagram changes to: [figure 2]

The authors use a fully connected neural network with a sufficiently large number of hidden units to approximate each causal mechanism. [theorem]
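The idea of feeding noise into a network, rather than adding it to the output, can be sketched as follows. The layer widths, the weight scale, and the helper names `init_mlp` and `mlp` are arbitrary choices for this illustration; the weights are random rather than learned.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights for a fully connected network with layer widths `sizes`."""
    return [(rng.normal(scale=0.5, size=(m, k)), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
n = 5

g_x = init_mlp([1, 16, 16, 1], rng)      # mechanism for x: input is noise only
g_y = init_mlp([2, 16, 16, 1], rng)      # mechanism for y: inputs are x and noise

eps_x = rng.normal(size=(n, 1))
eps_y = rng.normal(size=(n, 1))

x = mlp(g_x, eps_x)                      # x = g_x(eps_x)
y = mlp(g_y, np.hstack([x, eps_y]))      # y = g_y(x, eps_y), noise is an input
```

Since the noise passes through the nonlinearities, the induced distribution of the output has no closed-form density in general, which is what makes the model implicit.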


Implicit Causal Models with Latent Confounders

So far, the global structure was assumed to be observed. Next, the case where it is unobserved is considered.

Causal Inference with a Latent Confounder

As before, the quantity of interest is the causal effect $p(y \mid do(x_m), x_{-m})$. Here, the SNPs other than $x_m$ are also taken into account. However, the effect is confounded by the unobserved confounder $z_n$. As a result, the standard inference method cannot be used in this case.

The paper proposes a new method that includes the latent confounders. For each subject $n = 1, \dots, N$ and each SNP $m = 1, \dots, M$, [equation 4]


The mechanism for the latent confounder $z_n$ is assumed to be known. The SNPs depend on the confounders, and the trait depends on all the SNPs as well as on the confounders.

The posterior of $\theta$ must be computed in order to estimate the mechanism $g_y$ as well as the causal effect $p(y \mid do(x_m), x_{-m})$, and thereby explain how changes to each SNP $X_m$ cause changes to the trait $Y$. [equation 5]

Note that the latent structure $p(z \mid x, y)$ is assumed to be known.


Implicit Causal Model with a Latent Confounder

This section describes the algorithm and functions used to implement an implicit causal model for GWAS.

Generative Process of Confounders $z_n$. The distribution of the confounders is set to a standard normal, $z_n \in \mathbb{R}^K$, where $K$ is the dimension of $z_n$. $K$ should be chosen so that the latent space is as close as possible to the true population structure.

Generative Process of SNPs $x_{nm}$. Since each SNP is coded as 0 (no major allele), 1 (one major allele), or 2 (two major alleles), the authors place a $\mathrm{Binomial}(2, \pi_{nm})$ distribution on $x_{nm}$ and use logistic factor analysis to model the SNP matrix. [equation logit \pi]
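The generative process for confounders and SNPs can be sketched in a few lines. This sketch assumes the usual logistic factor analysis form, in which the logit of the allele probability is the inner product of the individual's confounder vector and a per-SNP loading vector; the sizes and the standard normal draw for the loadings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 200, 50, 3                     # individuals, SNPs, latent dimension (illustrative)

z = rng.normal(size=(N, K))              # confounders z_n ~ N(0, I)
w = rng.normal(size=(M, K))              # per-SNP loadings w_m (assumed standard normal here)

logits = z @ w.T                         # logit(pi_nm) = z_n . w_m, shape (N, M)
pi = 1.0 / (1.0 + np.exp(-logits))       # inverse logit gives allele probabilities
x = rng.binomial(2, pi)                  # x_nm ~ Binomial(2, pi_nm), values in {0, 1, 2}
```

Individuals with similar confounder vectors get similar allele probabilities across SNPs, which is how population structure enters the SNP matrix.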

Since logistic factor analysis makes strong assumptions, the paper suggests using a neural network to relax them, [equation logit \pi NN] Because of the variables $w_m$, which act like principal components in PCA, the output is a full $N \times M$ matrix.
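A possible way to replace the inner product with a network is to feed every pair of confounder vector and per-SNP vector through a small MLP. The pairing scheme, layer widths, and random weights below are assumptions of this sketch, not details confirmed by the summary above.

```python
import numpy as np

def mlp(params, h):
    """Forward pass: ReLU on hidden layers, linear output layer."""
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
N, M, K, H = 20, 10, 3, 8                # illustrative sizes

z = rng.normal(size=(N, K))              # confounders z_n
w = rng.normal(size=(M, K))              # per-SNP variables w_m

sizes = [2 * K, H, H, 1]
phi = [(rng.normal(scale=0.5, size=(a, b)), np.zeros(b))
       for a, b in zip(sizes[:-1], sizes[1:])]

# Pair every individual with every SNP: row n * M + m holds [z_n, w_m].
pairs = np.concatenate([np.repeat(z, M, axis=0), np.tile(w, (N, 1))], axis=1)

logits = mlp(phi, pairs).reshape(N, M)   # one logit per (individual, SNP) pair
pi = 1.0 / (1.0 + np.exp(-logits))
x = rng.binomial(2, pi)                  # full N x M SNP matrix
```

The same network is shared across all pairs, so the number of parameters does not grow with the number of SNPs beyond the per-SNP vectors themselves.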

Generative Process of Traits $y_n$. Traditionally, each trait is modeled by a linear regression, [equation y_n] This also makes very strong assumptions about the SNPs, their interactions, and additive noise. It can likewise be replaced by a neural network that outputs a single scalar, [equation y_n NN]
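The contrast between the linear trait model and a network-based one can be sketched as below. The coefficient names `beta` and `gamma`, the stand-in SNP matrix, and the layer widths are hypothetical choices for this illustration; the weights are random rather than fitted.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

rng = np.random.default_rng(0)
N, M, K, H = 100, 20, 3, 16              # illustrative sizes

x = rng.binomial(2, 0.3, size=(N, M)).astype(float)  # stand-in SNP matrix
z = rng.normal(size=(N, K))              # confounders
eps = rng.normal(size=(N, 1))            # trait noise

# Linear baseline: additive effects of SNPs and confounders plus additive noise.
beta = rng.normal(size=M)
gamma = rng.normal(size=K)
y_linear = x @ beta + z @ gamma + eps[:, 0]

# Network alternative: SNPs, confounders, and noise are all inputs,
# so interactions and non-additive noise are possible.
inputs = np.hstack([x, z, eps])          # shape (N, M + K + 1)
W1 = rng.normal(scale=0.1, size=(M + K + 1, H))
W2 = rng.normal(scale=0.1, size=(H, H))
W3 = rng.normal(scale=0.1, size=(H, 1))
b1, b2, b3 = np.zeros(H), np.zeros(H), np.zeros(1)

hidden = relu(relu(inputs @ W1 + b1) @ W2 + b2)
y_nn = (hidden @ W3 + b3)[:, 0]          # one scalar trait per individual
```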


Likelihood-free Variational Inference

Calculating the posterior of $\theta$ is the key to applying the implicit causal model with latent confounders. [eq 5] can be reduced to [eq pg6 4]

However, with implicit models, integrating over a nonlinear function can be intractable. The authors therefore apply likelihood-free variational inference (LFVI). LFVI posits a family of distributions over the latent variables. Here the variational distributions for $w_m$ and $z_n$ are all assumed to be Normal. [eq pg 7]
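The Normal variational family mentioned above can be illustrated with a reparameterized sampler. This sketch covers only the variational family, not the LFVI objective or its ratio estimator; the mean-field factorization, the softplus parameterization of the scale, and all names here are assumptions made for the illustration.

```python
import numpy as np

def softplus(a):
    """log(1 + exp(a)), computed stably; keeps the scale positive."""
    return np.logaddexp(0.0, a)

def sample_normal_q(mu, rho, rng):
    """Reparameterized draw from q = Normal(mu, softplus(rho)^2)."""
    noise = rng.normal(size=mu.shape)    # parameter-free standard normal noise
    return mu + softplus(rho) * noise

rng = np.random.default_rng(0)
M, N, K = 4, 6, 2                        # SNPs, individuals, latent dimension (illustrative)

# One independent Normal factor per entry of w and z (mean-field assumption).
q_w = {"mu": np.zeros((M, K)), "rho": np.zeros((M, K))}
q_z = {"mu": np.zeros((N, K)), "rho": np.zeros((N, K))}

w_sample = sample_normal_q(q_w["mu"], q_w["rho"], rng)
z_sample = sample_normal_q(q_z["mu"], q_z["rho"], rng)
```

Writing the draw as a deterministic function of the parameters and independent noise is what allows gradients to flow to `mu` and `rho` in a gradient-based variational method.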

For LFVI applied to GWAS, an algorithm similar to the EM algorithm is used: [EM algorithm]


Empirical Study

The authors performed simulations with 100,000 SNPs and 940 to 5,000 individuals, across 100 replications of 11 settings. Four methods were compared:
- implicit causal model (ICM);
- PCA with linear regression (PCA);
- a linear mixed model (LMM);
- logistic factor analysis with inverse regression (GCAT).

The feedforward neural networks for traits and SNPs are fully connected, with two hidden layers, ReLU activation functions, and batch normalization.

Simulation Study

Based on real genomic data, a true model is used to generate the SNPs and traits for each configuration. The following settings are used in this simulation study:
- HapMap [Balding-Nichols model]
- 1000 Genomes Project (TGP) [PCA]
- Human Genome Diversity Project (HGDP) [PCA]
- HGDP [Pritchard-Stephens-Donnelly model]
- A latent spatial position of individuals for population structure []

The table shows the prediction accuracy, measured as precision: the number of true positives divided by the number of true positives plus false positives. True positives are SNPs that are correctly identified as having a causal relation with the trait. False positives, in contrast, are SNPs flagged as having a causal relation with the trait when they do not. The closer the rate is to 1, the better the model, since false positives count as wrong predictions.
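The rate described above can be computed as in the following sketch. The function name `precision` and the toy indicator vectors are hypothetical and are not data from the paper.

```python
import numpy as np

def precision(predicted, causal):
    """True positives divided by (true positives + false positives).

    `predicted` marks SNPs the method flags as causal; `causal` marks
    SNPs that are truly causal. Returns NaN if nothing was flagged.
    """
    predicted = np.asarray(predicted, dtype=bool)
    causal = np.asarray(causal, dtype=bool)
    tp = np.sum(predicted & causal)      # flagged and truly causal
    fp = np.sum(predicted & ~causal)     # flagged but not causal
    if tp + fp == 0:
        return float("nan")
    return tp / (tp + fp)

# Toy example with six SNPs: two correct flags and one incorrect flag.
causal    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0]
rate = precision(predicted, causal)      # 2 / (2 + 1)
```

Note that a truly causal SNP that is missed (the second SNP in the toy example) does not affect this rate; only flagged SNPs enter the calculation.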

[table 1]

The results above show that the implicit causal model performs best among the four methods in every setting. In particular, the other models tend to do poorly on PSD and Spatial when a is small, whereas ICM still achieves a significantly high rate. The only method comparable to ICM is GCAT, and only on the simpler configurations.


Real-data Analysis

The authors also applied ICM to a real-world GWAS of the Northern Finland Birth Cohorts, which contains 324,160 SNPs and 5,027 individuals. Ten implicit causal models were fitted, with two neural networks, each with two hidden layers, used for the SNPs and the trait. [table 2] The numbers in the above table are the number of significant loci for each of the 10 traits. The numbers for the other methods (GCAT, LMM, PCA, and uncorrected) are taken from other papers. By comparison, ICM reached the level of the best previous model for each trait.

Conclusion

This paper introduced implicit causal models to account for complex nonlinear causal relationships, and applied the method to GWAS. The model can capture important interactions between genes, both within an individual and at the population level, and it can also adjust for latent confounders by including the latent variables in the model.

In the simulation study, the authors showed that the implicit causal model outperforms the other methods by 15-45.3% on a variety of datasets and parameter settings.

The authors also believe this GWAS application is only the start of the use of implicit causal models. They could also be used in physics or economics.


Critique

I think this paper is an interesting and novel work. The main contribution of this paper is to create a bridge between the statistical genetics community and the ML community. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.

The neural network used in this paper is a very simple feedforward network with two hidden layers, but the choice of where to use the neural network is crucial and may be significant for GWAS.

It has limitations as well. The empirical example in this paper is too simple and far from a realistic setting. Although the simulation study showed some competitive results, the Northern Finland Birth Cohort application did not demonstrate that the implicit causal model is better than previous methods such as GCAT or LMM.

Another limitation, which the authors also acknowledge, concerns linkage disequilibrium. SNPs are not completely independent of each other; they are usually correlated when their loci are close together. The authors did not consider this more complex case and instead treated only the simplest case, in which all SNPs are assumed to be independent.

Furthermore, a single SNP may not have enough power to explain the causal relationship. Recent papers indicate that the causation of a trait may involve multiple SNPs. This could be future work.

References