A universal SNP and small-indel variant caller using deep neural networks

From statwiki
Revision as of 13:36, 28 November 2020 by Iaoellme (talk | contribs)
Jump to: navigation, search


Biological functions are determined by genes, and differences in function are determined by mutants, or alleles, of those genes. Determining novel alleles is very important in understanding the genetic variation within a species. For example, most eye colours are determined by different alleles of the gene OCA2. All animals receive one copy of each gene from each of their parents. Mutations of a gene are classified as either homozygous (both copies are the same) or heterozygous (the two copies are different).

Next generation sequencing is a very popular technique for sequencing, or reading, DNA. Since all genes are encoded as DNA, sequencing is an essential tool for understanding genes. Next generation sequencing works by reading short sections of DNA of length k, called k-mers, and then piecing them together or aligning them to a reference genome. Next generation sequencing is relatively fast and inexpensive, although can randomly misidentify some nucleotides, introducing errors.

The process of variant calling is determining novel alleles from sequencing data (typically next generation sequencing data). Some significant alleles only differ from the “standard” version of a gene by only a single base pair, such as the mutation which causes multiple sclerosis. Therefore it is important to be able to accurately call single nucleotide swaps/polymorphisms (SNPs), insertions, and deletions (indels). Calling SNPs and small indels is technically challenging, since it requires a program to be able to distinguish between truly novel mutations and errors in the sequencing data.

This paper aims to solve the problem of calling SNPs and small indels using a convolutional neural net by casting the reads as images, and classifying whether or not they contain a mutation.


In Figure 1, DeepVariant workflow overview is illustrated.

Figure 1. In all panels, blue boxes represent data and red boxes are processes.

Initially, the NGS reads aligned to a reference genome are scanned for candidate variants which are different sites from the reference genome. The read and reference data are encoded as an image for each candidate variant site. Then, trained CNN can compute the genotype likelihoods, (heterozygous or homozygous) for each of the candidate variants (figure1, left box). To train the CNN for image classification purposes, the DeepVariant machinery makes pileup images for a labeled sample with known genotypes. These labeled images and known genotypes are provided to CNN for training, and stochastic gradient descent algorithm is used to optimize the CNN parameters to maximize genotype prediction accuracy. After the convergence of the model, the final model is frozen to use for calling mutations for other image classification tests (figure1, middle box). For example, in figure 1 (right box), the reference and read bases are encoded into a pileup image at a candidate variant site. CNN using this encoded image computes the genotype likelihoods for the three diploid genotype states of homozygous reference (hom-ref), heterozygous (het) or homozygous alternate (hom-alt). In this example a heterozygous variant call is emitted, as the most probable genotype here is “het”.


Before the sequencing reads can be fed into the classifier, they must be preprocessed. There are many pre-processing steps that are necessary for this algorithm. These steps represent the real novelty in this technique, by transforming the data in a way that allows us to use more common neural network architectures for classification. The preprocessing of the data can be broken into three main phases: the realignment of reads, finding candidate variants, and creating images of the candidate variants.

The realignment of the reads phase of the preprocessing is important in order to ensure the sequences can be properly compared to the reference sequences. First, the sequences are aligned to a reference sequence. Reads that align poorly are grouped with other reads around them to build that section, or haplotype, from scratch. If there is strong evidence that the new version of the haplotype fits the reads well, the reads are re-aligned to it. This process updates the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, a way to represent the alignment of a sequence to a reference, for each read.

Once the reads are properly aligned, the algorithm then proceeds to finding candidate variants, regions in the DNA sequence that may contain variants. It is these candidate variants that will eventually be passed as input to the neural network. To find these, we need to consider each position in the reference sequence independently. Any unusable reads are filtered at this point. This includes reads that are not aligned properly, ones that are marked as duplicates, those that fail vendor quality checks, or whose mapping quality is less than ten. For each site in the genome, we collect all the remaining reads that overlap that site. The corresponding allele aligned to that site is then determined by decoding the CIGAR string, that was updated in the realignment phase, of each read. The alleles are then classified into one of four categories: reference-matching base, reference-mismatching base, insertion with a specific sequence, or deletion with a specific length, and the number of occurrences of each distinct allele across all reads is counted. Read bases are only included as potential alleles if each base in the allele has a quality score of at least 10.

With candidate variants identified, the last phase of pre-processing is to convert these candidate variants into images representing the data. This allows for the use of well established convolutional neural networks for image classification for this specialized problem. Each colour channel is used to store a different piece of information about a candidate variant. The red channel encodes which base we have (A, G, C, or T), by mapping each base to a particular value. The quality of the read is mapped to the green colour channel. And finally, the blue channel encodes whether or not the reference is on the positive strand of the DNA. Each row of the image represents a read, and each column represents a particular base in that read. The reference strand is repeated for the first five rows of the encoded image, in order to maintain its information after a 5x5 convolution is applied. With the data preprocessing complete, the images can then be passed into the neural network for classification.

Neural Network

The neural network used is a convolutional neural network. Although the full network architecture is not revealed in the paper, there are several details which we can discuss. The architecture of the network is an input layer attached to an adapted Inception v2 ImageNet model with nine partitions. The input layer takes as input the images representing the candidate variants and rescales them to 299x299 pixels. The output layer is a three class Softmax layer initialized with Gaussian random weights with standard deviation 0.001. This final layer is fully connected to the previous layer. The three classes are: homozygous reference (meaning it is not a variant), heterozygous variant, and homozygous variant. The candidate variant is classified into the class with the highest probability. The model is trained using stochastic gradient descent with a weight decay of 0.00004. The training was done in mini-batches, each with 32 images, using a root mean squared (RMS) decay of 0.9. For the multiple sequencing technologies experiments, a single model was trained with a learning rate of 0.0015 and momentum 0.8 for 250,000 update steps. For all other experiments, multiple models were trained, and the one with the highest accuracy on the training set was chosen as the final model. The multiple models stem from using each combination of the possible parameter values for the learning rate (0.00095, 0.001, 0.0015) and momentum (0.8, 0.85, 0.9). These models were trained for 80 hours, or until the training accuracy converged.


DeepVariant was trained using data available from the CEPH (Centre d’Etude du Polymorphisme Humain) female sample NA12878 and was evaluated on the unseen Ashkenazi male sample NA24385. The results were compared with other most commonly used bioinformatics methods, such as the GATK, FreeBayes22, SAMtools23, 16GT24 and Strelka25 (Table 1). For better comparison, the overall accuracy (F1), recall, precision, and numbers of true positives (TP), false negatives (FN) and false positives (FP) are illustrated over the whole genome.

DeepVariant showed highest accuracy and more than 50% fewer errors per genome compared to the next best algorithm.

They also evaluated the same set of algorithms using the synthetic diploid sample CHM1-CHM1326 (Table 2). Results illustrated that, DeepVariant method outperformed all other algorithms for variant calling (SNP and indel) and showed the highest accuracy in terms of F1, Recall, precision and TP.


DeepVariant’s strong performance on human data proves that deep learning is a promising technique for variant calling. Perhaps the most exciting feature of DeepVariant is its simplicity. Unlike other state of the art variant callers, DeepVariant has no knowledge of the sequencing technologies that create the reads, or even the biological processes that introduce mutations. This simplifies the problem of variant calling to preprocessing the reads and training a generic deep learning model. It also suggests that DeepVariant could be significantly improved by tailoring the preprocessing to specific sequencing technologies and/or developing a dedicated CNN architecture for the reads, rather than trying to cast them as images.