# Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin


## Revision as of 18:52, 10 November 2018


# Background

Gene regulation is the process of controlling which genes in a cell's DNA are turned 'on' (expressed) or 'off' (not expressed). When a gene is expressed, a functional product such as a protein is created. Even though all the cells of a multicellular organism (e.g., a human) contain the same DNA, different types of cells in that organism may express very different sets of genes. As a result, each cell type has distinct functionality. In other words, how a cell operates depends on the genes expressed in that cell. Many factors, including ‘chromatin modification marks’, influence which genes are expressed in a cell.

The function of chromatin is to efficiently wrap DNA around histones into a condensed volume that fits into the nucleus of a cell, and to protect the DNA structure and sequence during cell division and replication. Different chemical modifications in the histones of the chromatin, known as histone marks, change the spatial arrangement of the condensed DNA structure, which in turn affects the expression of genes in the region neighboring the histone mark. Histone marks can promote (or obstruct) the turning on of a gene by making the gene region accessible (or restricted). The section of DNA where histone marks can potentially have an impact is known as the DNA flanking region or ‘gene region’, which is considered to cover 10k base pairs centered at the transcription start site (TSS) (i.e., 5k base pairs in each direction). Unlike genetic mutations, histone modifications are reversible [1]. Therefore, understanding the influence of histone marks on gene regulation can assist in developing drugs for genetic diseases.

# Introduction

Advances in genomic technologies now make it possible to profile genome-wide chromatin mark signals. Biologists can therefore measure gene expression and chromatin signals of the ‘gene region’ for different cell types across the whole human genome. The Roadmap Epigenome Project (REMC, publicly available) [2] recently released 2,804 genome-wide datasets from 100 separate “normal” (not diseased) human cells/tissues, among which 166 datasets are gene expression reads and the rest are signal reads of various histone marks. The goal is to understand which histone marks are the most important and how they interact in gene regulation for each cell type.

Signal reads for histone marks are high-dimensional and spatially structured. A histone modification mark can exert its influence anywhere in the gene region (covering 10k bp centered at the TSS). It is therefore important to understand how the impact of a mark on gene expression varies over the gene region, in other words, how histone signals across the gene region affect gene expression. Different types of histone marks in human chromatin can influence gene regulation. Researchers have found five standard histone proteins. These five proteins can be altered in different combinations with different chemical modifications, resulting in a large number of distinct histone modification marks. Different histone modification marks can act as a module, interacting with each other to influence gene expression.

This paper proposes an attention-based deep learning model to find how these chromatin factors (histone modification marks) contribute to gene expression in a particular cell. AttentiveChrome [3] uses a hierarchy of multiple LSTMs to discover interactions between the signals of each histone mark and to learn dependencies among the marks in expressing a gene. The authors include two levels of soft attention: (1) to attend to the most relevant signals of a histone mark, and (2) to attend to the important marks and their interactions. The proposed AttentiveChrome framework contains the following five modules:

- Bin-level LSTM encoder encoding the bin positions of the gene region (one for each HM mark)
- Bin-level [math] \alpha [/math]-Attention across all bin positions (one for each HM mark)
- HM-level LSTM encoder (one encoder encoding all HM marks)
- HM-level [math] \beta [/math]-Attention among all HM marks (one)
- Final classification module
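
The two attention levels in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the LSTM encoders are left out and bin embeddings are passed in directly, and scoring each embedding by a dot product with a context vector is an assumption made here for illustration. The names `attend`, `two_level_attention`, `bin_contexts`, and `hm_context` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def attend(vectors, context):
    """Soft attention: score each vector against a context vector
    (assumed dot-product scoring), normalize the scores with softmax,
    and return (weights, weighted sum of the vectors)."""
    weights = softmax([dot(v, context) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return weights, pooled

def two_level_attention(bin_embeddings, bin_contexts, hm_context):
    """bin_embeddings[j][t] is the embedding of bin t for HM mark j.
    In the full model these would come from a bin-level LSTM encoder;
    here they are given directly (simplifying assumption)."""
    alphas, hm_vectors = [], []
    for embeddings, context in zip(bin_embeddings, bin_contexts):
        alpha, mark_vector = attend(embeddings, context)  # bin-level attention
        alphas.append(alpha)
        hm_vectors.append(mark_vector)
    # The HM-level LSTM encoder is omitted: mark vectors are used as-is.
    beta, gene_vector = attend(hm_vectors, hm_context)  # HM-level attention
    return alphas, beta, gene_vector

# Toy usage: M = 2 marks, T = 3 bins, embedding size 2 (all values made up).
bin_embeddings = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 0.0]],
]
bin_contexts = [[1.0, 0.0], [0.0, 1.0]]
hm_context = [1.0, 1.0]
alphas, beta, gene_vector = two_level_attention(bin_embeddings, bin_contexts, hm_context)
```

Each list in `alphas` sums to 1 over the bins of one mark ("where" the model looks), and `beta` sums to 1 over the marks ("what" it looks at). In the full model, `gene_vector` would feed the final classifier.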

## Main Contributions

The contributions of this work can be summarized as follows:

- More accurate predictions than the state-of-the-art baselines.
- Better interpretation than the state-of-the-art methods for visualizing deep learning models.
- Explains the model's decisions by showing “what” and “where” it has focused.
- First attention-based deep learning method for a problem in molecular biology.

# AttentiveChrome: Formulation of the Model

## Input and Output

Each dataset contains the gene expression labels and the histone signal reads for one specific cell type. The authors evaluated AttentiveChrome on 56 different cell types. For each histone mark, there is a feature (input) vector containing the signal reads surrounding the gene’s TSS position (the gene region). The label of this input denotes the expression of the specific gene. This study considers binary labeling, where [math] +1 [/math] denotes that the gene is expressed (on) and [math] -1 [/math] denotes that the gene is not expressed (off). Each histone mark has one feature vector for each gene. The authors reuse the input features and output labels of their previous work, DeepChrome [4].

The input feature is represented by a matrix [math] X [/math] of size [math] M \times T [/math], where [math] M [/math] is the number of HM marks considered in the input, and [math] T [/math] is the number of bin positions used to represent the gene region. The [math] j^{th} [/math] row of the matrix [math] X [/math], [math] x_j [/math], represents the sequentially structured signals from the [math] j^{th} [/math] HM mark, where [math] j \in \{1, \cdots, M\} [/math]. Therefore, [math] x_j^t [/math] in the matrix [math] X [/math] represents the value of the [math] t^{th} [/math] bin belonging to the [math] j^{th} [/math] HM mark, where [math] t \in \{1, \cdots, T\} [/math]. If the training set contains [math] N_{tr} [/math] labeled pairs, the [math] n^{th} [/math] pair is specified as [math] (X^n, y^n) [/math], where [math] X^n [/math] is a matrix of size [math] M \times T [/math], [math] y^n \in \{ -1, +1 \} [/math] is the binary label, and [math] n \in \{ 1, \cdots, N_{tr} \} [/math].
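
The following sketch shows one way such an [math] M \times T [/math] input matrix and its [math] \pm 1 [/math] labels could be assembled. The text above does not say how bin values or labels are computed, so summing a per-base signal within equal-width bins and thresholding an expression value are assumptions made for illustration. The helper names (`bin_signal`, `build_input_matrix`, `binarize_expression`) and all numbers are hypothetical.

```python
def bin_signal(per_base_signal, num_bins):
    """Sum a per-base signal track into num_bins equal-width bins
    (summing within a bin is an assumption, not taken from the source)."""
    if len(per_base_signal) % num_bins != 0:
        raise ValueError("signal length must be divisible by num_bins")
    width = len(per_base_signal) // num_bins
    return [sum(per_base_signal[i * width:(i + 1) * width]) for i in range(num_bins)]

def build_input_matrix(signals_by_mark, num_bins):
    """Return X as a list of M rows (one per HM mark), each of length T = num_bins."""
    return [bin_signal(track, num_bins) for track in signals_by_mark]

def binarize_expression(expression_values, threshold):
    """Map expression values to +1 (on) or -1 (off).
    The threshold is a free parameter in this sketch."""
    return [1 if value > threshold else -1 for value in expression_values]

# Toy example: M = 2 marks, a gene region of 12 positions, T = 4 bins.
signals = [
    [0, 1, 2, 0, 0, 0, 3, 3, 3, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 2],
]
X = build_input_matrix(signals, num_bins=4)
# X == [[3, 0, 9, 1], [3, 3, 0, 2]]

# Labels for three hypothetical genes with made-up expression values.
y = binarize_expression([5.0, 0.2, 1.7], threshold=1.0)
# y == [1, -1, 1]
```

Python indices start at zero, so `X[j][t]` here corresponds to [math] x_{j+1}^{t+1} [/math] in the notation above.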

# Reference

[1] Andrew J Bannister and Tony Kouzarides. Regulation of chromatin by histone modifications. Cell research, 21(3):381–395, 2011.

[2] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.

[3] Ritambhara Singh, et al. Attend and predict: Understanding gene regulation by selective attention on chromatin. Advances in Neural Information Processing Systems, 2017.

[4] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.