# Background

Gene regulation is the process of controlling which genes in a cell's DNA are turned 'on' (expressed) or 'off' (not expressed). When a gene is expressed, a functional product such as a protein is created. Even though all the cells of a multicellular organism (e.g., a human) contain the same DNA, different cell types in that organism may express very different sets of genes, and as a result each cell type has distinct functionality. In other words, how a cell operates depends on the genes expressed in that cell. Many factors, including 'chromatin modification marks', influence which genes are expressed in a cell.

The function of chromatin is to efficiently wrap DNA around histones into a condensed volume that fits into the nucleus of a cell, and to protect the DNA structure and sequence during cell division and replication. Chemical modifications to the histones of the chromatin, known as histone marks, change the spatial arrangement of the condensed DNA structure, which in turn affects the expression of genes neighboring the mark. A histone mark can promote or obstruct the turning on of a gene by making the gene region accessible or restricted, respectively. This section of the DNA, where histone marks can potentially have an impact, is known as the DNA flanking region or 'gene region', and is considered to cover 10k base pairs centered at the transcription start site (TSS) (i.e., 5k base pairs in each direction). Unlike genetic mutations, histone modifications are reversible [1]. Therefore, understanding the influence of histone marks on gene regulation can assist in developing drugs for genetic diseases.

# Introduction

Revolutionary genomic technologies now enable us to profile genome-wide chromatin mark signals. Biologists can therefore measure gene expression and chromatin signals over the 'gene region' for different cell types across the whole human genome. The Roadmap Epigenomics Project (REMC, publicly available) [2] recently released 2,804 genome-wide datasets from 100 separate "normal" (not diseased) human cells/tissues, among which 166 datasets are gene expression reads and the rest are signal reads of various histone marks. The goal is to understand which histone marks are the most important and how they interact in gene regulation for each cell type.

Signal reads for histone marks are high-dimensional and spatially structured. The influence of a histone modification mark can occur anywhere in the gene region (covering 10k bp centered at the TSS), so it is important to understand how the impact of the mark on gene expression varies over the gene region, that is, how histone signals across the gene region affect gene expression. There are several types of histone marks in human chromatin that can influence gene regulation. Researchers have identified five standard histone proteins. These five histone proteins can be altered in different combinations with different chemical modifications, resulting in a large number of distinct histone modification marks. Different histone modification marks can act as a module, interacting with each other to influence gene expression.

This paper proposes an attention-based deep learning model to find how these chromatin factors, i.e., histone modification marks, contribute to the gene expression of a particular cell type. AttentiveChrome [3] utilizes a hierarchy of multiple LSTMs to discover interactions between the signals of each histone mark and to learn dependencies among the marks in expressing a gene. The authors include two levels of soft attention mechanism [6]: (1) attending to the most relevant signals of a histone mark, and (2) attending to the important marks and their interactions.

## Main Contributions

The contributions of this work can be summarized as follows:

• More accurate predictions than the state-of-the-art baselines.
• Better interpretability than state-of-the-art methods for visualizing deep learning models.
• Can explain its decisions by showing "what" and "where" the model has focused.
• First attention-based deep learning method for a problem in molecular biology.

# Previous Works

Previous machine learning approaches to this problem either (1) failed to model the spatial dependencies among the marks, or (2) required additional feature analysis.

# AttentiveChrome: Model Formulation

The authors propose an end-to-end architecture that can simultaneously attend and predict. The method uses recurrent neural networks (RNNs) composed of LSTM units to model the sequential spatial dependencies of the gene regions and to predict gene expression levels from chromatin mark signals. The embedding vector $h_t$, the output of an LSTM module, encodes the learned representation of the feature dependencies from time step 0 to $t$. For this task, each bin position of the gene region is treated as a time step.

The proposed AttentiveChrome framework contains the following five modules:

• Bin-level LSTM encoder encoding the bin positions of the gene region (one for each HM mark)
• Bin-level $\alpha$-Attention across all bin positions (one for each HM mark)
• HM-level LSTM encoder (one encoder encoding all HM marks)
• HM-level $\beta$-Attention among all HM marks (one)
• The final classification module

Figure 1 (Supplementary Figure 2) presents the overview of the proposed AttentiveChrome framework.

## Input and Output

Each dataset contains the gene expression labels and the histone signal reads for one specific cell type. The authors evaluated AttentiveChrome on 56 different cell types. For each mark, there is a feature/input vector containing the signal reads surrounding the gene's TSS position (the gene region). The label of this input vector denotes the expression of the specific gene. This study considers binary labeling, where $+1$ denotes that the gene is expressed (on) and $-1$ denotes that the gene is not expressed (off). Each histone mark has one feature vector per gene. The authors adopt the feature inputs and outputs of their previous work DeepChrome [4] in this research. The input feature is represented by a matrix $\textbf{X}$ of size $M \times T$, where $M$ is the number of HM marks considered in the input and $T$ is the number of bin positions taken into account to represent the gene region. The $j^{th}$ row of the matrix $\textbf{X}$, $x^j$, represents the sequentially structured signals of the $j^{th}$ HM mark, where $j \in \{1, \cdots, M\}$. Therefore, $x^j_t$ in the matrix $\textbf{X}$ represents the value of the $t^{th}$ bin belonging to the $j^{th}$ HM mark, where $t \in \{1, \cdots, T\}$. If the training set contains $N_{tr}$ labeled pairs, the $n^{th}$ pair is specified as $(X^n, y^n)$, where $X^n$ is a matrix of size $M \times T$ and $y^n \in \{-1, +1\}$ is the binary label, with $n \in \{1, \cdots, N_{tr}\}$.
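As a minimal sketch, one training sample can be represented as an $M \times T$ matrix plus a binary label. The sizes below ($M = 5$, $T = 100$) are hypothetical placeholders, not necessarily the paper's actual choices:

```python
import torch

# Hypothetical sizes: M histone marks, T bin positions per gene region.
M, T = 5, 100

# One training sample: an M x T matrix of binned signal read counts,
# where X[j, t] is the read count of mark j in bin t,
# and a binary label (+1 = gene expressed, -1 = not expressed).
X = torch.randint(0, 20, (M, T)).float()
y = 1

print(X.shape)  # torch.Size([5, 100])
```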

Figure 2 exhibits the input feature, and the output of AttentiveChrome for a particular gene (one sample).

## Bin-Level Encoder (one LSTM for each HM)

The sequentially ordered elements (each element is a bin position) of the gene region of the $n^{th}$ gene are represented by the $j^{th}$ row vector $x^j$. The authors consider each bin position as a time step for the LSTM. This study incorporates a bidirectional LSTM to model the overall dependencies among a total of $T$ bin positions in the gene region. The bidirectional LSTM contains two LSTMs:

• A forward LSTM, $\overrightarrow{LSTM_j}$, to model $x^j$ from $x^j_1$ to $x^j_T$, which outputs the embedding vector $\overrightarrow{h^j_t}$, of size $d$, for each bin $t$
• A reverse LSTM, $\overleftarrow{LSTM_j}$, to model $x^j$ from $x_T^j$ to $x_1^j$, which outputs the embedding vector $\overleftarrow{h^j_t}$, of size $d$ for each bin $t$

The final output of this layer, the embedding vector $h^j_t$ at the $t^{th}$ bin for the $j^{th}$ HM, of size $2d$, is obtained by concatenating the two vectors from both directions. Therefore, $h^j_t = [ \overrightarrow{h^j_t}, \overleftarrow{h^j_t}]$.
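A minimal PyTorch sketch of this bin-level encoder, assuming hypothetical sizes $T = 100$ and $d = 32$ (the paper's hyperparameters may differ); `bin_lstm` and `x_j` are illustrative names:

```python
import torch
import torch.nn as nn

T, d = 100, 32  # hypothetical: T bins, embedding size d per direction

# One bidirectional LSTM per HM mark; each bin is one time step.
bin_lstm = nn.LSTM(input_size=1, hidden_size=d,
                   bidirectional=True, batch_first=True)

x_j = torch.randn(1, T, 1)  # signal reads of the j-th mark, one value per bin
h_j, _ = bin_lstm(x_j)      # h_j[:, t] = [forward h; backward h] at bin t

print(h_j.shape)  # (1, 100, 64): concatenated embedding of size 2d per bin
```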

## Bin-Level $\alpha$-attention

Each bin contributes differently to the encoding of the entire $j^{th}$ mark. To highlight the most important bins for prediction, a soft attention weight vector $\alpha^j$ of size $T$ is learned for each $j$. To calculate the soft weight $\alpha^j_t$ for each $t$, the embedding vectors $\{h^j_1, \cdots, h^j_T \}$ of all the bins are utilized, using the following equation:

$\alpha^j_t = \frac{exp(\textbf{W}_b h^j_t)}{\sum_{i=1}^T{exp(\textbf{W}_b h^j_i)}}$

The parameter $\textbf{W}_b$ is learned jointly during training. The $j^{th}$ HM mark can then be represented by $m^j = \sum_{t=1}^T{\alpha^j_t \times h^j_t}$. Here, $h^j_t$ is the embedding vector and $\alpha^j_t$ is the importance weight of the $t^{th}$ bin in the representation of the $j^{th}$ HM mark. Intuitively, $\textbf{W}_b$ learns the context of the cell type.
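This bin-level attention can be sketched as follows, with $\textbf{W}_b$ modeled as a bias-free linear layer; the sizes are the same hypothetical placeholders as above:

```python
import torch
import torch.nn as nn

T, d = 100, 32
h_j = torch.randn(T, 2 * d)            # bin embeddings from the BiLSTM

W_b = nn.Linear(2 * d, 1, bias=False)  # learned context parameter W_b
scores = W_b(h_j).squeeze(-1)          # one score per bin
alpha_j = torch.softmax(scores, dim=0) # importance weights, sum to 1
m_j = (alpha_j.unsqueeze(-1) * h_j).sum(dim=0)  # weighted sum -> m^j

print(m_j.shape)  # torch.Size([64])
```

The softmax guarantees that the $T$ weights form a distribution over bins, which is what makes them interpretable as "where" the model focused.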

## HM-level Encoder (one LSTM)

Studies have observed that HMs work cooperatively to promote or suppress gene expression [5]. The HM-level encoder utilizes one bidirectional LSTM to capture this relationship between the HMs. To formulate a sequential dependency, an arbitrary ordering of the HMs is imposed, as the authors did not find any influence of a specific ordering. The representation $m^j$ of the $j^{th}$ HM, $HM_j$, calculated by the bin-level attention layer, is the input of this step. This set-based encoder outputs an embedding vector $s^j$ of size $d'$, which is the encoding of the $j^{th}$ HM:

$s^j = [ \overrightarrow{LSTM_s}(m^j), \overleftarrow{LSTM_s}(m^j) ]$

The dependencies between $j^{th}$ HM and the other HM marks are encoded in $s^j$, whereas $m^j$ from the previous step encodes the bin dependencies of the $j^{th}$ HM.
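A sketch of this HM-level encoder, stacking the $M$ mark embeddings in an arbitrary order and running one shared BiLSTM over them; $d'$ is written as `d2` and all sizes remain hypothetical:

```python
import torch
import torch.nn as nn

M, d, d2 = 5, 32, 16          # M marks; d2 stands for d'
m = torch.randn(1, M, 2 * d)  # one m^j per HM mark, in an arbitrary order

# One shared BiLSTM across the marks; each mark is one time step.
hm_lstm = nn.LSTM(input_size=2 * d, hidden_size=d2,
                  bidirectional=True, batch_first=True)
s, _ = hm_lstm(m)             # s[:, j] encodes HM_j together with the other marks

print(s.shape)  # (1, 5, 32): embedding of size 2*d2 per mark
```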

## HM-Level $\beta$-attention

This second soft attention level finds the important HM marks for classifying a gene's expression by learning an importance weight, $\beta^j$, for each $HM_j$, where $j \in \{ 1, \cdots, M \}$. The equation is

$\beta^j = \frac{exp(\textbf{W}_s s^j)}{\sum_{i=1}^M{exp(\textbf{W}_s s^i)}}$

The HM-level context parameter $\textbf{W}_s$ is trained jointly in the process. Intuitively, $\textbf{W}_s$ learns how significant each HM is for a cell type. Finally, the entire gene region is encoded in a hidden representation $\textbf{v}$, using the weighted sum of the embeddings of all HM marks:

$\textbf{v} = \sum_{j=1}^M{\beta^j \times s^j}$
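The HM-level attention mirrors the bin-level one, only now over the $M$ mark encodings; again $\textbf{W}_s$ is sketched as a bias-free linear layer and the sizes are placeholders:

```python
import torch
import torch.nn as nn

M, d2 = 5, 16
s = torch.randn(M, 2 * d2)             # HM encodings from the HM-level BiLSTM

W_s = nn.Linear(2 * d2, 1, bias=False) # learned HM-level context parameter W_s
beta = torch.softmax(W_s(s).squeeze(-1), dim=0)  # one weight per mark
v = (beta.unsqueeze(-1) * s).sum(dim=0)          # gene-region representation v

print(v.shape)  # torch.Size([32])
```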

## End-to-end training

The embedding vector $\textbf{v}$ is fed to a simple classification module, $f(\textbf{v}) = \text{softmax}(\textbf{W}_c \textbf{v} + b_c)$, where $\textbf{W}_c$ and $b_c$ are learnable parameters. The output is the probability of the gene expression being high (expressed) or low (suppressed). The whole model, including the attention modules, is differentiable, so backpropagation can perform end-to-end learning. The negative log-likelihood loss function is minimized during training.
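The classification head and loss can be sketched as below; `nn.CrossEntropyLoss` combines the softmax with the negative log-likelihood, and the $\{-1, +1\}$ labels are assumed to be remapped to $\{0, 1\}$ class indices:

```python
import torch
import torch.nn as nn

d2 = 16
v = torch.randn(1, 2 * d2)        # gene-region embedding from the attention layers

classifier = nn.Linear(2 * d2, 2) # W_c and b_c
loss_fn = nn.CrossEntropyLoss()   # softmax + negative log-likelihood
y = torch.tensor([1])             # label +1 remapped to class index 1

loss = loss_fn(classifier(v), y)
loss.backward()                   # gradients flow through the whole model
```

Because every module above (both LSTMs, both attention layers, and this head) is differentiable, a single `backward()` call trains all of them jointly.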

# Reference

[1] Andrew J Bannister and Tony Kouzarides. Regulation of chromatin by histone modifications. Cell research, 21(3):381–395, 2011.

[2] Anshul Kundaje, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, Zhizhuo Zhang, Jianrong Wang, Michael J Ziller, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, 2015.

[3] Ritambhara Singh, Jack Lanchantin, Arshdeep Sekhon, and Yanjun Qi. Attend and predict: Understanding gene regulation by selective attention on chromatin. Advances in Neural Information Processing Systems, 2017.

[4] Ritambhara Singh, Jack Lanchantin, Gabriel Robins, and Yanjun Qi. Deepchrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, 2016.

[5] Joanna Boros, Nausica Arnoult, Vincent Stroobant, Jean-François Collet, and Anabelle Decottignies. Polycomb repressive complex 2 and h3k27me3 cooperate with h3k9 methylation to maintain heterochromatin protein 1α at chromatin. Molecular and cellular biology, 34(19):3662–3674, 2014.

[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.