stat441F18/TCNLM

Revision as of 13:27, 5 November 2018 by H343li (talk | contribs)

The Topic Compositional Neural Language Model (TCNLM) simultaneously captures both the global semantic meaning and the local word-ordering structure of a document. TCNLM incorporates fundamental components of both a neural topic model (NTM) and a Mixture-of-Experts (MoE) language model. The latent topics are learned within a variational autoencoder framework and, together with the topic-usage probabilities, are then used to train the MoE model. (Insert figure here)

TCNLM networks are well-suited for topic classification and sentence generation on a given topic. The combination of latent topics, weighted by the topic-usage probabilities, yields an effective prediction for the sentences. TCNLMs were also developed to address the inability of RNN-based neural language models to capture broad document context. After the global semantics are learned, the probability of each learned latent topic is used to learn the local structure of a word sequence.

Presented by

  • Yan Yu Chen
  • Qisi Deng
  • Hengxin Li
  • Bochao Zhang

Model Architecture

Topic Model

A topic model is a probabilistic model that unveils the hidden semantic structures of a document. Topic modelling follows the philosophy that particular words will appear more frequently than others in certain topics.

LDA

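A common example of a topic model is latent Dirichlet allocation (LDA), which assumes that each document contains several topics in different proportions. LDA parameterizes the topic distribution with a Dirichlet distribution and calculates the marginal likelihood as [math]\displaystyle{ p(\boldsymbol{d} | \alpha, \boldsymbol{\beta}) = \int_{\boldsymbol{t}} p(\boldsymbol{t} | \alpha) \prod_{n} \sum_{z_n} p(w_n | \beta_{z_n}) p(z_n | \boldsymbol{t}) \, d\boldsymbol{t} }[/math], where [math]\displaystyle{ \alpha }[/math] is the Dirichlet parameter and the remaining symbols are as defined for the neural topic model below.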

Neural Topic Model

The neural topic model takes a bag-of-words representation of a document as input and predicts the document's topic distribution, with the aim of identifying the global semantic meaning of the document.

The variables are defined as follows:

  • [math]\displaystyle{ d }[/math] is a document over a vocabulary of [math]\displaystyle{ D }[/math] distinct words
  • [math]\displaystyle{ \boldsymbol{d} \in \mathbb{Z}_+^D }[/math] is the bag-of-words representation of document d (each element of d is the number of times the corresponding word appears in the document)
  • [math]\displaystyle{ \boldsymbol{t} }[/math] is the topic proportion for document d
  • [math]\displaystyle{ T }[/math] is the number of topics
  • [math]\displaystyle{ z_n }[/math] is the topic assignment for word [math]\displaystyle{ w_n }[/math]
  • [math]\displaystyle{ \boldsymbol{\beta} = \{\beta_1, \beta_2, \dots, \beta_T \} }[/math] is the transition matrix from the topic distribution to the word distribution, trained in the decoder, where [math]\displaystyle{ \beta_i \in \mathbb{R}^D }[/math] is the distribution over words for the i-th topic.

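Similar to LDA, the neural topic model parameterizes the multinomial document topic distribution. However, instead of a Dirichlet distribution, it obtains the topic proportions by passing a Gaussian random vector through a softmax function. The generative process is the following:

[math]\displaystyle{ \boldsymbol{\theta} \sim \mathcal{N}(\mu_0, \sigma_0^2), \quad \boldsymbol{t} = g(\boldsymbol{\theta}) = \text{softmax}(\hat{\boldsymbol{W}} \boldsymbol{\theta} + \hat{\boldsymbol{b}}), \quad z_n \sim \text{Discrete}(\boldsymbol{t}), \quad w_n \sim \text{Discrete}(\beta_{z_n}) }[/math]

where [math]\displaystyle{ \mathcal{N}(\mu_0, \sigma_0^2) }[/math] is an isotropic Gaussian distribution and [math]\displaystyle{ \hat{\boldsymbol{W}} }[/math] and [math]\displaystyle{ \hat{\boldsymbol{b}} }[/math] are trainable parameters.

The marginal likelihood for document d is then calculated as the following:

[math]\displaystyle{ p(\boldsymbol{d} | \mu_0, \sigma_0, \boldsymbol{\beta}) = \int_{\boldsymbol{t}} p(\boldsymbol{t} | \mu_0, \sigma_0^2) \prod_{n} \sum_{z_n} p(w_n | \beta_{z_n}) p(z_n | \boldsymbol{t}) \, d\boldsymbol{t} }[/math]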
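As a rough illustration of the generative process described above, the following NumPy sketch draws a Gaussian vector, maps it to topic proportions with a softmax, and then samples a topic and a word for each position. The names `W_hat`, `b_hat` and `sample_document`, the standard-normal prior, and all sizes are assumptions made for this example; `beta` is filled with random values rather than trained.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_document(beta, W_hat, b_hat, n_words, rng):
    """Sample topic proportions t and a bag-of-words vector d.

    beta  : (T, D) array, row k is the word distribution of topic k
    W_hat : (T, T) array and b_hat : (T,) array, softmax-layer parameters
    """
    T, D = beta.shape
    theta = rng.standard_normal(T)          # Gaussian random vector (mean 0, std 1 assumed)
    t = softmax(W_hat @ theta + b_hat)      # topic proportions, sum to 1
    d = np.zeros(D, dtype=int)
    for _ in range(n_words):
        z = rng.choice(T, p=t)              # topic assignment z_n
        w = rng.choice(D, p=beta[z])        # word w_n drawn from topic z_n
        d[w] += 1                           # accumulate the word count
    return t, d

rng = np.random.default_rng(0)
T, D = 3, 8                                  # hypothetical number of topics and vocabulary size
beta = rng.dirichlet(np.ones(D), size=T)     # hypothetical topic-word distributions
W_hat = rng.standard_normal((T, T))
b_hat = np.zeros(T)
t, d = sample_document(beta, W_hat, b_hat, n_words=20, rng=rng)
```

The returned `d` is a count vector of length `D` whose entries sum to the number of sampled words, which is the bag-of-words form the topic model takes as input.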

Re-Parameterization Trick

In order to build an unbiased and low-variance gradient estimator for the variational distribution, TCNLM uses the re-parameterization trick. The parameter updates, which are derived from the variational lower bound, are discussed in the Model Inference section.
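The trick can be sketched as follows: rather than sampling directly from a Gaussian with learned mean and standard deviation, one samples fixed noise and transforms it deterministically, so gradients can flow through the mean and standard deviation. The names `mu` and `log_sigma` are assumptions standing in for whatever the inference network outputs.

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Draw theta = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)     # noise that does not depend on the parameters
    return mu + np.exp(log_sigma) * eps     # deterministic in mu and log_sigma

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])                     # hypothetical variational mean
log_sigma = np.log(np.array([0.1, 0.2, 0.3]))       # hypothetical log standard deviation

# Many draws should reproduce the requested mean and standard deviation.
samples = np.stack([reparameterize(mu, log_sigma, rng) for _ in range(20000)])
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)
```

Because the randomness is isolated in `eps`, the sample is a differentiable function of `mu` and `log_sigma`, which is what allows a gradient-based update of the variational parameters.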

Diversity Regularizer

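One of the problems that many topic models encounter is redundancy in the inferred topics. The TCNLM therefore uses a diversity regularizer to reduce it. The idea is to regularize the row-wise distance between each pair of topics. First, we measure the distance between a pair of topics with [math]\displaystyle{ a(\beta_i, \beta_j) = \arccos \left( \frac{|\beta_i \cdot \beta_j|}{\|\beta_i\|_2 \|\beta_j\|_2} \right) }[/math]. Then, the mean angle of all pairs of T topics is [math]\displaystyle{ \phi = \frac{1}{T^2} \sum_i \sum_j a(\beta_i, \beta_j) }[/math], and the variance is [math]\displaystyle{ \nu = \frac{1}{T^2} \sum_i \sum_j (a(\beta_i, \beta_j) - \phi)^2 }[/math]. Finally, we identify the topic diversity regularization as [math]\displaystyle{ R = \phi - \nu }[/math], which will be used in the model inference.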
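A minimal sketch of such a regularizer, assuming the distance between two topics is the arccosine of their absolute cosine similarity and that the regularizer is the mean pairwise angle minus the variance of the pairwise angles. The function name and the toy topic matrices are hypothetical.

```python
import numpy as np

def diversity_regularizer(beta):
    """Mean pairwise angle minus variance of pairwise angles.

    beta : (T, D) array, one topic vector per row.
    """
    unit = beta / np.linalg.norm(beta, axis=1, keepdims=True)
    cos = np.clip(np.abs(unit @ unit.T), 0.0, 1.0)   # |cosine similarity| for every pair
    angles = np.arccos(cos)                          # (T, T) matrix of pairwise angles
    phi = angles.mean()                              # mean over all T*T pairs, including i == j
    nu = ((angles - phi) ** 2).mean()                # variance of the angles
    return phi - nu

orthogonal = np.eye(3)          # three maximally distinct toy topics
redundant = np.ones((3, 3))     # three identical toy topics
r_orth = diversity_regularizer(orthogonal)
r_red = diversity_regularizer(redundant)
```

Distinct topics give a larger value than identical ones, so adding this term to the training objective rewards topics that point in different directions while keeping the angles evenly spread.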

Language Model

A typical language model aims to define the conditional probability of each word [math]\displaystyle{ y_m }[/math] given all the preceding inputs [math]\displaystyle{ y_1, \dots, y_{m-1} }[/math], connected through the hidden state [math]\displaystyle{ h_m }[/math].

RNN (LSTM)

Recurrent Neural Networks (RNNs) capture the temporal relationships in the input and output a sequence of input-dependent data. Compared with traditional feedforward neural networks, RNNs maintain an internal memory by feeding previous information back into the network at each step. Because of this design, RNNs struggle to learn long-term dependencies: gradients vanish during back-propagation, which prevents states distant in time from contributing to the output of the current state. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are variations of RNNs that were designed to address the vanishing gradient issue.
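To make the vanishing gradient issue concrete, the sketch below uses a deliberately tiny scalar RNN, `h_t = tanh(w * h_{t-1})`, and multiplies the per-step derivatives together. The weight, the initial state and the step counts are arbitrary values chosen for illustration.

```python
import numpy as np

def gradient_through_time(w, h0, steps):
    """Return d h_steps / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = np.tanh(w * h)
        grad *= w * (1.0 - h ** 2)   # chain rule: derivative of tanh(w * h_prev) w.r.t. h_prev
    return grad

grad_short = gradient_through_time(w=0.5, h0=0.8, steps=5)
grad_long = gradient_through_time(w=0.5, h0=0.8, steps=50)
```

Each factor has magnitude at most 0.5 here, so the product shrinks geometrically with the number of steps: the state 50 steps back has almost no influence on the current state's gradient.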

Neural Language Model

In the Topic Compositional Neural Language Model, word choice and word order are strongly influenced by the topic distribution of a document, and each word has its corresponding topic assignment. A Mixture-of-Experts language model is proposed, where each expert is a topic-specific LSTM whose trained parameters correspond to the latent topic vector inherited from the neural topic model.

In such a model, the generation of words can be viewed as a weighted average of the predictions from each expert model, with the latent topic vector serving as the weights. Without loss of generality, the Mixture-of-Experts model is first illustrated with a simple RNN cell and then generalized to the proposed LSTM.

TCNLM extends the weight matrices of each RNN unit to be topic-dependent, since each individual word has a topic assignment; this implicitly defines an ensemble of T language models. Specifically, the T experts are jointly trained as follows:

(9)

(10)

(11)

Where (introducing notations)
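A minimal sketch of the topic-weighted recurrence described above, assuming each expert k has its own matrices `W[k]` and `U[k]`, that these are averaged with the topic proportions `t`, and that the hidden state is mapped to a word distribution by an output matrix `V` and a softmax. The tanh activation, the shapes and all names are assumptions for illustration, not necessarily the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_rnn_step(x, h_prev, t, W, U, V):
    """One step of a topic-weighted RNN.

    W : (T, n_h, n_x) and U : (T, n_h, n_h) hold one matrix per expert;
    t : (T,) topic proportions; V : (vocab, n_h) output matrix.
    """
    W_t = np.tensordot(t, W, axes=1)      # sum_k t[k] * W[k]
    U_t = np.tensordot(t, U, axes=1)      # sum_k t[k] * U[k]
    h = np.tanh(W_t @ x + U_t @ h_prev)   # new hidden state
    p = softmax(V @ h)                    # distribution over the next word
    return h, p

rng = np.random.default_rng(0)
T, n_x, n_h, vocab = 3, 4, 5, 10          # hypothetical sizes
W = rng.standard_normal((T, n_h, n_x))
U = rng.standard_normal((T, n_h, n_h))
V = rng.standard_normal((vocab, n_h))
x, h0 = rng.standard_normal(n_x), np.zeros(n_h)

# With a one-hot topic vector the step reduces to a single expert.
one_hot = np.array([0.0, 1.0, 0.0])
h_mix, p_mix = moe_rnn_step(x, h0, one_hot, W, U, V)
h_expert = np.tanh(W[1] @ x + U[1] @ h0)
```

With a general `t`, every expert contributes in proportion to its topic's weight, which is the sense in which the T language models form an implicit ensemble.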

A matrix decomposition technique is applied to W and U to further reduce the number of model parameters: each is written as a product of three terms. This method is inspired by () and () for semantic concept detection RNN. (element-wise product)

(12)
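To illustrate why a three-term product saves parameters, the sketch below assumes a factorization of the form W(t) = W_a diag(W_b t) W_c. The factor names, the number of factors `n_f` and all sizes are assumptions for the example. Applying the factored matrix to a vector needs only an element-wise product and never forms one full matrix per topic.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_x, n_h, n_f = 10, 30, 60, 40           # hypothetical sizes
W_a = rng.standard_normal((n_h, n_f))
W_b = rng.standard_normal((n_f, T))
W_c = rng.standard_normal((n_f, n_x))
t = rng.dirichlet(np.ones(T))               # topic proportions
x = rng.standard_normal(n_x)

# Explicit topic-dependent matrix: W_a @ diag(W_b t) @ W_c
W_t = W_a @ np.diag(W_b @ t) @ W_c
explicit = W_t @ x

# Equivalent form using an element-wise product
cheap = W_a @ ((W_b @ t) * (W_c @ x))

full_params = T * n_h * n_x                          # one (n_h, n_x) matrix per topic
factored_params = n_h * n_f + n_f * T + n_f * n_x    # three factor matrices
```

In this toy setting the unfactored ensemble would need 18000 weights while the three factors need 4000, and both forms give the same output vector.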

To generalize to the LSTM, TCNLM requires four sets of parameters, for the input gate, forget gate, output gate, and memory cell respectively. Recalling a typical LSTM cell (insert an image depicting LSTM), the model can be parametrized as follows:

(13)

(14)

(15)
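As a sketch of how four parameter sets could be combined with topic-dependent weights, the following code runs one step of a standard LSTM cell in which every input and recurrent matrix is a three-factor product modulated by the topic proportions. Biases are omitted, and the dictionary layout, factor names and sizes are all assumptions for illustration rather than the paper's exact equations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def factored(P, t, v):
    """Apply P['a'] @ diag(P['b'] @ t) @ P['c'] to v via an element-wise product."""
    return P["a"] @ ((P["b"] @ t) * (P["c"] @ v))

def topic_lstm_step(x, h_prev, c_prev, t, Wp, Up):
    """One LSTM step whose weights depend on the topic proportions t.

    Wp and Up map each of 'i', 'f', 'o', 'c' (input gate, forget gate,
    output gate, memory) to a dict of factor matrices.
    """
    pre = {g: factored(Wp[g], t, x) + factored(Up[g], t, h_prev) for g in "ifoc"}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c_tilde = np.tanh(pre["c"])           # candidate memory
    c = i * c_tilde + f * c_prev          # new memory cell
    h = o * np.tanh(c)                    # new hidden state
    return h, c

def make_factors(rng, n_out, n_in, n_f, T):
    """Random factor matrices for one gate (hypothetical initialization)."""
    return {"a": 0.1 * rng.standard_normal((n_out, n_f)),
            "b": 0.1 * rng.standard_normal((n_f, T)),
            "c": 0.1 * rng.standard_normal((n_f, n_in))}

rng = np.random.default_rng(0)
T, n_x, n_h, n_f = 3, 4, 5, 6
Wp = {g: make_factors(rng, n_h, n_x, n_f, T) for g in "ifoc"}
Up = {g: make_factors(rng, n_h, n_h, n_f, T) for g in "ifoc"}
t = np.array([0.2, 0.5, 0.3])
h, c = topic_lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), t, Wp, Up)
```

The gating logic is unchanged from an ordinary LSTM; only the way each weight matrix is produced from `t` differs.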

Model Inference

Model Comparison and Evaluation

Model Comparison

In this paper, TCNLM incorporates fundamental components of both an NTM and a MoE Language Model. Compared with other topic models and language models:

  1. The recent work of Miao et al. (2017) employs variational inference to train topic models. In contrast, TCNLM requires the neural network not only to model documents as bags of words but also to transfer the inferred topic knowledge to a language model for word-sequence generation.
  2. Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distribution given by both a topic model and a standard RNNLM. Distinct from this approach, TCNLM learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a MoE model design. Under TCNLM's factorization method, the model can yield boosted performance efficiently.
  3. TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) use a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, TCNLM jointly learns a topic model and a language model and focuses on the language modeling task.

Model Evaluation

The paper evaluates the model on three datasets: APNEWS, IMDB and BNC. APNEWS is a collection of Associated Press news articles from 2009 to 2016. The results are as follows:

  • In the evaluation of Language Model:
  1. All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information.
  2. TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers.
  3. The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly.
  4. The margin between LDA+LSTM/Topic-RNN and TCNLM indicates that TCNLM supplies a more efficient way to utilize the topic information, through the joint variational learning framework that implicitly trains an ensemble model.
  • In the evaluation of Topic Model:
  1. TCNLM achieves the best coherence performance on APNEWS and IMDB and is relatively competitive with LDA on BNC.
  2. The authors also observe that a larger model may result in slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement.
  3. Additionally, the advantage of TCNLM over Topic-RNN indicates that TCNLM supplies more powerful topic guidance.

Extensions