Convolutional Sequence to Sequence Learning

== Introduction ==

Sequence to sequence learning has been used to solve many tasks such as machine translation, speech recognition, and text summarization. Most past models employ RNNs for this problem, with bidirectional RNNs with soft attention being the dominant approach. In contrast, CNNs have not been widely used for these tasks, even though they offer several advantages:

  • Compared to recurrent layers, convolutions create representations for fixed-size contexts; however, the effective context size of the network can easily be made larger by stacking several layers on top of each other. This allows precise control over the maximum length of the dependencies to be modeled (see the sketch after this list).
  • Convolutional networks do not depend on the computations of the previous time step and therefore allow parallelization over every element in a sequence. This contrasts with RNNs which maintain a hidden state of the entire past that prevents parallel computation within a sequence.
  • Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers.
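
Below is a minimal PyTorch sketch (not the authors' code) of the first point: stacking width-k convolutions grows the effective context linearly with depth, so the maximum dependency length is set simply by choosing the number of layers. All sizes here are illustrative.

<pre>
import torch
import torch.nn as nn

# Illustrative only: a stack of L width-k convolutions lets each output
# position see 1 + L*(k-1) input positions.
k, num_layers, channels = 3, 5, 8
layers = nn.ModuleList([
    nn.Conv1d(channels, channels, kernel_size=k, padding=k // 2)
    for _ in range(num_layers)
])

x = torch.randn(1, channels, 20)   # (batch, channels, sequence length)
for conv in layers:
    x = conv(x)                    # each layer widens the context by k - 1

# After 5 width-3 layers, each output depends on up to 1 + 5*2 = 11 inputs.
print(x.shape)                     # torch.Size([1, 8, 20])
</pre>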

In this paper the authors introduce an architecture for sequence learning based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training to better exploit GPU hardware, and optimization is easier since the number of non-linearities is fixed and independent of the input length. The use of gated linear units eases gradient propagation, and each decoder layer is equipped with a separate attention module. The model outperforms the accuracy of the deep LSTM setup of Wu et al. (2016) and is the new state of the art for neural machine translation.
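
As a hedged illustration of the gated linear units mentioned above (a sketch in the spirit of the architecture, not its exact configuration), the convolution in each block can produce 2d channels that are split into two halves A and B, with the block output A ⊗ σ(B); the residual connection shown is also an illustrative choice.

<pre>
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    """Sketch of a convolutional block with a gated linear unit (GLU)."""
    def __init__(self, d, kernel_size=3):
        super().__init__()
        # The convolution outputs 2*d channels: one half is the candidate,
        # the other half parameterizes the gate.
        self.conv = nn.Conv1d(d, 2 * d, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (batch, d, time)
        # F.glu splits along dim=1 and returns A * sigmoid(B).
        return F.glu(self.conv(x), dim=1) + x  # residual connection (illustrative)

block = GLUConvBlock(d=8)
y = block(torch.randn(2, 8, 15))
print(y.shape)                                 # torch.Size([2, 8, 15])
</pre>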

== Related Work ==

Bradbury et al. (2016) introduce quasi-recurrent neural networks (QRNNs), an approach to neural sequence modelling that alternates convolutional layers, which apply in parallel across timesteps, with a minimalist recurrent pooling function that applies in parallel across channels. They use QRNNs for sentiment classification and language modelling, and also briefly describe an architecture based on QRNNs for sequence to sequence learning.
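
The element-wise "recurrent pooling" in a QRNN can be sketched as follows; this is an illustrative reconstruction of the idea (candidate values z and forget gates f are computed by a convolution for all timesteps in parallel, then mixed by a cheap element-wise recurrence), not Bradbury et al.'s released implementation, and all sizes are arbitrary.

<pre>
import torch
import torch.nn as nn

# Sketch of QRNN-style f-pooling: h_t = f_t * h_{t-1} + (1 - f_t) * z_t,
# applied element-wise (in parallel across channels).
d, k = 8, 2
conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k - 1)  # causal-style padding

x = torch.randn(1, d, 12)                 # (batch, channels, time)
zf = conv(x)[:, :, :x.size(2)]            # trim back to the original length
z, f = zf.chunk(2, dim=1)
z, f = torch.tanh(z), torch.sigmoid(f)

h = torch.zeros(1, d)
outputs = []
for t in range(x.size(2)):                # lightweight element-wise recurrence
    h = f[:, :, t] * h + (1 - f[:, :, t]) * z[:, :, t]
    outputs.append(h)

print(torch.stack(outputs, dim=2).shape)  # torch.Size([1, 8, 12])
</pre>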

Kalchbrenner et al. (2016) introduce an architecture called ByteNet. ByteNet is a one-dimensional convolutional neural network composed of two parts, one to encode the source sequence and the other to decode the target sequence. The network has two core properties: it runs in time that is linear in the length of the sequences, and it sidesteps the need for excessive memorization.
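
A rough sketch of the dilated-convolution idea that ByteNet builds on is given below (the layer sizes are arbitrary, not ByteNet's actual configuration): doubling the dilation at each layer makes the receptive field grow exponentially with depth, which is what lets a fixed-depth stack cover long sequences in time linear in their length.

<pre>
import torch
import torch.nn as nn

# Illustrative only: with kernel width 3 and dilations 1, 2, 4, 8 the
# receptive field is 1 + 2 * (1 + 2 + 4 + 8) = 31 positions after 4 layers.
d, k = 8, 3
layers = nn.ModuleList([
    nn.Conv1d(d, d, kernel_size=k, dilation=2 ** i, padding=2 ** i)
    for i in range(4)
])

x = torch.randn(1, d, 32)                 # (batch, channels, time)
for conv in layers:
    x = conv(x)                           # length is preserved by the padding

print(x.shape)                            # torch.Size([1, 8, 32])
</pre>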

However, none of the above approaches has demonstrated improvements over state-of-the-art results on large benchmark datasets. Gated convolutions have previously been explored for machine translation by Meng et al. (2015), but their evaluation was restricted to a small dataset. The authors themselves have explored architectures that used CNNs, but only in the encoder; the decoder was still recurrent.

== Convolutional Architecture ==

== Experimental Setup ==

== Results ==