summary

From statwiki
Revision as of 21:02, 14 March 2018 by W392zhan (talk | contribs)
Jump to navigation Jump to search

XGBoost: A Scalable Tree Boosting System

Presented by

Jiang, Cong

Song, Ziwei

Ye, Zhaoshan

Zhang, Wenling


Introduction

Model

Theory

[math]\displaystyle{ \mathbf{z}_n = \sigma_n\left(\mathbf{W}_{n}\mathbf{z}_{n-1} + \mathbf{b}_{n}\right), \quad 1 \leq n \leq N }[/math]

where [math]\displaystyle{ \mathbf{W}_{n} \in \mathbb{R}^{p_n \times p_{n-1}} }[/math] is the weight matrix and [math]\displaystyle{ \mathbf{b}_{n} \in \mathbb{R}^{p_n} }[/math] is the bias vector associated with the [math]\displaystyle{ n }[/math]th layer in the network.

The element-wise vector function [math]\displaystyle{ \sigma_n\left(\cdot\right) }[/math] is sigmoid-like for each component in its domain, outputing a value that ranges in [math]\displaystyle{ [0,1] }[/math]. Typically, the functions [math]\displaystyle{ \left\{\sigma_n\left(\cdot\right)\right\} }[/math] are the same for [math]\displaystyle{ n \lt N }[/math], but the final output [math]\displaystyle{ \sigma_N(\cdot) }[/math] depends on the network architecture—for instance, it may be a softmax function for multi-label classification. Thus the network is completed characterized by its weights and biases as the tuple [math]\displaystyle{ (\left\{\mathbf{W}_{n}\right\},\left\{\mathbf{b}_{n}\right\}) }[/math].

A sample network for [math]\displaystyle{ N = 2 }[/math] is depicted below: a graph of an ANN with [math]\displaystyle{ N = 2 }[/math], where the vertices represent the vectors [math]\displaystyle{ \left\{\mathbf{z}_n\right\}_{n=0}^2 }[/math] in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of [math]\displaystyle{ \mathbf{W}_1 }[/math] and [math]\displaystyle{ \mathbf{W}_2 }[/math], such that they match the vectors [math]\displaystyle{ \mathbf{z}_0 }[/math] and [math]\displaystyle{ \mathbf{z}_1 }[/math].

File:ann.png
Fig 1. Graph of an ANN with [math]\displaystyle{ N = 2 }[/math], where the vertices represent the vectors [math]\displaystyle{ \left\{\mathbf{z}_n\right\}_{n=0}^2 }[/math] in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of [math]\displaystyle{ \mathbf{W}_1 }[/math] and [math]\displaystyle{ \mathbf{W}_2 }[/math], such that they match the vectors [math]\displaystyle{ \mathbf{z}_0 }[/math] and [math]\displaystyle{ \mathbf{z}_1 }[/math].

A Recurrent Neural Network is a generalization of an ANN for a sequence of inputs [math]\displaystyle{ \left\{\mathbf{z}_0^{[t]}\right\} }[/math] where [math]\displaystyle{ t \in \left\{1,\ldots,T\right\} }[/math] such that there are recurrent connections between the intermediary vectors [math]\displaystyle{ \left\{\mathbf{z}_n\right\} }[/math] for different so-called time steps. These connections are made to represent conditioning on the previous vectors in the sequence: supposing the sequence were a vectorized representation of the words, an input to the network could be: [math]\displaystyle{ \left\{\mathbf{z}_0^{[1]},\mathbf{z}_0^{[2]},\mathbf{z}_0^{[3]}\right\} = \left\{\text{pass}, \text{the}, \text{sauce}\right\} }[/math]. In a language modelling problem for predictive text, the probability of obtaining [math]\displaystyle{ \mathbf{z}_0^{[3]} }[/math] is strongly conditioned on the previous words in the sequence. As such, additional recurrence weight matrices are added to the update rule for [math]\displaystyle{ 1 \leq n \leq N }[/math] and [math]\displaystyle{ t \gt 1 }[/math] to produce the recurrent update rule

[math]\displaystyle{ \mathbf{z}_n^{[t]} = \sigma_n\left( \mathbf{b}_{n} + \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} + \mathbf{R}_n\mathbf{z}_{n}^{[t-1]} \right) }[/math]

where [math]\displaystyle{ \mathbf{R}_n \in \mathbb{R}^{p_n \times p_n} }[/math] is the recurrence matrix that relates the [math]\displaystyle{ n }[/math]th layer’s output for item [math]\displaystyle{ t }[/math] to its previous output for item [math]\displaystyle{ t-1 }[/math]. The network architecture for a single layer [math]\displaystyle{ n }[/math] at step [math]\displaystyle{ t }[/math] is pictured below. This is a schematic of an RNN layer [math]\displaystyle{ n }[/math] at step [math]\displaystyle{ t }[/math] with recurrence on the output of [math]\displaystyle{ \mathbf{z}_n^{[t-1]} }[/math], with the dimensions of the matrices [math]\displaystyle{ \mathbf{R}_{n} }[/math] and [math]\displaystyle{ \mathbf{W}_{n} }[/math] pictured.

File:rnn.png
Fig 2. Schematic of an RNN layer [math]\displaystyle{ n }[/math] at step [math]\displaystyle{ t }[/math] with recurrence on the output of [math]\displaystyle{ \mathbf{z}_n^{[t-1]} }[/math], with the dimensions of the matrices [math]\displaystyle{ \mathbf{R}_{n} }[/math] and [math]\displaystyle{ \mathbf{W}_{n} }[/math] pictured.

Section

The RNN update rule used by Sutskever et al. comes from a paper by Graves (2013). The connections between layers are denser in this case. The final layer is fully connected to every preceding layer execept for the input [math]\displaystyle{ \mathbf{z}_0^{[t]} }[/math], and follows the update rule

[math]\displaystyle{ \mathbf{z}_{N}^{[t]} = \sigma_n\left( \mathbf{b}_N + \displaystyle\sum_{n' = 1}^{N-1} \mathbf{W}_{N,n'}\mathbf{z}_{n'}^{[t]} \right) }[/math]

where [math]\displaystyle{ \mathbf{W}_{N,n'} \in \mathbb{R}^{p_N\times p_{n'}} }[/math] denotes the weight matrix between layer [math]\displaystyle{ n' }[/math] and [math]\displaystyle{ N }[/math].

The layers 2 through [math]\displaystyle{ N-1 }[/math] have additional connections to [math]\displaystyle{ \mathbf{z}_0^{[t]} }[/math] as

[math]\displaystyle{ \mathbf{z}_n^{[t]} = \sigma_n\left( \mathbf{b}_{n} + \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} + \mathbf{W}_{n,0}\mathbf{z}_0^{[t]} + \mathbf{R}_n\mathbf{z}_{n}^{[t-1]} \right), }[/math]

where, again, [math]\displaystyle{ \mathbf{W}_{n,n'} }[/math] must be of size [math]\displaystyle{ \mathbb{R}^{p_n\times p_{n'}} }[/math]. The first layer has the typical RNN input rule as before,

[math]\displaystyle{ \mathbf{z}_{1}^{[t]} = \sigma_1\left( \mathbf{b}_{1} + \mathbf{W}_{1}\mathbf{z}_{0}^{[t]} + \mathbf{R}_{1}\mathbf{z}_{1}^{[t-1]} \right). }[/math]

Sample format

Recurrent neural networks are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers.<ref name=lstm> Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. </ref> Unlike feed forward neural networks that take in a single fixed length vector input and output a fixed length vector output, recurrent neural networks can take in a sequence of fixed length vectors as input, because of their ability to store information and maintain a connection between inputs through this memory layer. By comparison, previous inputs would have no impact on current output for feed forward neural networks, whereas they can impact current input in a recurrent neural network. (This paper used the LSTM formulation from Graves<ref name=grave> Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850 (2013). </ref>)


Where [math]\displaystyle{ \,S }[/math] is the base/source sentence, [math]\displaystyle{ \,T }[/math] is the paired translated sentence and [math]\displaystyle{ \,T_r }[/math] is the total training set. This objective function is to maximize the log probability of a correct translation [math]\displaystyle{ \,T }[/math] given the base/source sentence [math]\displaystyle{ \,S }[/math] over the entire training set. Once the training is complete, translations are produced by finding the most likely translation according to LSTM:

[math]\displaystyle{ \hat{T} = \underset{T}{\operatorname{arg\ max}}\ p(T|S) }[/math]


It has been showed that Long Short-Term Memory recurrent neural networks have the ability to generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction <ref name=grave> Reference </ref>.


Training and Results

Training Method

Results

The resulting LSTM neural networks outperformed standard Statistical Machine Translation (SMT) with a BLEU score of 34.8 against 33.3 and with certain heuristics or modification, was very close to matching the best performing system. Additionally, it could recognize sentences in both active and passive voice as being similar.

Active Voice: I ate an apple.

Passive Voice: The apple was eaten by me.


In summary the LSTM method has proven to be quite capable of translating long sentences despite potentially long delay between input time steps. However, it still falls short of [Edinburgh's specialised statistical model http://www.statmt.org/OSMOSES/sysdesc.pdf].

Some developments

Open questions

  1. Instead of reversing the input sequence the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it is reduced between the last half of the words. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language is also important (e.g. when a short time lag between the first few words is better than a short time lag between the last few words of sentence).
  2. For half of the words the time lag increases to more than the average. Thus, they might have only a minor contribution to the model performance. It could be interesting to see how much the performance is affected by leaving those words out of the input sequence. Or more generally, one could ask, how does the performance related to the number of used input words?

Section

Criticisms

Source