Revision as of 22:02, 14 March 2018

XGBoost: A Scalable Tree Boosting System

Presented by

Jiang, Cong

Song, Ziwei

Ye, Zhaoshan

Zhang, Wenling

Introduction

Model

Theory

[math]\displaystyle{ \mathbf{z}_n = \sigma_n\left(\mathbf{W}_{n}\mathbf{z}_{n-1} + \mathbf{b}_{n}\right), \quad 1 \leq n \leq N }[/math]

where [math]\displaystyle{ \mathbf{W}_{n} \in \mathbb{R}^{p_n \times p_{n-1}} }[/math] is the weight matrix and [math]\displaystyle{ \mathbf{b}_{n} \in \mathbb{R}^{p_n} }[/math] is the bias vector associated with the [math]\displaystyle{ n }[/math]th layer in the network.

The element-wise vector function [math]\displaystyle{ \sigma_n\left(\cdot\right) }[/math] is sigmoid-like for each component in its domain, outputing a value that ranges in [math]\displaystyle{ [0,1] }[/math]. Typically, the functions [math]\displaystyle{ \left\{\sigma_n\left(\cdot\right)\right\} }[/math] are the same for [math]\displaystyle{ n \lt N }[/math], but the final output [math]\displaystyle{ \sigma_N(\cdot) }[/math] depends on the network architecture—for instance, it may be a softmax function for multi-label classification. Thus the network is completed characterized by its weights and biases as the tuple [math]\displaystyle{ (\left\{\mathbf{W}_{n}\right\},\left\{\mathbf{b}_{n}\right\}) }[/math].

A sample network for [math]\displaystyle{ N = 2 }[/math] is depicted below: a graph of an ANN with [math]\displaystyle{ N = 2 }[/math], where the vertices represent the vectors [math]\displaystyle{ \left\{\mathbf{z}_n\right\}_{n=0}^2 }[/math] in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of [math]\displaystyle{ \mathbf{W}_1 }[/math] and [math]\displaystyle{ \mathbf{W}_2 }[/math], such that they match the vectors [math]\displaystyle{ \mathbf{z}_0 }[/math] and [math]\displaystyle{ \mathbf{z}_1 }[/math].

File:ann.png

Fig 1. Graph of an ANN with [math]\displaystyle{ N = 2 }[/math], where the vertices represent the vectors [math]\displaystyle{ \left\{\mathbf{z}_n\right\}_{n=0}^2 }[/math] in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of [math]\displaystyle{ \mathbf{W}_1 }[/math] and [math]\displaystyle{ \mathbf{W}_2 }[/math], such that they match the vectors [math]\displaystyle{ \mathbf{z}_0 }[/math] and [math]\displaystyle{ \mathbf{z}_1 }[/math].

A Recurrent Neural Network is a generalization of an ANN for a sequence of inputs [math]\displaystyle{ \left\{\mathbf{z}_0^{[t]}\right\} }[/math] where [math]\displaystyle{ t \in \left\{1,\ldots,T\right\} }[/math] such that there are recurrent connections between the intermediary vectors [math]\displaystyle{ \left\{\mathbf{z}_n\right\} }[/math] for different so-called time steps. These connections are made to represent conditioning on the previous vectors in the sequence: supposing the sequence were a vectorized representation of the words, an input to the network could be: [math]\displaystyle{ \left\{\mathbf{z}_0^{[1]},\mathbf{z}_0^{[2]},\mathbf{z}_0^{[3]}\right\} = \left\{\text{pass}, \text{the}, \text{sauce}\right\} }[/math]. In a language modelling problem for predictive text, the probability of obtaining [math]\displaystyle{ \mathbf{z}_0^{[3]} }[/math] is strongly conditioned on the previous words in the sequence. As such, additional recurrence weight matrices are added to the update rule for [math]\displaystyle{ 1 \leq n \leq N }[/math] and [math]\displaystyle{ t \gt 1 }[/math] to produce the recurrent update rule

[math]\displaystyle{ \mathbf{z}_n^{[t]} = \sigma_n\left( \mathbf{b}_{n} + \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} + \mathbf{R}_n\mathbf{z}_{n}^{[t-1]} \right) }[/math]

where [math]\displaystyle{ \mathbf{R}_n \in \mathbb{R}^{p_n \times p_n} }[/math] is the recurrence matrix that relates the [math]\displaystyle{ n }[/math]th layer’s output for item [math]\displaystyle{ t }[/math] to its previous output for item [math]\displaystyle{ t-1 }[/math]. The network architecture for a single layer [math]\displaystyle{ n }[/math] at step [math]\displaystyle{ t }[/math] is pictured below. This is a schematic of an RNN layer [math]\displaystyle{ n }[/math] at step [math]\displaystyle{ t }[/math] with recurrence on the output of [math]\displaystyle{ \mathbf{z}_n^{[t-1]} }[/math], with the dimensions of the matrices [math]\displaystyle{ \mathbf{R}_{n} }[/math] and [math]\displaystyle{ \mathbf{W}_{n} }[/math] pictured.

File:rnn.png

Fig 2. Schematic of an RNN layer [math]\displaystyle{ n }[/math] at step [math]\displaystyle{ t }[/math] with recurrence on the output of [math]\displaystyle{ \mathbf{z}_n^{[t-1]} }[/math], with the dimensions of the matrices [math]\displaystyle{ \mathbf{R}_{n} }[/math] and [math]\displaystyle{ \mathbf{W}_{n} }[/math] pictured.

Section

The RNN update rule used by Sutskever et al. comes from a paper by Graves (2013). The connections between layers are denser in this case. The final layer is fully connected to every preceding layer execept for the input [math]\displaystyle{ \mathbf{z}_0^{[t]} }[/math], and follows the update rule

[math]\displaystyle{ \mathbf{z}_{N}^{[t]} = \sigma_n\left( \mathbf{b}_N + \displaystyle\sum_{n' = 1}^{N-1} \mathbf{W}_{N,n'}\mathbf{z}_{n'}^{[t]} \right) }[/math]

where [math]\displaystyle{ \mathbf{W}_{N,n'} \in \mathbb{R}^{p_N\times p_{n'}} }[/math] denotes the weight matrix between layer [math]\displaystyle{ n' }[/math] and [math]\displaystyle{ N }[/math].

The layers 2 through [math]\displaystyle{ N-1 }[/math] have additional connections to [math]\displaystyle{ \mathbf{z}_0^{[t]} }[/math] as

[math]\displaystyle{ \mathbf{z}_n^{[t]} = \sigma_n\left( \mathbf{b}_{n} + \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]} + \mathbf{W}_{n,0}\mathbf{z}_0^{[t]} + \mathbf{R}_n\mathbf{z}_{n}^{[t-1]} \right), }[/math]

where, again, [math]\displaystyle{ \mathbf{W}_{n,n'} }[/math] must be of size [math]\displaystyle{ \mathbb{R}^{p_n\times p_{n'}} }[/math]. The first layer has the typical RNN input rule as before,

[math]\displaystyle{ \mathbf{z}_{1}^{[t]} = \sigma_1\left( \mathbf{b}_{1} + \mathbf{W}_{1}\mathbf{z}_{0}^{[t]} + \mathbf{R}_{1}\mathbf{z}_{1}^{[t-1]} \right). }[/math]

Sample format

Recurrent neural networks are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers.<ref name=lstm> Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. </ref> Unlike feed forward neural networks that take in a single fixed length vector input and output a fixed length vector output, recurrent neural networks can take in a sequence of fixed length vectors as input, because of their ability to store information and maintain a connection between inputs through this memory layer. By comparison, previous inputs would have no impact on current output for feed forward neural networks, whereas they can impact current input in a recurrent neural network. (This paper used the LSTM formulation from Graves<ref name=grave> Graves, Alex. "Generating sequences with recurrent neural networks." arXiv preprint arXiv:1308.0850 (2013). </ref>)

Where [math]\displaystyle{ \,S }[/math] is the base/source sentence, [math]\displaystyle{ \,T }[/math] is the paired translated sentence and [math]\displaystyle{ \,T_r }[/math] is the total training set. This objective function is to maximize the log probability of a correct translation [math]\displaystyle{ \,T }[/math] given the base/source sentence [math]\displaystyle{ \,S }[/math] over the entire training set. Once the training is complete, translations are produced by finding the most likely translation according to LSTM:

[math]\displaystyle{ \hat{T} = \underset{T}{\operatorname{arg\ max}}\ p(T|S) }[/math]

It has been showed that Long Short-Term Memory recurrent neural networks have the ability to generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction <ref name=grave> Reference </ref>.

Training and Results

Training Method

Results

The resulting LSTM neural networks outperformed standard Statistical Machine Translation (SMT) with a BLEU score of 34.8 against 33.3 and with certain heuristics or modification, was very close to matching the best performing system. Additionally, it could recognize sentences in both active and passive voice as being similar.

Active Voice: I ate an apple.

Passive Voice: The apple was eaten by me.

In summary the LSTM method has proven to be quite capable of translating long sentences despite potentially long delay between input time steps. However, it still falls short of [Edinburgh's specialised statistical model http://www.statmt.org/OSMOSES/sysdesc.pdf].

Some developments

Open questions

Instead of reversing the input sequence the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it is reduced between the last half of the words. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language is also important (e.g. when a short time lag between the first few words is better than a short time lag between the last few words of sentence).
For half of the words the time lag increases to more than the average. Thus, they might have only a minor contribution to the model performance. It could be interesting to see how much the performance is affected by leaving those words out of the input sequence. Or more generally, one could ask, how does the performance related to the number of used input words?

Section

Criticisms

Source

@@ Line 1: / Line 1: @@
-Dimensionality Reduction by Learning an Invariant Mapping
+XGBoost: A Scalable Tree Boosting System
 == Presented by ==
@@ Line 12: / Line 11: @@
+= Introduction =
-== Intention ==
+= Model =
-The drawbacks of most existing technique:
-Most of them depend on a meaningful and computable distance metric in input space.
+=== Theory  ===
-(eg. LLE, Isomap relies on computable distance)
-They do not compute a “function” that can accurately map new input samples whose relationship to the training data is unknown.
-To overcome these drawbacks, this paper introduces a technique called DrLIM. The learning relies solely on neighborhood relationships and does not require any distance measure in the input space.
+<math>
+    \mathbf{z}_n =
+                \sigma_n\left(\mathbf{W}_{n}\mathbf{z}_{n-1} + \mathbf{b}_{n}\right), \quad 1 \leq n \leq N</math>
-== Mathematical Model==
+where <math>\mathbf{W}_{n} \in \mathbb{R}^{p_n \times p_{n-1}}</math> is the weight matrix and <math>\mathbf{b}_{n} \in \mathbb{R}^{p_n}</math> is the bias vector associated with the <math>n</math>th layer in the network.
-'''Input''': A set of vectors <math> I=\{x_1,x_2,......,x_p\} </math>, where <math> x_i\in \mathbb{R}^D, \forall i=1,2,3......,n. </math>
+The element-wise vector function <math>\sigma_n\left(\cdot\right)</math> is sigmoid-like for each component in its domain, outputing a value that ranges in <math>[0,1]</math>. Typically, the functions <math>\left\{\sigma_n\left(\cdot\right)\right\}</math> are the same for <math>n < N</math>, but the final output <math>\sigma_N(\cdot)</math> depends on the network architecture—for instance, it may be a softmax function for multi-label classification. Thus the network is completed characterized by its weights and biases as the tuple <math>(\left\{\mathbf{W}_{n}\right\},\left\{\mathbf{b}_{n}\right\})</math>.
-'''Output''': A parametric function <math>G_W:\mathbb{R}^D \rightarrow \mathbb{R}^d </math> with <math> d<<D </math> such that it satisfies the following 3 properties:1.Simple distance measures in the output space should approximate the neighborhood relationships in the input space; 2.The mapping should not be constrained to implementing simple distance measures in the input space and should be able to learn invariances to complex transformations; 3.It should faithful even for samples whose neighborhood relationship are unknown.
+A sample network for <math>N = 2</math> is depicted below: a graph of an ANN with <math>N = 2</math>, where the vertices represent the vectors <math>\left\{\mathbf{z}_n\right\}_{n=0}^2</math> in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of <math>\mathbf{W}_1</math> and <math>\mathbf{W}_2</math>, such that they match the vectors <math>\mathbf{z}_0</math> and <math>\mathbf{z}_1</math>.
+<center>
+[[File:ann.png | frame | center |Fig 1. Graph of an ANN with <math>N = 2</math>, where the vertices represent the vectors <math>\left\{\mathbf{z}_n\right\}_{n=0}^2</math> in their respective layers. Edges denote computation, where the vector transformations have been overlaid to show sample dimensions of <math>\mathbf{W}_1</math> and <math>\mathbf{W}_2</math>, such that they match the vectors <math>\mathbf{z}_0</math> and <math>\mathbf{z}_1</math>.
+ ]]
+</center>
-'''Build up mathematical model''':
+A Recurrent Neural Network is a generalization of an ANN for a ''sequence'' of inputs <math>\left\{\mathbf{z}_0^{[t]}\right\}</math> where <math>t \in
-By clustering the data points in high dimension, we could map the high dimensional data points to lower dimension. We could build up clusters according to similarities of the vectors. The similarities are measured by prior knowledge. Set Y=0 if <math> x_1 </math> and <math> x_2 </math> are deemed similar; otherwise set Y=1. This clustering problem can be reduced to an optimization problem. Define following functions:
+\left\{1,\ldots,T\right\}</math> such that there are ''recurrent'' connections between the intermediary vectors <math>\left\{\mathbf{z}_n\right\}</math> for different so-called ''time steps''. These connections are made to represent conditioning on the previous vectors in the sequence: supposing the sequence were a vectorized representation of the words, an input to the network could be: <math>\left\{\mathbf{z}_0^{[1]},\mathbf{z}_0^{[2]},\mathbf{z}_0^{[3]}\right\} =
+\left\{\text{pass}, \text{the}, \text{sauce}\right\}</math>. In a language modelling problem for predictive text, the probability of obtaining <math>\mathbf{z}_0^{[3]}</math> is strongly conditioned on the previous words in the sequence. As such, additional recurrence weight matrices are added to the update rule for <math>1 \leq n \leq N</math> and <math>t > 1</math> to produce the recurrent update rule
+<math>
+    \mathbf{z}_n^{[t]} =
+        \sigma_n\left(
+            \mathbf{b}_{n}
+        +   \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]}
+        +   \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}
+        \right)</math>
+where <math>\mathbf{R}_n \in \mathbb{R}^{p_n \times p_n}</math> is the recurrence matrix that relates the <math>n</math>th layer’s output for item <math>t</math> to its previous output for item <math>t-1</math>. The network architecture for a single layer <math>n</math> at step <math>t</math> is pictured below. This is a schematic of an RNN layer <math>n</math> at step <math>t</math> with recurrence on the output of <math>\mathbf{z}_n^{[t-1]}</math>, with the dimensions of the matrices <math>\mathbf{R}_{n}</math> and <math>\mathbf{W}_{n}</math> pictured.
 <center>
-<math> D_W^i(x_1,x_2)=\left \| G_W(x_1)-G_W(x_2)\right \|_2  \qquad (1) </math>
+[[File:rnn.png | frame | center |Fig 2. Schematic of an RNN layer <math>n</math> at step <math>t</math> with recurrence on the output of <math>\mathbf{z}_n^{[t-1]}</math>, with the dimensions of the matrices <math>\mathbf{R}_{n}</math> and <math>\mathbf{W}_{n}</math> pictured. ]]
+</center>
+===Section  ===
+The RNN update rule used by Sutskever et al. comes from a paper by Graves (2013). The connections between layers are denser in this case. The final layer is fully connected to every preceding layer execept for the input <math>\mathbf{z}_0^{[t]}</math>, and follows the update rule
+<math>
+        \mathbf{z}_{N}^{[t]} = \sigma_n\left(
+            \mathbf{b}_N
+        + \displaystyle\sum_{n' = 1}^{N-1} \mathbf{W}_{N,n'}\mathbf{z}_{n'}^{[t]}
+    \right)</math>
+where <math>\mathbf{W}_{N,n'} \in \mathbb{R}^{p_N\times p_{n'}}</math> denotes the weight matrix between layer <math>n'</math> and <math>N</math>.
+The layers 2 through <math>N-1</math> have additional connections to <math>\mathbf{z}_0^{[t]}</math> as
+<math>
+        \mathbf{z}_n^{[t]} = \sigma_n\left(
+            \mathbf{b}_{n}
+        +   \mathbf{W}_{n}\mathbf{z}_{n-1}^{[t]}
+        +   \mathbf{W}_{n,0}\mathbf{z}_0^{[t]}
+        +   \mathbf{R}_n\mathbf{z}_{n}^{[t-1]}
+    \right),</math>
-<math> l(W,(Y,x_1,x_2)^i))=(1-Y)L_S(D_W^i)+YL_D(D_W^i) \qquad (2) </math>
+where, again, <math>\mathbf{W}_{n,n'}</math> must be of size <math>\mathbb{R}^{p_n\times
+    p_{n'}}</math>. The first layer has the typical RNN input rule as before,
-<math> L(W)= \sum^P_{i=1} l(W,(Y,x_1,x_2)^i) \qquad (3) </math>
-</center>
-where <math>(Y,x_1,x_2)^i</math> is the <math>i</math>-th labeled sample pair, <math>L_S</math> is the partial loss function for the points in the same cluster, <math>L_D</math> is the partial loss function for points in different clusters.
-<math>L_S</math> and <math>L_D</math> are determined by ourselves and must be designed such that minimizing <math>L</math> respect to W results in low values of <math>D_W </math> for similar pairs and high values of <math>D_W </math> for disimilar pairs. The intuition is, minimizing the function <math>L(\ )</math> could build up clusters over input data set in the low dimensional space.
-The exact loss function is
-<center>
 <math>
-l(W, (Y, x_1, x_2))=(1-Y)\frac{1}{2}(D_W)^2+(Y)\frac{1}{2}\{max(0, m-D_W)\}^2
+        \mathbf{z}_{1}^{[t]} = \sigma_1\left(
+            \mathbf{b}_{1}
+        +   \mathbf{W}_{1}\mathbf{z}_{0}^{[t]}
+        +   \mathbf{R}_{1}\mathbf{z}_{1}^{[t-1]}
+    \right).
 </math>
-</center>
-where m>0 is a margin.
-'''Algorithm''':
+===Sample format===
+[http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ Recurrent neural networks] are a variation of deep neural networks that are capable of storing information about previous hidden states in special memory layers.<ref name=lstm>
+Hochreiter, Sepp, and Jürgen Schmidhuber. [http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf "Long short-term memory."] Neural computation 9.8 (1997): 1735-1780.
+</ref> Unlike feed forward neural networks that take in a single fixed length vector input and output a fixed length vector output, recurrent neural networks can take in a sequence of fixed length vectors as input, because of their ability to store information and maintain a connection between inputs through this memory layer. By comparison, previous inputs would have no impact on current output for feed forward neural networks, whereas they can impact current input in a recurrent neural network. (This paper used the LSTM formulation from Graves<ref name=grave>
+Graves, Alex. [http://arxiv.org/pdf/1308.0850.pdf "Generating sequences with recurrent neural networks."] arXiv preprint arXiv:1308.0850 (2013).
+</ref>)
+Where <math>\,S</math> is the base/source sentence, <math>\,T</math> is the paired translated sentence and <math>\,T_r</math> is the total training set. This objective function is to maximize the log probability of a correct translation <math>\,T</math> given the base/source sentence <math>\,S</math> over the entire training set. Once the training is complete, translations are produced by finding the most likely translation according to LSTM:
+:<math>\hat{T} = \underset{T}{\operatorname{arg\ max}}\ p(T|S)</math>
+<br />It has been showed that Long Short-Term Memory recurrent neural networks have the ability to generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction <ref name=grave>
+Reference
+</ref>.
+= Training and Results =
+=== Training Method ===
-Step 1: For each input sample <math>x_i</math>, do the following:
-(a)Using prior knowledge to find the set of samples <math> S_{x_i}=\{x_j\}_{j=1}^p </math> where <math> x_j </math> is similar to <math> x_i </math> for <math>1 \leqslant i \leqslant  p </math>
-(b)Pair the sample <math>x_i</math> with the other training samples and label the pairs so that:<math>Y_ij=0</math> if <math>x_j \in S_{x_i}</math> and <math>Y_{ij}=1</math> otherwise.
+=== Results ===
+The resulting LSTM neural networks outperformed standard Statistical Machine Translation (SMT) with a BLEU score of 34.8 against 33.3 and with certain heuristics or modification, was very close to matching the best performing system. Additionally, it could recognize sentences in both active and passive voice as being similar.
+<blockquote>
+Active Voice: I ate an apple.
+</blockquote>
+<blockquote>
+Passive Voice: The apple was eaten by me.
+</blockquote>
-Step 2: Repeat until convergence:
-(a) For each pair <math>(x_i,x_j)</math> in the training set, do
+In summary the LSTM method has proven to be quite capable of translating long sentences despite potentially long delay between input time steps. However, it still falls short of [Edinburgh's specialised statistical model http://www.statmt.org/OSMOSES/sysdesc.pdf].
-i. If <math>Y_{ij}=0</math>, then update W to decrease <math>D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2</math>
+=== Some developments  ===
-ii. If <math>Y_{ij}=1</math>, then update W to increase <math>D_W=\left \| G_W(x_i)-G_W(x_j) \right \|_2</math>
-== Experiments==
+=== Open questions ===
-'''How do they define neighbor and what is ''G<sub>W</sub>'' in practice?'''
-The experiments involving airplane images from NORB dataset use a 2-layer fully connected neural network as ''G<sub>W</sub>''. The neighbor is constructed based on the temporal continuity of the camera, i.e. images are similar if they were taken from contiguous elevation or azimuth regardless of lighting.
+# Instead of reversing the input sequence the target sequence could be reversed. This would change the time lags between corresponding words in a similar way, but instead of reducing the time lag between the first half of corresponding words, it is reduced between the last half of the words. This might allow conclusions about whether the improved performance is purely due to the reduced minimal time lag or whether structure in natural language is also important (e.g. when a short time lag between the first few words is better than a short time lag between the last few words of sentence).
+# For half of the words the time lag increases to more than the average. Thus, they might have only a minor contribution to the model performance. It could be interesting to see how much the performance is affected by leaving those words out of the input sequence. Or more generally, one could ask, how does the performance related to the number of used input words?
-The experiments on the MNIST dataset used a convolutional network as ''G<sub>W</sub>''. The neighborhood graph is generated with euclidean distances (5 nearest neighbors).
+= Section =
-==  Discussion ==
+= Criticisms =
-This method seems very similar to the Large Margin Nearest Neighbor (LMNN) method, because it also aims to maximize the difference between same-label neighbor distance and different-label neighbor distance. One difference is that it uses a heuristic method to find the solution, while the optimization problem of LMNN is convex and thus it can find the exact solution.
+= Source =

summary: Difference between revisions

Revision as of 22:02, 14 March 2018

Contents

Presented by

Introduction

Model

Theory

Section

Sample format

Training and Results

Training Method

Results

Some developments

Open questions

Section

Criticisms

Source

Navigation menu

summary: Difference between revisions

Revision as of 22:02, 14 March 2018

Presented by

Introduction

Model

Theory

Section

Sample format

Training and Results

Training Method

Results

Some developments

Open questions

Section

Criticisms

Source

Navigation menu

Search