statwiki - User contributions [US]

XGBoost: A Scalable Tree Boosting System

2018-11-28T22:07:35Z

Q39zhao: /* 3.3 Weighted Quantile Sketch */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formally, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

where x's are the feature values of each data point, and h's are the weights of the corresponding x's.

We can use the following function to rank:

<math>r_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The output of this function provides a general idea on what is the proportion of data who has k-th feature value less than the k-th feature value of data point z. The objective is to search for split points

<math>{s_{k1}, s_{k2}, …, s_{kl}}</math>

such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon, s_{k1} = min (x_{ik}), s_{kl} = max (x_{ik}), \forall i</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-28T22:07:04Z

Q39zhao: /* 3.3 Weighted Quantile Sketch */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formally, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

where x's are the feature values of each data point, and h's are the weights of the corresponding x's.

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The output of this function provides a general idea on what is the proportion of data who has k-th feature value less than the k-th feature value of data point z. The objective is to search for split points

<math>{s_{k1}, s_{k2}, …, s_{kl}}</math>

such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon, s_{k1} = min (x_{ik}), s_{kl} = max (x_{ik}), \forall i</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-28T22:06:33Z

Q39zhao: /* 3.3 Weighted Quantile Sketch */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formally, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

where x's are the feature values of each data point, and h's are the weights of the corresponding x's.

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The output of this function provides a general idea on what is the proportion of data who has k-th feature value less than the k-th feature value of data point z. The objective is to search for split points

<math>{s_{k1}, s_{k2}, …, s_{kl}}</math>

such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon, s_{k1} = min (x_{ik}), s_{kl} = max (x_{ik}), /forall i</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-28T22:05:45Z

Q39zhao: /* 3.3 Weighted Quantile Sketch */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formally, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

where x's are the feature values of each data point, and h's are the weights of the corresponding x's.

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The output of this function provides a general idea on what is the proportion of data who has k-th feature value less than the k-th feature value of data point z. The objective is to search for split points

<math>{s_{k1}, s_{k2}, …, s_{kl}}</math>

such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon, s_{k1} = min (x_{ik}), s_{kl} = max (x_{ik})</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-28T22:04:59Z

Q39zhao: /* 3.3 Weighted Quantile Sketch */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formally, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

where x's are the feature values of each data point, and h's are the weights of the corresponding x's.

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The output of this function provides a general idea on what is the proportion of data who has k-th feature value less than the k-th feature value of data point z. The objective is to search for split points

<math>{s_{k1}, s_{k2}, …, s_{kl}}</math>

such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon, s_k1 = min {x_{ik}}, s_kl = max {x_{ik}}</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-28T22:04:12Z

Q39zhao: /* 3.3 Weighted Quantile Sketch */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formally, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

where x's are the feature values of each data point, and h's are the weights of the corresponding x's.

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The output of this function provides a general idea on what is the proportion of data who has k-th feature value less than the k-th feature value of data point z. The objective is to search for split points

<math>{s_{k1}, s_{k2}, …, s_{kl}}</math>

such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon, s_k1 = min x_ik, s_kl = max x_ik</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-28T22:02:52Z

Q39zhao: /* 3.3 Weighted Quantile Sketch */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formally, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

where x's are the feature values of each data point, and h's are the weights of the corresponding x's.

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The output of this function provides a general idea on what is the proportion of data who has k-th feature value less than the k-th feature value of data point z. The objective is to search for split points

<math>{s_{k1}, s_{k2}, …, s_{kl}}</math>

such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T23:56:39Z

Q39zhao: /* 4.1 Column Block for Parallel Learning */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_n )</math>

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs <math>O(Kd‖x‖_0 log_q )</math>

Approximate algorithm with block structure costs <math>O(Kd‖x‖_0+ ‖x‖_0 log_B )</math>

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T23:56:02Z

Q39zhao: /* 4.1 Column Block for Parallel Learning */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T23:55:07Z

Q39zhao: /* 4.1 Column Block for Parallel Learning */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T23:51:31Z

Q39zhao: /* 2.1 Regularized Learning Objective */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>Obj^{(t)}=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T23:50:56Z

Q39zhao: /* 2.1 Regularized Learning Objective */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

<math>=\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:55:09Z

Q39zhao: /* 6.3 Classification */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png|center]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:54:57Z

Q39zhao: /* 6.2 Dataset and Setup */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png|center]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:54:38Z

Q39zhao: /* 4.2 Cache-aware Access */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png|center]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png|center]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:53:09Z

Q39zhao: /* 3.4 Sparsity-aware Split Finding */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png|center]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png|center]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png|center]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:52:43Z

Q39zhao: /* 3.1 Exact Greedy Algorithm */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png|center]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:52:28Z

Q39zhao: /* 2.1 Regularized Learning Objective */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png|center]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:52:04Z

Q39zhao: /* 3.2 Approximate Algorithm */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png|center]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png|center]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:50:27Z

Q39zhao: /* Reference */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011

[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.

[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.

[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001

[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:50:07Z

Q39zhao:

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

== Reference ==

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011
[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.
[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.
[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001
[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:49:49Z

Q39zhao:

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

=== Reference ===

[1] R,Bekkerman, M. Bilenko, and J.Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011
[2] G. Ridgeway. Generalized Boosted Models: A guide to the gym package.
[3] C.Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23-581,2010.
[4] J.Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232,2001
[5] T.Zhang and R.Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:25:46Z

Q39zhao: /* 2.1 Regularized Learning Objective */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_i), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:23:48Z

Q39zhao: /* 4.1 Column Block for Parallel Learning */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:23:28Z

Q39zhao: /* 4.1 Column Block for Parallel Learning */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs <math>O(Kd‖x‖_0 {log_n} )</math>

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:23:02Z

Q39zhao: /* 4.1 Column Block for Parallel Learning */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs $O(Kd‖x‖_0 {log_n} )$

Tree boosting on block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_n )$

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs $O(Kd‖x‖_0 log_q )$

Approximate algorithm with block structure costs $O(Kd‖x‖_0+ ‖x‖_0 log_B )$

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:20:42Z

Q39zhao: /* 2.1 Regularized Learning Objective */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y_{i}^{(0)} = 0</math>

<math>\hat y_{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y_{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:19:43Z

Q39zhao: /* 2.1 Regularized Learning Objective */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\Delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:19:00Z

Q39zhao: /* 2.1 Regularized Learning Objective */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\Delta x+\frac{1}{2}f^{''}(x)\Delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T19:08:57Z

Q39zhao:

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== Introduction ==

In existing society, machine learning and data-driven methods are significant and they use widely. Tree boosting is considered to be one of the best machine learning methods, it provides us state-of-the-art results to solve a wide of range problems. We mainly introduce XGBoost, a scalable end-to-end tree boosting system in this page. We demonstrate the exact greedy algorithm and approximate algorithm. Further, we propose two important parts in the approximate algorithm: novel sparsity-aware algorithm and weighted quantile sketch. For comparison, we explore the reasons why XGBoost become important in many areas.

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:46:06Z

Q39zhao: /* 6.6 Distributed Experiment */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|center|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|center|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:45:45Z

Q39zhao: /* 6.5 Out-of-core Experiment */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|center|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:45:23Z

Q39zhao: /* 6.4 Learning to Rank */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|center]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:45:01Z

Q39zhao: /* 6.4 Learning to Rank */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|left|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png|right]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:44:40Z

Q39zhao: /* 6.4 Learning to Rank */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|center|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

File:machine.png

2018-11-22T18:42:30Z

Q39zhao:

File:criteo.png

2018-11-22T18:41:54Z

Q39zhao:

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:41:37Z

Q39zhao: /* 6.6 Distributed Experiment */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|left|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]
[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|thumb|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|thumb|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

File:Training.png

2018-11-22T18:41:09Z

Q39zhao:

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:40:52Z

Q39zhao: /* 6.5 Out-of-core Experiment */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|left|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]
[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|thumb|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:40:27Z

Q39zhao: /* 6.4 Learning to Rank */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|left|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]
[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:39:51Z

Q39zhao: /* 6.4 Learning to Rank */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|left|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:39:15Z

Q39zhao: /* 6.4 Learning to Rank */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:38:43Z

Q39zhao: /* 6.5 Out-of-core Experiment */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|left|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|200px|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

File:Threads.png

2018-11-22T18:37:55Z

Q39zhao:

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:37:37Z

Q39zhao: /* 6.4 Learning to Rank */

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File:Threads.png|200px|thumb|left|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.

File:Table 4.png

2018-11-22T18:35:15Z

Q39zhao:

File:Table 3.png

2018-11-22T18:34:55Z

Q39zhao:

File:Table 2.png

2018-11-22T18:34:33Z

Q39zhao:

XGBoost: A Scalable Tree Boosting System

2018-11-22T18:34:00Z

Q39zhao:

== Presented by ==
*Qianying Zhao
*Hui Huang
*Lingyun Yi
*Jiayue Zhang
*Siao Chen
*Rongrong Su
*Gezhou Zhang
*Meiyu Zhou

== 2 Tree Boosting In A Nutshell ==

=== 2.1 Regularized Learning Objective ===

1. Regression Decision Tree (also known as classification and regression tree):
* Decision rules are the same as in decision tree
* Contains one score in each leaf value

[[File:cart.PNG]]
[[File:tree_ensemble_model.PNG]]

2. Model and Parameter

Model: Assuming there are K trees
<math>\hat y_i = \sum^K_{k=1} f_k(x_I), f_k \in Ƒ</math>

Object: <math>Obj = \sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k)</math>

where <math>\sum^n_{i=1}l(y_i,\hat y_i)</math> is training loss, <math>\sum_{k=1}^K \omega(f_k)</math> is complexity of Trees

So the target function that needed to optimize is:<math>\sum_{i=1}^n l(y_i,\hat y_i)+\sum^K_{k=1}\omega(f_k), f_k \in Ƒ</math>, where <math>\omega(f) = \gamma T+\frac{1}{2}\lambda||w||^2</math>

For example:

[[File:leave.png]]

Let's look at <math>\hat y_i</math>

<math>\hat y{i}^{(0)} = 0</math>

<math>\hat y{i}^{(1)} = f_1(x_i)=\hat y_i^{(0)}+f_1(x_i)</math>

<math>\hat y{i}^{(2)} = f_1(x_i) + f_2(x_i)=\hat y_i^{(1)}+f_2(x_i)</math>

...

<math>\hat y{i}^{(t)} = \sum^t_{i=1}f_k(x_i)=\hat y_i^{(t-1)}+f_t(x_i)</math>

So <math>Obj^{(t)} = \sum_{i=1}^n l(y_i,\hat y_i^{(t)})+\sum^t_{i=1}\omega(f_i)</math>

=<math>\sum_{i=1}^n l(y_i,\hat y_i^{(t-1)}+f_t(x_i))+omega(f_t)+constant</math>

Take Taylor Expansion of the objective

<math>f(x+\delta x) \simeq f(x)+f^{'}(x)\delta x+\frac{1}{2}f^{''}(x)\delta x^2</math>

then

<math>Obj^{(t)} = \sum^n_{i=1}[l(y_i,\hat y_i^{(t-1)})+g_if_t(x_i)+\frac{1}{2}h_i(x_i)]+\omega(f_t)+constant</math>

where <math>g_i =ə_{(\hat y_i)^{(t-1)}}(\hat y_i^{(t-1)}-y_i)^2 = 2(\hat y_i^{(t-1)}-y_i)h_i = ə^2_{(\hat y_i)^{(t-1)}}(\hat y_i^{t-1)}-y_i)^2 =2</math>

Define <math>I_j={i|q(x_i)=j}</math> as the instance set of leaf j and <math>f_t(x_i)=w_j</math>. We can rewrite target function as follows

<math>Obj^{(t)} = \sum^{T}_{j=1}[(\sum_{i\in I_j} g_i)w_{j}+\frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2]+\gamma T</math>

The optimal weight <math>w^*_j</math> of node j is <math>w_j^*=\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}</math>

The loss reduction after the split is given by

<math>Obj_{split}=\frac{1}{2}[\frac{(\sum_{i \in I_l} g_i)^2}{\sum_{i \in I_l} h_i+\lambda}+\frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i+\lambda}-\frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i+\lambda}]-\lambda</math>

== 3 Split Finding Algorithms ==

=== 3.1 Exact Greedy Algorithm ===

Exact greedy algorithm is a split finding algorithm enumerates over all the possible splits on all the features. However, it is impossible to efficiently do so when the data does not fit entirely into memory.

The algorithm is following:

[[File: Algorithm_1.png]]

=== 3.2 Approximate Algorithm ===

Due to limit computational memory and efficiency, the paper gives an approximate algorithm. The algorithm first proposes candidate splitting points according to percentiles of feature distribution, then it maps the continuous features into buckets split by these candidate points, aggregates the statistics and finds the best solution among proposals based on the aggregated statistics.

[[File:Algorithm_2.png]]

The global variant proposes all the candidate splits during the initial phase of tree construction, and uses the same proposals for split finding at all levels. The local variant re-proposes after each split.

[[File:iterations.png]]

From the figure above, the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level.

=== 3.3 Weighted Quantile Sketch ===

Data set splitting is one of the most important phase in the approximate algorithm. The most common approach is to split by feature’s percentile in order to obtain an uniform distribution of the selected data.

Formal, if we have the set

<math>D_k={(x_{1k},h_1),(x_{2k},h_2),...,(x_{nk},h_n)}</math>

We can use the following function to rank:

<math>R_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k, x<z} h,</math>

The objective is to search for split points {s_{k1}, s_{k2}, …, s_{kl}} such that

<math>|r_k(s_{k,j}) – r_k(s_{k,j+1})| < \epsilon,</math>

Where <math>\epsilon</math> is an approximation factor. In general, it should approximately have <math>\frac{1}{\epsilon}</math> splitting points.

=== 3.4 Sparsity-aware Split Finding ===

In real life, the input x may often be quite sparse. Possible causes are:

1. Data set contains missing values

2. Large amount of zero entries

3. Artifacts of feature engineering (ex. One-hot encoding)

In order to solve the sparsity behavior in the data, it is proposed to create a default direction in each tree node, as shown below:

[[File: figure_4.png]]

When a value in a tree node is missing, we can use the following algorithm to calculate the optimal direction to proceed:

[[File: figure_5.png]]

This algorithm is also applicable to the situation where user can set a limit on the accepted value, and neglect the out-of-range value when calculating the score.

The figure below shows the result of the comparison between a basic implementation and the sparsity aware algorithm on a Allstate-10K dataset.

[[File: Algorithm3.png]]

We can see that the sparsity aware algorithm performs 50 times better than the simple implementation.

== 4 System Design ==

=== 4.1 Column Block for Parallel Learning ===

Generally, the most time-consuming part of tree learning is to get a sorted data. In XGBoost, data is stored in in-memory units, called Block.

[[File: Figure_6.png]]

Each column represents a feature and is sorted by the feature value.

In exact greedy algorithm, the entire dataset is stored in a single block. So, a single scan over the block will provide us the statistics needed for splitting.

d = maximum depth of the tree

K = total number of trees

Original spase aware algorithm costs

Tree boosting on block structure costs

For Approximate algorithm, the dataset can be stored in multiple blocks. Each block contains a subset of tuples in the dataset. The blocks are also in sorted order, so for the quantile finding step, a linear scan over the sorted column is enough.

q = number of proposal candidates in the dataset

B = maximum number of rows in each block

Original approximate algorithm with binary search costs

Approximate algorithm with block structure costs

=== 4.2 Cache-aware Access ===

A naïve implementation of split enumeration brings in immediate read/write dependency between the accumulation and the non-continuous memory fetch operation.

[[File: figure_8.png]]

To overcome this issue in exact greedy algorithm, a cache-aware prefetching algorithm with an internal buffer allocated for fetching the gradient statistics is proposed.
For approximate algorithm, since multiple blocks are used for storing the dataset, choosing the correct block size is the key.

[[File: figure_9.png]]

An overly small block size results in small workload and inefficient parallelization

· An overly large block size results in cache misses as the gradient statistics do not fit into the CPU cache

Through experiment, block size of 216 balances the cache property and parallelization.

For large size dataset, data might not be fitted into main memory and has to be stored in disk space. So, enabling out-of-core computation is important for achieving scalable learning. It is ideal to have computation run in concurrence with disk reading to reduce the overhead. Two major techniques used to improve the out-of-core computation are shown below:

1) Block Compression

* Compress feature value in each block

* Decompress feature value through the thread

* Compress ratio: 26~29%

2) Block Sharding

* When multiple disks are available

* Shard the dataset onto multiple disks in an alternative manner

* Each disk has a pre-fetcher thread to fetch data into an in-memory buffer

* Training-thread alternatively reads data from each buffer

== 6 End To End Evaluations ==

=== 6.1 System Implementation ===

The system implementation of XGBoost is a portable and reusable open source package. Not only XGBoost supports various weighted classification and objective functions(rank, user-defined), but also supports popular languages(python, R, Julia), data science pipelines (scikit-learn), big-data stacks(Flink, Spark), cloudplatform (Alibaba’s Tianchi8) and more.

=== 6.2 Dataset and Setup ===

Four datasets are used in performance evaluations. The first dataset Allstate insurance claim dataset9 that was used for predicting the likelihood of an insurance claim, evaluates the impact of sparsity-aware algorithm. The second dataset Higgs boson dataset10 that was produced from physics simulation events classifies whether an event corresponds to the Higgs boson. The third dataset is the Yahoo! learning for ranking documents by query relevance. The last dataset is the criteo terabyte click log dataset11 that was pre-processed as a tree-based model, evaluates the scaling property of the system in the out-of-core and the distributed settings.The first three datasets are used for the single machine parallel setting, and the last dataset was used for the distributed and out-of-core settings. Boosting trees have a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling.

[[File:Table_2.png]]

=== 6.3 Classification ===

Four groups of XGBoost performance evaluations are conducted by comparisons.
Compared with R’s GBM, the first evaluation sets XGBoost running using the exact greedy algorithm fairly on Higgs-1M data and have scikit-learn finish running along the side, and shows that XGBoost runs more than 10x faster than scikit-learn. R’s GBM greedily expands one branch of a tree fast but results in lower accuracy, while both scikit-learn and XGBoost learn a full tree. Column subsamples gives slightly worse performance possibly due to few important features in this dataset.

[[File: Table_3.png]]

=== 6.4 Learning to Rank ===

Group 2 evaluates XGBoost on the learning to rank problems by comparing against pGBRT. XGBoost runs the exact greedy algorithm and obviously runs faster. Subsampling columns not only reduces running time, but also gives a bit higher performance, empirically due to that the subsampling helps prevent overfitting.

[[File: Threads.png|Comparison between XGBoost and PG-BRT on Yahoo LTRC dataset]]

[[File: Table_4.png]]

=== 6.5 Out-of-core Experiment ===

Group 3 evaluates XGBoost system in the out-of-core setting on the criteo data on one AWS c3.8xlarge machine (32 vcores, two 320 GB SSD, 60 GB RAM). Compression helps to speed up computation by factor of 3, and sharding into 2 disks further gives 2x speedup. It’s observed with a less dramatic transition point when the system runs out of file cache due to larger disk throughput and better utilization of computation resources.

[[File: Training.png|Comparison of out-of-core methods on different subsets of crate data. The missing data points are due to out of disk space.]]

=== 6.6 Distributed Experiment ===

Group 4 evaluates the XGBoost system in the distributed setting by setting up a YARN cluster on EC2 with m3.2x large machines with 8 virtual cores each, 30GB RAM, two 80GB SSD local disks, and dataset storage on AWS S3. Comparing against Spark MLLib and H2O 12, in-memory analytics frameworks that need to store the data in RAM, XGBoost can switch to out-of-core setting when it runs out of memory. With the limited computing resources, XGBoost runs faster than the baseline systems, and takes advantage of out-of-core computing and smoothly scale to all 1.7 billion instances, whereas the baseline systems are only able to handle subset of the data. XGBoost’s performance scales linearly as adding more machines, and has large potential to handle even larger data as it managed to handle 1.7 billion data with only 4 machines.

[[File:criteo.png|Comparison of different distributed systems on 32 EC2 nodes for 10 iterations on different subset of crate data]]

[[File: machine.png|Scaling of XGboost with different number of machines on criteo full 1.7 billion dataset. Using more machines results in more file cache and makes the system run faster, causing the trend to be slightly super linear.]]

== Conclusion ==

The purpose of this paper is discussing how a scalable end-to-end tree boosting system, which is XGBoost, effective used. It helps us to achieve state-of-the-art results on variety experiment challenges. We use the exact greedy algorithm in order to find the best split in tree learning. Be more effective, an approximate algorithm is needed. Therefore we introduce sparsity-aware algorithm and weighted quantile sketch for approximate algorithm. Further, we gain an insight into XGBoost system on column block, cache-aware access patterns, and explain why XGBoost scales is better and wider use than other systems in the existing data statistic.