# XGBoost

## Presented by

• Chun Waan Loke
• Peter Chong
• Clarice Osmond
• Zhilong Li

## Tree Boosting In A Nutshell

The author reviews gradient tree boosting in this section.

### Regularized Learning Objective

The author first gives the prediction equation (1) and the regularized objective (2).

(1) The additive prediction model is given as
$\hat{y_i}=\sum_{k=1}^{K}f_k(x_i), f_k \in \mathcal{F}$
where each $f_k$ corresponds to an independent tree structure with its weights, and $\mathcal{F}$ is the "space of regression trees"

(2) The regularized objective to be minimized is given as
$\mathcal{L}=\sum_{i}l(\hat{y_i},y_i)+\sum_k \Omega(f_k)$ where $\Omega(f)=\gamma T+\frac{1}{2}\lambda ||\omega||^2$
The first term $l$ is a differentiable convex loss function that calculates the measures between $\hat{y}$ and $y$.
The second term $\Omega$ penalizes on the complexity of model (the $f_k$'s). For a given tree ($f_k$), $T$ is the number of leaves, and $\omega$ is the leaf weights.
The L1 regularization term $\gamma T$ encourages the tree to be small, and the L2 regularization term $\frac{1}{2}\lambda ||\omega||^2$ encourages the tree to have small weights.

Let $\hat{y_i}^{(t)}$ be the prediction of $y_i$ at the t-th iteration.
Since the prediction equation (1) in 2.1 is simply additive, we have ${\hat y_i}^{(t)}=\hat{y_{i-1}}^{(t)}+f_t(x_i)$
Then the regularized objective (2) in 2.1 becomes

## System Design

### Column Block for Parallel Learning

The most time consuming part of tree learning is to get the data into sorted order. To overcome this problem, XGBoost stores the data in blocks where each block performs the sorting operation in parallel. The results are then merged together.

1. For the exact greedy algorithm, XGBoost stores the entire dataset in a single block
• Original spase aware algorithm costs: $O(Kd∥x∥_0 log(n))$
• Tree boosting on block structure costs: $O(Kd∥x∥_0 + ∥x∥_0 log(n))$
• Hence, block structure helps to save an additional $log(n)$ factor
2. For approximate algorithms, XGBoost stores the dataset in multiple blocks
• Original algorithm with binary search costs: $O(Kd∥x∥_0 log(q))$
• With block structure costs: $O(Kd∥x∥_0 + ∥x∥_0 log(B))$
• Hence, block structure helps to save an additional $log(q)$ factor

K = total number of trees
d = maximum depth of the tree
$∥x∥_0$ = number of non-missing entries in the training data
n = number of rows in the dataset
q = number of proposal candidates in the dataset, usually between 32 to 100
B = maximum number of rows in each block

### Cache-aware Access

The proposed block structure optimizes the computation complexity but requires indirect fetches of gradient statistics by row index. XGBoost optimizes the process by using the following methods.

1. For the exact greedy algorithm, XGBoost uses a cache-aware prefetching algorithm
• It stores Gradient and Hessians in the cache to make calculations fast
• It runs twice as fast as the naive method when the dataset is large
2. For approximate algorithms, XGBoost chooses a specific block size
• Choosing a small block size results in inefficient parallelization
• Choosing a large block size results in cache misses
• Various choices of block size are compared and the results are shown in Figure 9
• XGBoost chose $2^{16}$ as the block size

### Blocks for Out-of-core Computation

When the dataset is too large for the cache and main memory, XGBoost utilizes disk spaces as well. Since reading and writing data to the disks are slow, XGBoost optimizes the process by using the following two methods.

1. Block Compression
• Blocks are compressed by columns
• Although decompressing takes time, it is still faster than reading from the disks
2. Block Sharding
• If multiple disks are available, data is split into those disks
• When the CPU needs data, all the disks can be read at the same time

## End To End Evaluations

### System Implementation

XGBoost is a portable and reusable open-source package. It supports:

• various weighted classification
• various objective functions (rank & user defined)
• popular languages (python, R, Julia)
• data science pipelines (scikit-learn)
• big data stacks (Flink, Spark)
• cloud platform (Tianchi8)

### Dataset and Setup

A summary of the 4 datasets used in the experiments is in table 2.

The first and last datasets were used specifically to evaluate the impact of sparsity-aware algorithm and the scaling property of the system in the out-of-core and the distributed settings respectively. In all of the experiments, common settings such as maximum depth equal to 8, shrinkage equal to 0.1 and no column subsampling are applied.

### Classification

A subset (n = 1M) of Higgs Boson dataset was used for classification. The results are shown in Table 3.

• R’s GBM creates a tree fast by using greedy approach that only expands one branch of a tree but with the cost of lower accuracy
• XGBoost and scikit-learn have better performance than R’s GBM
• XGBoost runs more than 10x faster than scikit-learn in learning a full tree
• Column subsamples give slightly worse performance possibly due to a few important features in this dataset.

### Learning to Rank

The results in comparing XGBoost with pGBRT are shown in Table 4.

• XGBoost runs exact greedy algorithm while pGBRT only supports an approximate algorithm
• XGBoost runs faster
• Subsampling column reduces running time and gives higher performance possibly due to preventing overfitting

### Out-of-core Experiment

An extremely large Criteo dataset was used to evaluate the out-of-core setting. The results are shown in the figure below.

• Compression speed up the computation by a factor of 3
• Sharding into 2 disks further gives 2x speed

### Distributed Experiment

YARN cluster on EC2 with m3.2xlarge machines was used to evaluate the distributed setting. XGBoost was compared with Spark MLLib and H20.

• XGBoost can switch to out-of-core setting when the memory runs out, while the baseline systems need to store the data in RAM
• XGBoost runs faster than the baseline systems
• The baseline systems are only able to handle a subset of the data with the given resources while XGBoost was able to scale smoothly to all 1.7B rows
• XGBoost’s performance scales linearly with more machines
• XGBoost is able to handle all 1.7B data with only 4 machines