IPBoost: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 3: Line 3:


== Introduction ==  
== Introduction ==  
Boosting is an important and by now standard technique in classification to combine several “low accuracy” learners, so-called base learners, into a “high accuracy” learner, a so-called boosted learner. Pioneered by the AdaBoost approach of [FS], in recent decades there has been extensive work on boosting procedures and analyses of their limitations. A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy 1, however clearly such a learner might not necessarily generalize well.  Boosted learners can generate quite some complicated decision boundaries, much more complicated than that of the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line and XGBoost was used learn a classifier:
Boosting is an important and by now standard technique in classification to combine several “low accuracy” learners, so-called base learners, into a “high accuracy” learner, a so-called boosted learner. Pioneered by the AdaBoost approach of [FS], in recent decades there has been extensive work on boosting procedures and analyses of their limitations.
 
 
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:
 
for <math> t= 1, \cdots, T </math> do the following:
 
# Push weight of the data distribution <math> \mathcal D_t</math> towards misclassified examples leading to <math> \mathcal D_{t+1}</math>
 
# Evaluate performance of <math> \mu_t</math> by computing its loss.
 
# Train a learner <math> \mu_t</math> from a given class of base learners on the data distribution <math> \mathcal D_t</math>
 
Finally, the learners are combined by some form of voting (e.g., soft or hard voting, averaging, thresholding).[[File:boosting.gif|200px|thumb|right]]  A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as '''convex potential boosters'''. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy 1, however clearly such a learner might not necessarily generalize well.  Boosted learners can generate quite some complicated decision boundaries, much more complicated than that of the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line and XGBoost was used learn a classifier:





Revision as of 12:59, 16 November 2020

Presented by

Casey De Vera, Solaiman Jawad

Introduction

Boosting is an important and by now standard technique in classification to combine several “low accuracy” learners, so-called base learners, into a “high accuracy” learner, a so-called boosted learner. Pioneered by the AdaBoost approach of [FS], in recent decades there has been extensive work on boosting procedures and analyses of their limitations.


In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:

for [math]\displaystyle{ t= 1, \cdots, T }[/math] do the following:

  1. Push weight of the data distribution [math]\displaystyle{ \mathcal D_t }[/math] towards misclassified examples leading to [math]\displaystyle{ \mathcal D_{t+1} }[/math]
  1. Evaluate performance of [math]\displaystyle{ \mu_t }[/math] by computing its loss.
  1. Train a learner [math]\displaystyle{ \mu_t }[/math] from a given class of base learners on the data distribution [math]\displaystyle{ \mathcal D_t }[/math]

Finally, the learners are combined by some form of voting (e.g., soft or hard voting, averaging, thresholding).

A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner can be constructed that has accuracy 1, however clearly such a learner might not necessarily generalize well. Boosted learners can generate quite some complicated decision boundaries, much more complicated than that of the base learners. Here is an example from Paul van der Laken’s blog / Extreme gradient boosting gif by Ryan Holbrook. Here data is generated online according to some process with optimal decision boundary represented by the dotted line and XGBoost was used learn a classifier:


Recently non-convex optimization approaches for solving machine learning problems have gained significant attention. In this paper we explore non-convex boosting in classification by means of integer programming and demonstrate real-world practicability of the approach while circumventing shortcomings of convex boosting approaches. The paper reports results that are comparable to or better than the current state-of-the-art.

Motivation

In reality we usually face unclean data and so-called label noise, where some percentage of the classification labels might be corrupted. We would also like to construct strong learners for such data. However if we revisit the general boosting template from above, then we might suspect that we run into trouble as soon as a certain fraction of training examples is misclassified: in this case these examples cannot be correctly classified and the procedure shifts more and more weight towards these bad examples. This eventually leads to a strong learner, that perfectly predicts the (flawed) training data, however that does not generalize well anymore. This intuition has been formalized by [LS] who construct a “hard” training data distribution, where a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in performance of these boosted learners; see tables below. The more technical reason for this problem is actually the convexity of the loss function that is minimized by the boosting procedure. Clearly, one can use all types of “tricks” such as early stopping but at the end of the day this is not solving the fundamental problem.

IPBoost: Boosting via Integer Programming

Integer Program Formulation

Let [math]\displaystyle{ (x_1,y_1),\cdots, (x_N,y_N) }[/math] be the training set with points [math]\displaystyle{ x_i \in \mathbb{R}^d }[/math] and two-class labels [math]\displaystyle{ y_i \in \{\pm 1\} }[/math]

  • class of base learners: [math]\displaystyle{ \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} }[/math] and [math]\displaystyle{ \rho \ge 0 }[/math] be given.
  • error function [math]\displaystyle{ \eta }[/math]

Our boosting model is captured by the integer programming problem. We can call this our primal problem:

$$ \begin{align*} \min &\sum_{i=1}^N z_i \\ s.t. &\sum_{j=1}^L \eta_{ij}\lambda_k+(1+\rho)z_i \ge \rho \ \ \ \forall i=1,\cdots, N \\ &\sum_{j=1}^L \lambda_j=1, \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$


Its restricted dual problem is:

$$ \begin{align*}\max \rho &\sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i \\ s.t. &\sum_{i=1}^N \eta_{ij}w_k+ v \le 0 \ \ \ \forall j \in L \\ &(1+\rho)w_i - u_i \le 1 \ \ \ \forall i \in [N] \\ &w \ge 0, u \ge 0, v\ free\end{align*}$$


Furthermore, there is a pricing problem used to determine, for every supposed optimal solution of the dual, whether the solution is actually optimal, or whether further constraints need to be added into the primal solution. This pricing problem can be expressed as follows:

$$ \sum_{i=1}^N \eta_{ij}w_k^* + v^* > 0 $$

Solution using branch and price procedure

Algorithm

[math]\displaystyle{ D = \{(x_i, y_i) | i ∈ I\} ⊆ R^d × \{±1\} }[/math], class of base learners [math]\displaystyle{ Ω }[/math], margin [math]\displaystyle{ \rho }[/math]
Output: Boosted learner [math]\displaystyle{ \sum_{j∈L^∗}h_jλ_j^* }[/math] with base learners [math]\displaystyle{ h_j }[/math] and weights [math]\displaystyle{ λ_j^* }[/math]

  1. [math]\displaystyle{ T ← \{([0, 1]^N, \emptyset)\} }[/math] // set of local bounds and learners for open subproblems
  2. [math]\displaystyle{ U ← \infty, L^∗ ← \emptyset }[/math] // Upper bound on optimal objective
  3. while [math]\displaystyle{ \ T \neq \emptyset }[/math] do
  4. Choose and remove [math]\displaystyle{ (B,L) }[/math] from [math]\displaystyle{ T }[/math]
  5. repeat
  6. Solve the primal IP using the local bounds on [math]\displaystyle{ z }[/math] in [math]\displaystyle{ B }[/math] with optimal dual solution [math]\displaystyle{ (w^∗, v^∗, u^∗) }[/math]
  7. Find learner [math]\displaystyle{ h_j ∈ Ω }[/math] satisfying the pricing problem. // Solve pricing problem.
  8. until [math]\displaystyle{ h_j }[/math] is not found
  9. Let [math]\displaystyle{ (\widetilde{λ} , \widetilde{z}) }[/math] be the final solution of the primal IP with base learners [math]\displaystyle{ \widetilde{L} = \{j | \widetilde{λ}_j \gt 0\} }[/math]
  10. if [math]\displaystyle{ \widetilde{z} ∈ \mathbb{Z}^N }[/math] and [math]\displaystyle{ \sum^{N}_{i=1}\widetilde{z}_i \lt U }[/math] then
  11. [math]\displaystyle{ U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} }[/math] // Update best solution.
  12. else
  13. Choose [math]\displaystyle{ i ∈ [N] }[/math] with [math]\displaystyle{ \widetilde{z}_i \notin Z }[/math]
  14. Set [math]\displaystyle{ B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} }[/math]
  15. Add [math]\displaystyle{ (B_0,\widetilde{L}), (B_1,\widetilde{L}) }[/math] to T . // Create new branching nodes.
  16. end if
  17. end while
  18. Optionally sparsify final solution [math]\displaystyle{ L^* }[/math]

Results and Performance

References

  • Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.
  • Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg. pdf