IPBoost
Presented by
Casey De Vera, Solaiman Jawad
Introduction
Boosting is an important and by now standard technique in classification to combine several “low accuracy” learners, so-called base learners, into a “high accuracy” learner, a so-called boosted learner. Pioneered by the AdaBoost approach of [FS], in recent decades there has been extensive work on boosting procedures and analyses of their limitations.
In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows:
for [math]\displaystyle{ t= 1, \cdots, T }[/math] do the following:
- Train a learner [math]\displaystyle{ \mu_t }[/math] from a given class of base learners on the data distribution [math]\displaystyle{ \mathcal D_t }[/math]
- Evaluate the performance of [math]\displaystyle{ \mu_t }[/math] by computing its loss.
- Push the weight of the data distribution [math]\displaystyle{ \mathcal D_t }[/math] towards misclassified examples, leading to [math]\displaystyle{ \mathcal D_{t+1} }[/math]
Finally, the learners are combined by some form of voting (e.g., soft or hard voting, averaging, thresholding).
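To make this template concrete, the following is a minimal sketch of one classical instance: AdaBoost-style boosting with decision stumps, written in Python. The use of scikit-learn stumps, the exponential reweighting rule, and the function names boost/predict are illustrative assumptions and are not part of IPBoost.

```python
# Minimal AdaBoost-style sketch of the boosting template above.
# Assumptions: decision stumps from scikit-learn as base learners and
# exponential reweighting; this is NOT the IPBoost procedure itself.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=50):
    """X: (N, d) feature matrix, y: numpy array of labels in {-1, +1}."""
    N = len(y)
    D = np.full(N, 1.0 / N)                      # data distribution D_t
    learners, alphas = [], []
    for t in range(T):
        # Train a base learner mu_t on the current distribution D_t.
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        # Evaluate its performance by computing the weighted loss.
        eps = D[pred != y].sum()
        if eps >= 0.5:                           # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        # Push weight towards misclassified examples, yielding D_{t+1}.
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        learners.append(stump)
        alphas.append(alpha)
    # Combine the learners by weighted (hard) voting.
    def predict(X_new):
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(votes)
    return predict
```

The exponential reweighting of D above is exactly the coordinate-descent minimization of a convex (exponential) loss, which is what makes AdaBoost a convex potential booster in the sense discussed next.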
A close inspection of most boosting procedures reveals that they solve an underlying convex optimization problem over a convex loss function by means of coordinate gradient descent. Boosting schemes of this type are often referred to as convex potential boosters. These procedures can achieve exceptional performance on many data sets if the data is correctly labeled. In fact, in theory, provided the class of base learners is rich enough, a perfect strong learner with accuracy 1 can be constructed; however, such a learner need not generalize well. Boosted learners can also produce decision boundaries that are far more complicated than those of the base learners. Here is an example from Paul van der Laken’s blog (extreme gradient boosting gif by Ryan Holbrook): data is generated online according to some process whose optimal decision boundary is represented by the dotted line, and XGBoost is used to learn a classifier.
Recently, non-convex optimization approaches for solving machine learning problems have gained significant attention. This paper explores non-convex boosting in classification by means of integer programming and demonstrates the real-world practicability of the approach while circumventing the shortcomings of convex boosting approaches. The paper reports results that are comparable to or better than the current state of the art.
Motivation
In reality, we usually face unclean data and so-called label noise, where some percentage of the classification labels may be corrupted. We would like to construct strong learners for such data as well. However, if we revisit the general boosting template from above, we might suspect that we run into trouble as soon as a certain fraction of training examples is mislabeled: these examples cannot be classified correctly, so the procedure shifts more and more weight towards them. This eventually leads to a strong learner that perfectly predicts the (flawed) training data but no longer generalizes well. This intuition was formalized by [LS], who construct a “hard” training data distribution in which a small percentage of labels is randomly flipped. This label noise then leads to a significant reduction in the performance of these boosted learners; see the tables below. The more technical reason for this problem is the convexity of the loss function minimized by the boosting procedure. One can of course resort to various “tricks” such as early stopping, but these do not solve the fundamental problem.
IPBoost: Boosting via Integer Programming
Integer Program Formulation
Let [math]\displaystyle{ (x_1,y_1),\cdots, (x_N,y_N) }[/math] be the training set with points [math]\displaystyle{ x_i \in \mathbb{R}^d }[/math] and two-class labels [math]\displaystyle{ y_i \in \{\pm 1\} }[/math]. Furthermore, let
- [math]\displaystyle{ \Omega :=\{h_1, \cdots, h_L: \mathbb{R}^d \rightarrow \{\pm 1\}\} }[/math] be a class of base learners and [math]\displaystyle{ \rho \ge 0 }[/math] a given margin,
- [math]\displaystyle{ \eta_{ij} }[/math] be an error function measuring whether base learner [math]\displaystyle{ h_j }[/math] classifies example [math]\displaystyle{ x_i }[/math] correctly, e.g., [math]\displaystyle{ \eta_{ij} = +1 }[/math] if [math]\displaystyle{ h_j(x_i) = y_i }[/math] and [math]\displaystyle{ \eta_{ij} = -1 }[/math] otherwise.
Our boosting model is captured by the following integer programming problem, which we call the primal problem. The binary variable [math]\displaystyle{ z_i }[/math] indicates whether example [math]\displaystyle{ x_i }[/math] is allowed to miss the margin [math]\displaystyle{ \rho }[/math], so the objective minimizes the number of such margin violations:
$$ \begin{align*} \min\ &\sum_{i=1}^N z_i \\ \text{s.t. } &\sum_{j=1}^L \eta_{ij}\lambda_j+(1+\rho)z_i \ge \rho \quad \forall i=1,\cdots, N \\ &\sum_{j=1}^L \lambda_j=1, \ \lambda \ge 0,\\ &z\in \{0,1\}^N. \end{align*}$$
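As an illustration of this formulation, here is a minimal sketch that builds and solves the primal IP for a fixed, finite pool of base learners using the PuLP modeling library and its default CBC solver. The library choice and the function name solve_primal are assumptions made for illustration; this is not the paper's implementation.

```python
# Sketch of the primal IP for a fixed pool of base learners, using PuLP
# (assumed library choice). eta[i][j] = +1 if learner j classifies example i
# correctly and -1 otherwise; rho is the target margin.
import pulp

def solve_primal(eta, rho):
    N, L = len(eta), len(eta[0])
    prob = pulp.LpProblem("IPBoost_primal", pulp.LpMinimize)
    lam = [pulp.LpVariable(f"lam_{j}", lowBound=0) for j in range(L)]
    z = [pulp.LpVariable(f"z_{i}", cat=pulp.LpBinary) for i in range(N)]

    prob += pulp.lpSum(z)                        # minimize the number of margin violations
    for i in range(N):
        # If z_i = 0, the combined learner must reach margin rho on example i;
        # if z_i = 1, the constraint is satisfied trivially.
        prob += pulp.lpSum(eta[i][j] * lam[j] for j in range(L)) + (1 + rho) * z[i] >= rho
    prob += pulp.lpSum(lam) == 1                 # weights form a convex combination

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [v.varValue for v in lam], [v.varValue for v in z]
```

Solving the IP directly like this is only practical when the pool of learners is small and fixed; the branch-and-price procedure described below instead generates learners (columns) on the fly rather than enumerating all of [math]\displaystyle{ \Omega }[/math].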
The restricted dual problem (the dual of the LP relaxation over the current set of base learners) is:
$$ \begin{align*}\max\ &\rho \sum^{N}_{i=1}w_i + v - \sum^{N}_{i=1}u_i \\ \text{s.t. } &\sum_{i=1}^N \eta_{ij}w_i+ v \le 0 \quad \forall j \in [L] \\ &(1+\rho)w_i - u_i \le 1 \quad \forall i \in [N] \\ &w \ge 0,\ u \ge 0,\ v\ \text{free}\end{align*}$$
Furthermore, a pricing problem is used to determine, for a given optimal dual solution of the restricted problem, whether additional base learners (columns) need to be added to the primal. A learner [math]\displaystyle{ h_j }[/math] not yet in the restricted problem should be added whenever it violates the corresponding dual constraint, i.e., whenever
$$ \sum_{i=1}^N \eta_{ij}w_i^* + v^* > 0. $$
If no such learner exists, the current solution is already optimal over the entire class [math]\displaystyle{ \Omega }[/math].
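For concreteness, the pricing step can be sketched as follows, assuming the class [math]\displaystyle{ \Omega }[/math] consists of axis-aligned decision stumps (the learner class and the function name price_stump are assumptions for illustration, not taken from the paper). The candidate with the largest positive value of [math]\displaystyle{ \sum_{i=1}^N \eta_{ij}w_i^* + v^* }[/math] is returned; if none is positive, pricing fails and column generation stops.

```python
# Sketch of the pricing problem for decision-stump base learners (an assumed
# learner class). w_star, v_star are the optimal dual values of the restricted problem.
import numpy as np

def price_stump(X, y, w_star, v_star):
    """Return (feature, threshold, sign) of the stump with the largest positive
    reduced cost, or None if no stump prices out."""
    best, best_val = None, 0.0
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            for sign in (+1, -1):
                pred = np.where(X[:, d] <= thr, sign, -sign)
                eta_j = np.where(pred == y, 1, -1)        # correctness of candidate learner j
                val = float(eta_j @ w_star) + v_star      # left-hand side of the pricing condition
                if val > best_val + 1e-9:
                    best, best_val = (d, thr, sign), val
    return best                                           # None => no learner violates its dual constraint
```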
Solution using a branch-and-price procedure
Algorithm
Input: Training data [math]\displaystyle{ D = \{(x_i, y_i) | i ∈ I\} ⊆ \mathbb{R}^d × \{±1\} }[/math], class of base learners [math]\displaystyle{ Ω }[/math], margin [math]\displaystyle{ \rho }[/math]
Output: Boosted learner [math]\displaystyle{ \sum_{j∈L^∗}h_jλ_j^* }[/math] with base learners [math]\displaystyle{ h_j }[/math] and weights [math]\displaystyle{ λ_j^* }[/math]
- [math]\displaystyle{ T ← \{([0, 1]^N, \emptyset)\} }[/math] // set of local bounds and learners for open subproblems
- [math]\displaystyle{ U ← \infty, L^∗ ← \emptyset }[/math] // upper bound on the optimal objective and incumbent set of learners
- while [math]\displaystyle{ \ T \neq \emptyset }[/math] do
  - Choose and remove [math]\displaystyle{ (B,L) }[/math] from [math]\displaystyle{ T }[/math]
  - repeat
    - Solve the primal IP using the local bounds on [math]\displaystyle{ z }[/math] in [math]\displaystyle{ B }[/math], obtaining the optimal dual solution [math]\displaystyle{ (w^∗, v^∗, u^∗) }[/math]
    - Find a learner [math]\displaystyle{ h_j ∈ Ω }[/math] satisfying the pricing condition. // Solve pricing problem.
  - until no such [math]\displaystyle{ h_j }[/math] is found
  - Let [math]\displaystyle{ (\widetilde{λ} , \widetilde{z}) }[/math] be the final solution of the primal IP with base learners [math]\displaystyle{ \widetilde{L} = \{j | \widetilde{λ}_j \gt 0\} }[/math]
  - if [math]\displaystyle{ \widetilde{z} ∈ \mathbb{Z}^N }[/math] and [math]\displaystyle{ \sum^{N}_{i=1}\widetilde{z}_i \lt U }[/math] then
    - [math]\displaystyle{ U ← \sum^{N}_{i=1}\widetilde{z}_i, L^∗ ← \widetilde{L}, λ^∗ ← \widetilde{\lambda} }[/math] // Update best solution.
  - else
    - Choose [math]\displaystyle{ i ∈ [N] }[/math] with [math]\displaystyle{ \widetilde{z}_i \notin \mathbb{Z} }[/math]
    - Set [math]\displaystyle{ B_0 ← B ∩ \{z_i ≤ 0\}, B_1 ← B ∩ \{z_i ≥ 1\} }[/math]
    - Add [math]\displaystyle{ (B_0,\widetilde{L}), (B_1,\widetilde{L}) }[/math] to [math]\displaystyle{ T }[/math]. // Create new branching nodes.
  - end if
- end while
- Optionally sparsify the final solution [math]\displaystyle{ L^* }[/math]
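The following is a simplified sketch of the column-generation inner loop (the repeat/until block above) at a single node, ignoring branching on fractional [math]\displaystyle{ z_i }[/math]. It solves the LP relaxation of the restricted primal with PuLP, reads off the dual values, and calls the hypothetical price_stump routine from the earlier sketch; all of these choices (PuLP/CBC, stump learners, function names) are assumptions for illustration only.

```python
# Simplified column-generation loop at one branch-and-bound node (no branching).
# Assumptions: PuLP/CBC as the LP solver, decision stumps as base learners,
# and the hypothetical price_stump() helper from the pricing sketch above.
import numpy as np
import pulp

def stump_predict(X, stump):
    d, thr, sign = stump
    return np.where(X[:, d] <= thr, sign, -sign)

def column_generation(X, y, rho, initial_stumps, max_rounds=50):
    stumps = list(initial_stumps)        # must be non-empty so sum(lam) = 1 is feasible
    for _ in range(max_rounds):
        N, L = len(y), len(stumps)
        preds = np.column_stack([stump_predict(X, s) for s in stumps])   # (N, L) predictions
        eta = [[1 if preds[i, j] == y[i] else -1 for j in range(L)] for i in range(N)]
        prob = pulp.LpProblem("restricted_LP", pulp.LpMinimize)
        lam = [pulp.LpVariable(f"lam_{j}", lowBound=0) for j in range(L)]
        z = [pulp.LpVariable(f"z_{i}", lowBound=0, upBound=1) for i in range(N)]  # LP relaxation of z
        prob += pulp.lpSum(z)
        for i in range(N):
            prob += (pulp.lpSum(eta[i][j] * lam[j] for j in range(L))
                     + (1 + rho) * z[i] >= rho), f"margin_{i}"
        prob += pulp.lpSum(lam) == 1, "convexity"
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        # Dual values of the margin and convexity constraints (requires an LP
        # solver that reports duals; sign conventions may vary across solvers).
        w_star = np.array([prob.constraints[f"margin_{i}"].pi for i in range(N)])
        v_star = prob.constraints["convexity"].pi
        new_stump = price_stump(X, y, w_star, v_star)      # pricing problem
        if new_stump is None:                              # no learner prices out: stop
            break
        stumps.append(new_stump)
    return stumps, [v.varValue for v in lam]
```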
Results and Performance
References
- Pfetsch, M. E., & Pokutta, S. (2020). IPBoost--Non-Convex Boosting via Integer Programming. arXiv preprint arXiv:2002.04679.
- Freund, Y., & Schapire, R. E. (1995, March). A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg.